Thursday, August 28, 2025


Integrating Code Insight into Reverse Engineering Workflows

More than two years have passed since we announced the launch of Code Insight at RSA 2023. Since then, we have applied this technology in different scenarios and extended it to new file formats (1, 2).

As we advance in the automated analysis of new files with Code Insight, we want to offer an alternative that enables the integration of this type of technology into the analysis of disassembled or decompiled code.


Audio version of this post, created with NotebookLM Deep Dive

To that end, we have created a new endpoint that receives a block of code and returns a description of its functionality, highlighting the aspects most relevant to malware analysts. The endpoint can be queried with successive code blocks, chaining previous analyses together with any modifications or corrections made by the analyst. This significantly reduces the reverse engineering workload by giving the analyst an assistant that pre-analyzes functions deemed interesting and accumulates knowledge as the analysis proceeds.

This endpoint can be integrated into any reverse engineering tool that processes disassembled or decompiled code. As an implementation example, the VirusTotal plugin for IDA Pro has been updated to support its use from the IDA interface. This offers a simple way to integrate relevant analyses into a notebook, allowing the analyst to keep responses that play a direct role in understanding how the code works.

Endpoint for reversed code queries

Using this new endpoint is quite simple—just make a request to the API as shown in the following example:

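import base64
import requests

API_URL = 'https://www.virustotal.com'
endpoint = 'api/v3/codeinsights/analyse-binary'
API_KEY = 'YOUR_API_KEY'  # placeholder: your VirusTotal API key

headers_apiv3 = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'x-apikey': API_KEY,
}

# Placeholder snippet: the text of the function to analyze, Base64-encoded
code = 'push ebp\nmov ebp, esp'
code_base64 = base64.b64encode(code.encode('utf-8')).decode('ascii')

payload = {
    'code': code_base64,
    'code_type': 'disassembled',  # or 'decompiled'
}

response = requests.post(f'{API_URL}/{endpoint}',
                         json={'data': payload},
                         headers=headers_apiv3)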


This Python code corresponds to a request to the endpoint located at ‘https://www.virustotal.com/api/v3/codeinsights/analyse-binary’, in which the code to be analyzed is included in the ‘payload’ variable as follows:

# 'text' below is a placeholder for a string value
payload = {
    'code': code_base64,
    'code_type': 'disassembled',  # or 'decompiled'
    'history': [
        {
            'request': code_base64,
            'response': {
                'summary': text,
                'description': text,
            },
        },
        {
            'request': code_base64,
            'response': {
                'summary': text,
                'description': text,
            },
        },
    ],
}


The request is divided into two parts: the first includes the code being analyzed (‘code’ and ‘code_type’), and the second includes previous requests—potentially reviewed by the analyst—that provide context for analyzing the queried code.

The request returns a general description of how the submitted code snippet works ("summary") and a second text that describes in more detail how that functionality is carried out ("description"). This lets the analyst quickly check whether the function contains any behavior of interest and then either review the execution steps or discard the function as irrelevant.
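
The request and response shapes described above can be sketched as follows. This is a minimal illustration rather than the plugin's code: `encode_code`, `build_payload`, and `extract_analysis` are hypothetical helper names, Base64 over UTF-8 text is an assumption about how `code_base64` is produced, and the location of "summary" and "description" in the JSON reply (here, under a top-level "data" key) is assumed and should be checked against a real response.

```python
import base64


def encode_code(code_text):
    """Base64-encode a code snippet (assumes UTF-8 text input)."""
    return base64.b64encode(code_text.encode("utf-8")).decode("ascii")


def build_payload(code_text, code_type, history=()):
    """Build the request body: the queried code plus prior analyses as context.

    'history' is expected to already be a sequence of
    {"request": ..., "response": {"summary": ..., "description": ...}} entries.
    """
    if code_type not in ("disassembled", "decompiled"):
        raise ValueError("code_type must be 'disassembled' or 'decompiled'")
    return {
        "code": encode_code(code_text),
        "code_type": code_type,
        "history": list(history),
    }


def extract_analysis(reply):
    """Return (summary, description) from a decoded JSON reply.

    Assumption: the fields sit under a top-level 'data' key; if that key
    is absent, the top level of the reply is used instead.
    """
    body = reply.get("data", reply)
    return body.get("summary", ""), body.get("description", "")
```

The resulting dictionary would be sent as `json={'data': payload}`, in the same way as the request example earlier in this section.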

New version of the VT-IDA Plugin for IDA Pro

Along with this new endpoint, we have updated the VirusTotal plugin to show how this new functionality can be integrated into the analyst's workflow.

This new functionality can be used as follows:
  1. The analyst selects a function from the disassembled or decompiled code to be analyzed.
  2. If the response provided by the endpoint is satisfactory and reveals an interesting function, they can click ‘Accept’ to include it in a list of selected functions, which we call the ‘CodeInsight Notebook’. They can also make modifications to the ‘Summary’ and ‘Description’ fields to correct errors or add information that helps put the code in context.
  3. With each new request sent to the endpoint, all previously stored functions are included—along with any modifications made by the analyst. This allows for more accurate analyses based on previously obtained and reviewed results.
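
The notebook workflow in the steps above can be modeled with a small data structure. The sketch below is an assumption about how such a notebook could be represented, not the plugin's implementation; the class and method names (`NotebookEntry`, `CodeInsightNotebook`, `accept`, `as_history`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NotebookEntry:
    code_base64: str   # the function text as it was sent to the endpoint
    summary: str       # possibly edited by the analyst
    description: str   # possibly edited by the analyst


@dataclass
class CodeInsightNotebook:
    entries: list = field(default_factory=list)

    def accept(self, code_base64, summary, description):
        """Store a reviewed analysis (the 'Accept' step)."""
        self.entries.append(NotebookEntry(code_base64, summary, description))

    def as_history(self):
        """Render all accepted entries in the 'history' shape of the request."""
        return [
            {
                "request": entry.code_base64,
                "response": {
                    "summary": entry.summary,
                    "description": entry.description,
                },
            }
            for entry in self.entries
        ]
```

Each new query would then attach `notebook.as_history()` as the "history" field, so analyst corrections travel with every later request.
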
Here’s how the new version of the plugin would look after a few iterations on a malware sample:



A practical example

Let's illustrate the benefits of the new plugin with a practical example. Imagine an analyst needs to analyze a malicious binary file to understand its function. This is typically a time-consuming and complex process, but with the help of Code Insight, their workflow becomes significantly more efficient:

  1. Targeted Analysis: The analyst selects a code block they suspect might be malicious and uses the endpoint to get an automated analysis.

    The code shown below implements an anti-disassembly technique aimed at generating disassembled code that hides malicious functionality through a hidden jump to a memory address. Essentially, the resulting disassembled code is unreliable, as it doesn’t accurately represent the code that will actually be executed.



  2. Review and Refinement: At this point, a request is made to obtain an initial analysis of the code. The analyst reviews the response and can modify both the ‘Summary’ and ‘Description’ fields with their own notes or corrections.

    In this case, the obtained code analysis correctly identifies an anti-disassembly technique that modifies the return address. However, it does not provide information about a possible return address that would help the analyst locate the hidden code.

    At this point, the analyst can modify the output provided by the endpoint to explain how this technique works. This way, the acquired knowledge can be used in the analysis of other code blocks within the sample. To do so, the analyst simply needs to include the (reviewed) analysis in the list of analyzed functions by clicking the ‘Accept’ button.



  3. Iterative Analysis and Improved Results: The file analysis continues in such a way that, with each new request, the list of analyzed functions is sent, effectively representing the knowledge acquired from analyzing the code selected by the analyst.


And as shown in the previous image, this knowledge is used in other function queries that employ a technique similar to the one previously discussed—this time providing more details about how it works and alerting the analyst to the possibility of jumping to an address containing hidden code.

Quick Tips

The endpoint offers some interesting features for the analyst. For example, as shown in the following figure, the presence of strings written in languages other than English has been detected, providing a translation and pinpointing their location in memory.



Separately, while analyzing assembly code has its own pros and cons compared to decompiled code, we can gain additional benefits by analyzing a decompiled function whose disassembled code has been previously analyzed and stored in the CodeInsight Notebook.

For example, let's look at the decompiled code of a function previously analyzed in its disassembled version:


The image below illustrates how analyzing a decompiled function becomes richer with the help of the previously stored analysis of its disassembled code. This happens because certain features, like text strings, are visible in the disassembled code but often missing from the decompiled version.

As a result, Code Insight can provide a more concise and direct explanation by leveraging the decompiled view, which is supported by the disassembled code.



It is important to highlight that both the endpoint and this new feature of the plugin for IDA Pro are offered in trial mode, with the aim of involving the community in the progress we are making in its application to the field of reverse engineering. Although the results produced by this new functionality have been very positive during the testing phase, it is possible that the output generated by the endpoint may not be 100% accurate and could contain errors or omit some relevant details of the analysis.

We are confident that this new integration will be a great help to analysts who are gradually incorporating LLM capabilities into their workflow. As we continue to harness the power of AI, your feedback is incredibly valuable to us. Stay connected for future updates, and thank you for your continued support.


Monday, August 25, 2025

Applying AI Analysis to PDF Threats

In our previous post we extended VirusTotal Code Insights to browser extensions and supply-chain artifacts. A key finding from that analysis was how our AI could apply contextual knowledge to its evaluation. It wasn’t just analyzing code in isolation; it was correlating a package's stated purpose (its name and description) with its actual behavior, flagging malicious logic that contradicted its public description. We’re now applying the same idea to one of the most common file formats in the world, the PDF.


Audio version of this post, created with NotebookLM Deep Dive

PDFs are multi-layered. There’s the object tree (catalog, pages, objects, streams, actions, embedded files) and there’s the visible layer (text/images the user reads). Code Insights analyzes both, then correlates: do the document's content, claims, and branding make sense given its internal behaviors? That lets us surface not only classic PDF exploitation (e.g., auto-actions, JS, external launches) but also pure social engineering (phishing, vishing, QR-lures) even when the file has no executable logic. This dual approach allows the AI not only to detect malicious code but also to identify sophisticated scams.
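
To make the "internal behaviors" side concrete, the sketch below counts PDF name objects that are commonly associated with active content. It is a simple keyword heuristic written for illustration, not a description of how Code Insights inspects the object tree; the list of names and the function name `count_risky_names` are choices made for this example.

```python
import re

# PDF name objects commonly associated with active behavior.
RISKY_NAMES = (b"/OpenAction", b"/AA", b"/JavaScript", b"/JS",
               b"/Launch", b"/EmbeddedFile")


def count_risky_names(pdf_bytes):
    """Count occurrences of each risky name in the raw bytes of a PDF."""
    counts = {}
    for name in RISKY_NAMES:
        # Require a non-alphanumeric character (or end of data) after the
        # name, so that '/JS' does not also match a longer name like '/JSFoo'.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        hits = len(re.findall(pattern, pdf_bytes))
        if hits:
            counts[name.decode("ascii")] = hits
    return counts
```

A raw byte scan like this misses names stored inside compressed streams or written with hex escapes, and it says nothing about the visible layer, which is exactly the part the cases below depend on.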

Let's look at real-world samples surfaced by Code Insights during its initial testing phase. We'll start with cases where the PDF contains no malicious code, which traditional engines often miss because there's no executable payload to detect. This is where Code Insights proves useful, identifying clear signs of fraud and social engineering that aim to manipulate the user, not the machine.


Case 1 - Fake debt collection targeting financial fraud

This PDF is a real-world sample sent to VirusTotal and captured by Code Insights during early testing. It was flagged as malicious based entirely on its visible content, without relying on any embedded code or execution logic. The file was marked as clean by all other engines, likely because it contains no scripts, exploits, or embedded payloads.

d92a1a7460c580f8bf6af3cbd39c7840cfe6a146ee15ede8e23c50c2a85becb9

The document pretends to be a debt collection notice from a German agency acting on behalf of Amazon. It includes a formal layout, legal threats, payment instructions, and multiple references to German addresses and regulations. Visually, it looks legitimate.


However, the AI flagged it as fraudulent based on several critical inconsistencies, the most important one being the destination bank account. The payment is requested to an IBAN starting with BG, indicating a Bulgarian account. This contradicts the sender's claimed German identity and would be highly unusual for a legitimate German debt agency. This mismatch alone was enough for Code Insights to classify the file as fraudulent. Additional content cues (urgent tone, fee breakdown, legal pressure) support the assessment.

As described in the Code Insights analysis:

“The visual and textual content confirms the document is a sophisticated phishing attack. It masquerades as an urgent payment demand from a German debt collection agency, supposedly on behalf of Amazon. The document employs high-pressure tactics, including threats of legal action, additional fees, and credit score damage, to compel the recipient to act quickly. The primary and most conclusive indicator of fraud is the demand for payment to a Bulgarian bank account, which is a stark and highly irregular contradiction to the agency's purported German location and registration.”

This is a case where AI adds value by reasoning over the content semantics, not the file structure.
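
The IBAN inconsistency is the kind of signal that can also be expressed as a simple rule. The sketch below is a rule-based approximation for illustration only (the AI analysis described here reasons over the whole document, not over a regular expression); the function name, the pattern, and the sample IBAN in the usage note are made up for this example.

```python
import re

# Country code, two check digits, then the account part, optionally
# printed in space-separated groups of four characters.
IBAN_RE = re.compile(
    r"\b([A-Z]{2})\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)


def iban_country_mismatches(text, claimed_country):
    """Return IBAN-like strings whose country prefix differs from the claimed one."""
    claimed = claimed_country.upper()
    mismatches = []
    for match in IBAN_RE.finditer(text):
        if match.group(1) != claimed:
            mismatches.append(match.group(0).replace(" ", ""))
    return mismatches
```

For a notice that claims a German sender, `iban_country_mismatches(text, "DE")` would return any IBAN-like string with a non-DE prefix, such as a made-up `BG00 TEST 1234 5678 9012 34`. The pattern does not validate check digits, so it is only a first-pass filter.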


Case 2 - QR-based phishing (quishing) campaign

This is another real-world PDF captured during early testing of Code Insights. At the time of analysis, no antivirus or malware detection engines flagged the file as malicious. The PDF has no embedded scripts, exploits, or execution logic. From a technical perspective, it looks benign.

259e202847d04866acd76427f53bfd9a15372ed6ed56a9e54ba1c62442c945ee

The visible content, however, impersonates an HR notification about a salary increase. It includes multiple social engineering red flags: awkward grammar, lack of personalization, and an irrelevant privacy disclaimer. The only call to action is a QR code, encouraging the recipient to scan it for more details.


Code Insights analyzed and decoded the QR, extracting the hidden URL. The domain is non-corporate and clearly unrelated to HR or payroll systems. The combination of deceptive HR messaging with a QR code that conceals a phishing URL confirms the document is a credential harvesting fraud delivered via PDF.
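
Once a QR code has been decoded (a step that needs an image library and is not shown here), checking whether the URL belongs to the organization the document claims to come from is straightforward. The following is a minimal sketch under that assumption; `is_off_brand` and the domains in the usage note are hypothetical.

```python
from urllib.parse import urlsplit


def is_off_brand(url, expected_domains):
    """True if the URL's host is neither an expected domain nor a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    for domain in expected_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return False
    return True
```

Matching on the full host or a dot-prefixed suffix avoids the common trap where `example.com.evil.test` is mistaken for a subdomain of `example.com`.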


Case 3 - Vishing via fake PayPal alert

This is another real-world PDF flagged by Code Insights during early evaluation. No antivirus or malware detection engines classified the file as malicious. Structurally, it’s simple and inert: there are no scripts, automatic actions, or embedded links. Minor stream decoding errors are present but considered low-risk anomalies.

d0bedc70085efff5218b901cdaba95d565df867495181544041ba4b8a6019cea


The threat lies entirely in the content. The document impersonates PayPal and trusted brands like Visa to deliver a fake security alert about a high-value unauthorized purchase. The language is urgent and designed to induce panic.

According to Code Insights:

“[...]the visual content of the document is a clear social engineering lure designed for a voice phishing (vishing) attack. [...] The document's sole purpose is to persuade the user to call a specific phone number under the pretense of canceling the fraudulent order. The malicious nature is confirmed by several red flags, including an awkwardly phrased greeting and a phone number with a geographic area code (808) that is deceptively labeled as "Toll-Free." This tactic aims to route the victim to a scammer for social engineering and potential fraud.”
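
The "Toll-Free" red flag can be approximated with a short check: in the North American Numbering Plan, toll-free numbers use the area codes 800, 833, 844, 855, 866, 877, and 888, so a number labeled toll-free with any other area code is suspicious. The sketch below is illustrative only; the function name and phone pattern are assumptions, and the number in the usage note is fictional.

```python
import re

# Area codes reserved for toll-free service in the North American Numbering Plan.
TOLL_FREE_CODES = {"800", "833", "844", "855", "866", "877", "888"}

# Ten-digit number with an optional parenthesized area code and separators.
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")


def mislabeled_toll_free(text):
    """Return non-toll-free area codes found in text that mentions toll-free service."""
    lowered = text.lower()
    if "toll-free" not in lowered and "toll free" not in lowered:
        return []
    return [code for code in PHONE_RE.findall(text) if code not in TOLL_FREE_CODES]
```

For example, `mislabeled_toll_free("Call Toll-Free: +1 (808) 555-0100")` returns `["808"]`, while a number starting with 800 yields an empty list.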


Case 4 - Fake Tax Refund from the Australian Taxation Office

As with previous cases, this PDF wasn’t flagged by any antivirus engine in VirusTotal, but Code Insights identified it as a phishing lure that impersonates the Australian Taxation Office.

b9b763e4b091bc59e9b9f355617622dbabdc1ff2de6707a94ccb26aa7682300e


As described by Code Insights:

“This document is a phishing lure designed to impersonate the Australian Taxation Office (ATO). The visual layer uses an authentic-looking government logo and the promise of a tax refund to entice the recipient into clicking an "Access Document" button. The purpose is to have the user provide an electronic signature for a supposed refund authorization, creating a sense of urgency and financial incentive. The document exhibits multiple red flags common to phishing attacks. These include a generic greeting, a suspicious reference to a .doc file (a common malware vector), instructions that discourage direct replies, and a complete lack of legitimate contact information or alternative methods for verification. The entire premise relies on tricking the user into clicking the button, which likely leads to a malicious website for credential theft or malware download.”


Auto-executing PDF Posing as a Movie Download

Unlike previous examples, this PDF was flagged by 13 antivirus engines in VirusTotal. In this case, the attack is embedded both in the internal structure of the file and its visual appearance. Code Insights correlates these two layers, the technical and the social, to expose the malicious intent.

44e653fe79d1ab160c784c06f4d99def6419e379ef3f802af9f48d595976d2c7


As described by Code Insights:

“The document presents a social engineering lure, masquerading as a download page for pirated movies […] to entice users into clicking links. This theme, centered on illegal content distribution, is a common tactic for malware delivery. Technical analysis of the PDF's internal structure corroborates the malicious intent. The file is configured with an /OpenAction command, a high-risk feature designed to automatically execute an action upon the document being opened […] The combination of a deceptive, high-risk theme with an automatic execution function indicates that the document’s purpose is to compromise the user's system.”

We are actively improving Code Insight based on what we learn from these early cases. PDF is the 6th most common file type submitted to VirusTotal, with around 100,000 new samples uploaded every day. That volume requires us to be strategic: for now, only a selected percentage of PDF files submitted via the public web interface are processed by Code Insight, as we test, tune, and scale the system.

These first results are helping us refine both effectiveness and performance. We’ll continue expanding coverage as we improve detection of threats.

Thursday, August 14, 2025

Code Insight Expands to Uncover Risks Across the Software Supply Chain

When we launched Code Insight, we started by analyzing PowerShell scripts. Since then, we have been continuously expanding its capabilities to cover more file types. Today, we announce that Code Insight can now analyze a broader range of formats crucial to the software supply chain. This includes browser extensions (CRX for Chrome, XPI for Firefox, VSIX for VS Code), software packages (Python Wheel, NPM), and protocols like MCP that enable Large Language Models to interact with external tools.


Audio version of this post, created with NotebookLM Deep Dive

Attackers are increasingly targeting these formats to distribute malware, steal data, or compromise systems. Traditional detection methods, which often rely on signatures or machine learning focused on classification, can struggle to keep up with the dynamic and obfuscated nature of these threats. This is where AI can make a real difference. By analyzing the underlying code logic, Code Insight can identify malicious behavior even in previously unseen threats, providing a deeper level of security analysis.

This is particularly relevant in a landscape where even a single malicious browser extension can lead to significant data breaches, financial loss, or the compromise of corporate networks.


A Viral Tweet and a Real-World Example

In the last few hours, a tweet from a seasoned crypto user (zak.eth) went viral, narrating how his wallet was drained by a malicious extension for the first time in over ten years of activity. This incident is a stark reminder that anyone can be a target.


This is a prime example of where Code Insight can be instrumental. It can analyze one of the suspicious extensions mentioned in the thread and reveal its malicious nature:

From here, we will explore different examples of the new formats supported by Code Insight and specific examples where traditional engines fail to detect a threat.


CRX (Chrome Extensions)

CRX files are the format used for packaging Google Chrome browser extensions. While they can enhance browsing, they also represent an attack vector if they contain malicious code. Here is an example of a seemingly legitimate "Norton Safe Search" extension. However, Code Insight's analysis reveals its true, malicious purpose:

6ca4466baf5ff09bab90a5d06bf113667717400daa59a287393e8f3f10959aba

The extension is obfuscated to hide its true purpose. The code in js/background.js communicates with a command and control (C2) server located at a domain unrelated to Norton. The most critical malicious behavior is its capability to fetch and execute arbitrary code from the C2 server. This allows the attacker to dynamically change the extension's functionality after installation, effectively turning the user's browser into a bot.

In another case, a banking trojan targeting Westpac customers was identified:

34244257f633e104d06b0c4273caca96eb916d26540eeea68495707cbc920bdb

This extension is a banking trojan specifically targeting Westpac customers. It operates as a Man-in-the-Browser (MitB) malware to steal credentials, session data, and funds. It establishes a persistent WebSocket connection to a hardcoded C2 server, collects all cookies from the browser and intercepts form submissions, specifically targeting the input field for the 'AuthorisationCode' (a 2FA/OTP token).


VSIX (Visual Studio Code Extensions)

VSIX files are used for extensions in Visual Studio Code, a popular code editor. Developers can be targeted through these extensions, potentially compromising their development environment and projects.

A deceptive "Zoom" extension for VS Code was found to be stealing user data:

5c89ba9e1bbb7ef869e4553081a40cabbd91a70506d759fd4e97eefb0434c074

The extension attempts to access sensitive user data by reading browser cookies from a known local SQLite database file. It also includes functionality to make external network requests to an unusual domain, which could be used to exfiltrate the collected sensitive data. This combination of local data collection and external communication is an indicator of malicious intent, specifically information theft.


XPI (Firefox Extensions)

XPI files are used for Firefox browser add-ons. Similar to Chrome extensions, they can be used to distribute malware.

A "Mass Tiktok Video Downloader" extension was found to be a phishing and data exfiltration tool:

2c0c8bd05a4942b389feaeb02c372b6443efac9d0931e0bdc602474178b54e7f

It presents a fake Facebook password confirmation popup to phish user credentials. Concurrently, its background script actively collects all browser cookies. All collected data, including the phished passwords, are exfiltrated to a Telegram bot API endpoint.
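
As a rough illustration of the behaviors described above, the sketch below opens a ZIP-based extension package (XPI files are ZIP archives) and reports which script or HTML files contain a few indicator strings. The indicator list and the name `scan_extension` are assumptions made for this example; string matching of this kind is far weaker than analyzing the code logic and is easily defeated by obfuscation.

```python
import io
import zipfile

# Illustrative indicator strings (assumptions for this sketch, not a rule set).
INDICATORS = ("api.telegram.org", "cookies.getAll")


def scan_extension(archive_bytes):
    """Map each .js/.html member of a ZIP-based extension to the indicators it contains."""
    findings = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith((".js", ".html")):
                continue
            text = archive.read(name).decode("utf-8", errors="replace")
            hits = [s for s in INDICATORS if s in text]
            if hits:
                findings[name] = hits
    return findings
```

A file that both enumerates cookies and references a Telegram bot endpoint would appear with both indicators, which matches the collection-plus-exfiltration pattern in this sample.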


WHL (Python Wheel)

WHL files are a standard format for distributing Python packages. The threats in these examples are not limited to intentionally malicious code; they also include packages with critical vulnerabilities or insecure coding patterns that can be exploited in supply chain attacks.

An "hh-applicant-tool" designed to interact with an API was found to have a suspicious telemetry feature:

1a168e47cb2d81f54fe504e66e353251a772164959ec71517d2070bf96fee957

It collects data, including vacancy details, employer information, and Google Docs links found in messages, and sends it to a custom server. This communication explicitly disables SSL certificate verification (verify=False), making the data transfer vulnerable to Man-in-the-Middle attacks.
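
Disabled certificate verification is easy to find statically in Python source. The sketch below walks the syntax tree with the standard `ast` module and reports calls that pass `verify=False`; it is a minimal example of the pattern, not the analysis Code Insight performs, and `find_disabled_tls_verification` is a name chosen for this illustration.

```python
import ast


def find_disabled_tls_verification(source):
    """Return line numbers of calls that pass the keyword argument verify=False."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for keyword in node.keywords:
            if (keyword.arg == "verify"
                    and isinstance(keyword.value, ast.Constant)
                    and keyword.value.value is False):
                lines.append(node.lineno)
    return sorted(lines)
```

The check matches any call with that keyword, so it can flag functions unrelated to HTTP, and it misses cases where the value is computed at runtime.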

In another instance, a package named "ncatbot" contained a critical security vulnerability:

f2714f6b87689c4d631a587813d14c4e463be7251bf16ff383ad2b7940ca7a4d

A critical security vulnerability exists in the Linux installation process, which executes a remote script with root privileges using curl | sudo bash. This allows for arbitrary code execution and system compromise if the remote script is malicious or its source is compromised.
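
The pipe-to-shell pattern can be spotted with a regular expression over install scripts or source files. This is a hedged sketch: the pattern and the name `find_pipe_to_shell` are assumptions for illustration, and the URL in the usage note is a placeholder.

```python
import re

# curl or wget output piped into a shell, optionally through sudo.
PIPE_TO_SHELL_RE = re.compile(
    r"\b(?:curl|wget)\b[^|\n]*\|\s*(?:sudo\s+)?(?:ba|z|da)?sh\b"
)


def find_pipe_to_shell(text):
    """Return the command fragments in text that pipe a download into a shell."""
    return [match.group(0) for match in PIPE_TO_SHELL_RE.finditer(text)]
```

For instance, a line such as `curl -s https://example.test/install.sh | sudo bash` is reported, while a plain download to a file is not.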


NPM (Node Package Manager)

NPM is the default package manager for Node.js and is central to the JavaScript ecosystem. Malicious NPM packages are a constant threat to developers and applications.

A package named "serverless-shop-functions" presented as a benign e-commerce application but contained two malicious Python scripts:

8f7a061901c935493e17f3f897a2b98b5ab21350593fda10a6936a84db5b28b7

Backdoor.Python.PolymorphNecro.h is identified as a polymorphic IRC botnet client. Its capabilities include network sniffing, ARP poisoning, and various DDoS attack methods. Main.py is a Discord-controlled Remote Access Trojan (RAT) with extensive capabilities, including establishing persistence, executing arbitrary PowerShell commands, and capturing and exfiltrating screenshots and webcam photos.


PyPI (Python Package Index)

PyPI is the official third-party software repository for Python. It's a common target for attackers looking to distribute malicious packages. However, the threat also comes from packages that, while not intentionally malicious, contain critical vulnerabilities in their design.

A package named python-mcp-client was found to have severe vulnerabilities allowing for remote code execution:

83c4c8d38e3eea555666e26ed85953b7479d46d9b4d2c12c521ae5f505b343d2

The package exposes severe vulnerabilities that allow for remote code execution (RCE) and arbitrary file system operations. The flask_app.py component allows users to dynamically add new MCP servers via the /api/add_server endpoint. This endpoint directly accepts user-provided command and args parameters, enabling an attacker to execute arbitrary shell commands on the host system.
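
One common way to avoid this class of flaw is to never take the command line from the request at all, and instead let callers choose among server definitions fixed by the operator. The sketch below illustrates that idea only; it is not taken from the package or from any fix for it, and `SERVER_REGISTRY`, `resolve_server`, and the command it contains are hypothetical.

```python
# Hypothetical registry: the operator predefines the only commands that may run.
SERVER_REGISTRY = {
    "files": ["example-mcp-server", "--root", "/srv/data"],
}


def resolve_server(name):
    """Return the predefined argv for a registered server name.

    The caller supplies only a name; the command and its arguments never
    come from user input, so a request cannot inject a shell command.
    """
    if not isinstance(name, str) or name not in SERVER_REGISTRY:
        raise ValueError(f"unknown server: {name!r}")
    return list(SERVER_REGISTRY[name])
```

The returned list would be passed to a process launcher as an argument vector without a shell, which keeps user-controlled text out of the executed command entirely.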

By expanding Code Insight's capabilities, we aim to provide the cybersecurity community with a tool to better understand and mitigate the evolving threats within the software supply chain. Stay tuned as we continue to enhance our platform to counter new attack vectors.