Wednesday, October 08, 2025

Simpler Access for a Stronger VirusTotal

VirusTotal (VT) was founded on a simple principle: we are all stronger when we work together. Every file shared, every engine integrated, and every rule contributed strengthens our collective defense against cyber threats.

In the spirit of that collaboration, and in light of recent community discussions, we want to share our vision for the future of the platform. We have heard your feedback on the need for simplicity and accessibility, and we are taking action. VT will continue to be broadly available with straightforward options, including a robust free tier for our contributors and community.

Our commitment is to ensure the long-term health and openness of the platform. To do that, we are focused on three key goals:

  • Preserve VT as an open, collaborative platform built for the common good.
  • Provide our contributors with a reliable, cost-effective, and long-term framework for partnership.
  • Improve access to advanced features for academics, researchers, and defenders dedicated to public service.

Today, Google Threat Intelligence offers new ways to access advanced and curated threat intelligence, powered by the combined intelligence of VT, Mandiant and Google. As part of this broader evolution, we’re making sure VT remains open and transparent, while offering flexible options that meet the needs of our diverse users, from security researchers and startups to MSSPs and other security vendors.

VT now offers simpler pricing with tiers optimized for our partner contributors and community. We’re also introducing a Contributor Tier, a dedicated model for our engine partners. It ensures continuous access to VT feeds, priority support, and early access to new features. This tier recognizes their essential role in keeping VirusTotal open, collaborative, and globally impactful.

Key Access Tiers
Tier | For Whom | Key Features | Annual Price
VT Community | Individual researchers, academics, educators. | File scanning, URL scanning, public API, community features. | Free.
VT Contributor | Technological partners contributing detection engines. | Feed of blindspots for free and discounts based on contribution tiers. | From free (feed of blindspots) upon program acceptance.
VT Lite | Small teams, early-stage startups, small MSSPs, SMB. Non-commercial. | Advanced search, YARA hunting, File downloading, Private API, Private Scanning. Low-moderate usage. | From $5k for low API volumes.
VT Duet | Large organizations. | Full feature set, high API quota. Community Intelligence only. | Based on number of affiliates covered and contribution level.

You’ll notice that security vendors who do not contribute detections are not included in these tiers, as we are reaffirming our long-standing 2016 commitment to a healthy community. We welcome any organization to become a contributor and join us in protecting the common good. If you want to contribute, please let us know.

While Google Threat Intelligence will continue to deliver advanced threat context for enterprise customers, VirusTotal will always remain the collaborative, transparent, and community-driven foundation.

Thank you for helping us make this possible. We’re here to build the next chapter with you, not just for you.

Bernardo Quintero
Founder of VirusTotal

Wednesday, October 01, 2025

Crowdsourced AI += Exodia Labs

We’re adding a new specialist to VirusTotal’s Crowdsourced AI lineup: Exodia Labs, with an AI engine focused on analyzing Chrome extension (.CRX) files. This complements our existing Code Insight and other AI contributors by helping users better understand this format and detect possible threats.

What you get in VirusTotal

  • Second opinion for .CRX: Exodia Labs adds another AI analysis stream alongside Code Insight. It gives a fresh, independent view on the same sample type. Like all Crowdsourced AI engines, it’s meant to complement (not replace) traditional detections and human analysis.
  • Clear verdict in the UI: Each Exodia report includes a simple verdict (benign, suspicious, or malicious) to help you quickly spot risky extensions.
  • Searchable results in VT Intelligence: You can now use new operators to search and pivot across Exodia Labs results:
    • exodialabs_ai_verdict:malicious | suspicious | benign
    • exodialabs_ai_analysis:<keywords>
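These operators can also be combined programmatically. The following is a minimal sketch, not an official client: `build_exodia_query` and `search_exodia` are hypothetical helper names, and `client` is assumed to behave like a vt-py `vt.Client`, whose `iterator()` yields objects exposing an `id` attribute.

```python
VALID_VERDICTS = {"malicious", "suspicious", "benign"}


def build_exodia_query(verdict=None, keywords=None):
    """Build a VT Intelligence query string from the Exodia Labs operators."""
    parts = []
    if verdict is not None:
        if verdict not in VALID_VERDICTS:
            raise ValueError(f"unknown verdict: {verdict!r}")
        parts.append(f"exodialabs_ai_verdict:{verdict}")
    if keywords:
        parts.append(f"exodialabs_ai_analysis:{keywords}")
    if not parts:
        raise ValueError("provide a verdict, keywords, or both")
    return " AND ".join(parts)


def search_exodia(client, query, limit=10):
    """Return the IDs of the objects matching `query`.

    Assumption: `client` offers iterator(path, params=..., limit=...)
    in the same way vt.Client does.
    """
    results = client.iterator(
        "/intelligence/search", params={"query": query}, limit=limit
    )
    return [obj.id for obj in results]


# Usage (requires `pip install vt-py` and an API key with Intelligence access):
#   import vt
#   with vt.Client("<your API key>") as client:
#       print(search_exodia(client, build_exodia_query("malicious", "financial")))
```

Keeping the query builder separate from the search call makes it easy to check the generated string before spending API quota on it.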

See it in action

Here are a few Exodia Labs AI report examples you can explore in VT:

31da559ae4af91106e0a18740d6bb8916e2017f6a37a02ea2a8127f1da30ec77

69c926ea84536bdaba7e4f765bde65eb0199ac30be3a96729a21ea7efa48d721

You can also explore Exodia Labs verdicts at scale using VirusTotal Intelligence.

For example, the following query lists Chrome extensions flagged as malicious and related to financial activity: exodialabs_ai_verdict:malicious AND exodialabs_ai_analysis:financial


This search shows several .CRX files where Exodia Labs AI detected suspicious financial behavior.

Let’s look at two examples:

  • Westpac Extension: Exodia Labs flags it as malicious. The AI analysis shows the extension connects to a remote WebSocket server and exfiltrates cookies, one-time passwords, and payment tokens. It manipulates banking pages and forwards captured credentials to a C2, showing signs of credential theft and financial data tampering.
    34244257f633e104d06b0c4273caca96eb916d26540eeea68495707cbc920bdb

  • Spidy Extension: Also flagged as malicious. The analysis shows it requests and cookies permissions, executes remote crawling jobs, and collects user profile and bank account details. The extension behaves like a data-exfiltration client handling financial credentials not mentioned in its public description.
    718eab32b5597e479d63f1d4e6402b7844eb9a4ee01c9028e44eb202d5ebcb2f

About Exodia Labs

Exodia Labs builds AI-driven analysis for Chrome Web Store extensions, also exposing a browser add-on that lets users request an AI assessment directly from an extension’s store page and view a detailed report plus a verdict. For security teams, the same analysis powers the backend results we index in VirusTotal.

Join Crowdsourced AI

Crowdsourced AI is about aggregating independent AI solutions that explain behavior and provide judgments across many file types, helping you understand unfamiliar code faster and spot novel threats sooner. If you build AI solutions that can help the community, we want to hear from you.

Tuesday, September 30, 2025

Advanced Threat Hunting: Automating Large-Scale Operations with LLMs

Last week, we were fortunate enough to attend the fantastic LABScon conference, organized by the SentinelOne Labs team. While there, we presented a workshop titled 'Advanced Threat Hunting: Automating Large-Scale Operations with LLMs.' The main goal of this workshop was to show attendees how they could automate their research using the VirusTotal API and Gemini. Specifically, we demonstrated how to integrate the power of Google Colab to quickly and efficiently generate Jupyter notebooks using natural language.

It goes without saying that the use of LLMs is a must for every analyst today. For this reason, we also want to make life easier for everyone who uses the VirusTotal API for research.

The Power of the VirusTotal API and vt-py

The VirusTotal API is the programmatic gateway to our massive repository of threat intelligence data. While the VirusTotal GUI is great for agile querying, the API unlocks the ability to conduct large-scale, automated investigations and access raw data with more pivoting opportunities.

To make interacting with the API even easier, we recommend using the vt-py library. It simplifies much of the complexity of HTTP requests, JSON parsing, and rate limit management, making it the go-to choice for Python users.
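As a rough illustration of what that looks like in practice, here is a minimal sketch that fetches a file report and summarizes its detection counts. `detection_summary` is a hypothetical helper name, and `client` is assumed to be a `vt.Client` (or any object with a compatible `get_object` method).

```python
def detection_summary(client, sha256):
    """Fetch a file report and return a few of its detection counters.

    Assumption: `client.get_object("/files/<hash>")` returns an object whose
    get(name, default) method exposes the report attributes, as vt-py does.
    """
    file_obj = client.get_object(f"/files/{sha256}")
    stats = file_obj.get("last_analysis_stats", {})
    return {
        "sha256": sha256,
        "malicious": stats.get("malicious", 0),
        "suspicious": stats.get("suspicious", 0),
        "undetected": stats.get("undetected", 0),
    }


# Usage (requires `pip install vt-py` and an API key):
#   import vt
#   with vt.Client("<your API key>") as client:
#       print(detection_summary(client, "<sha256 of the file>"))
```

The library takes care of the HTTP request and JSON decoding, so the helper only has to pick the fields it cares about.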

From Natural Language to Actionable Intelligence with Gemini

To bridge the gap between human questions and API queries, we can leverage the Gemini assistant integrated in Google Colab. We have created a "meta Colab" notebook (which we will share soon) that is pre-populated with real, working code snippets for interacting with the VirusTotal API to retrieve information such as campaigns, threat actors, malware, samples, and URLs, among others. This provides Gemini with the necessary context to understand your natural language requests and generate accurate Python code to query the VirusTotal API. Gemini doesn't call the API directly; it creates the code snippet for you to execute.

For Gemini to generate accurate and relevant code, it needs context. Our meta Colab notebook is filled with examples that act as a guide. For complex questions, it helps to provide the exact field names that you want to work with. This context generally falls into two categories:

  1. Reference Documentation: We include detailed documentation directly in the Colab. For example, we provide a comprehensive list of all available file search modifiers for the VirusTotal Intelligence search endpoint. This gives Gemini the "vocabulary" it needs to construct precise queries.
  2. Working Code Examples: The notebook is pre-populated with dozens of working vt-py code snippets for common tasks like retrieving file information, performing an intelligence search, or getting relationships. This gives Gemini the "grammar" and correct patterns for interacting with our API.

Example of the code snippet context that we have included in our meta Colab:

query_results_with_behaviors = []
query = "have:sigma have:yara have:ids have:malware_config fs:1d+ have:bundled_file tag:overlay"
RELATIONS = "behaviours"

async for itemobj in cli.iterator('/intelligence/search',params={'query': query, 'relationships': RELATIONS, 'relationship_attributes[%s]'%(RELATIONS): '*'},limit=0):
    query_results_with_behaviors.append(itemobj.to_dict())

res_with_behaviors = list(query_results_with_behaviors)

Diagram showing the flow from User's Action to Actionable Intelligence via Gemini and VirusTotal API
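Each dictionary collected by a search like the one in the snippet carries the requested relationship under a `relationships` key. The sketch below shows one way to pull the related objects' attributes back out; `related_attributes` is a hypothetical helper, and the sample record is made-up data shaped like the output of `to_dict()`.

```python
def related_attributes(item, relation):
    """Return the attribute dicts of the objects attached via `relation`.

    Missing keys are tolerated so that files without the relationship
    simply yield an empty list.
    """
    data = item.get("relationships", {}).get(relation, {}).get("data") or []
    return [entry.get("attributes", {}) for entry in data]


# Made-up record for illustration only.
sample_item = {
    "id": "example-file-id",
    "attributes": {"type_tag": "lnk"},
    "relationships": {
        "behaviours": {
            "data": [
                {"id": "sandbox-a", "attributes": {"sandbox_name": "Sandbox A"}},
                {"id": "sandbox-b", "attributes": {"sandbox_name": "Sandbox B"}},
            ]
        }
    },
}

names = [a["sandbox_name"] for a in related_attributes(sample_item, "behaviours")]
print(names)  # ['Sandbox A', 'Sandbox B']
```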

Practical Example: Hunting for Malicious LNK Files

One of the exercises in our workshop focused on analyzing malicious LNK files submitted to VirusTotal in a specific period of time to limit results. The goal was to discover common execution techniques and interesting command-line patterns.

Here’s how you could approach this with natural language queries in our meta Colab:

1. Retrieve all LNK file objects matching the criteria.

You can start by asking Gemini:

Create a code snippet to retrieve all LNK files submitted from January 1st, 2025, until April 1st, 2025, with 5 or more positives.

Code generated:

query_results = []

# Query for LNK files submitted between January 1st, 2025, and April 1st, 2025, with 5 or more positives.
query = "fs:2025-01-01+ fs:2025-04-01- p:5+ type:lnk"

async for itemobj in cli.iterator('/intelligence/search',params={'query': "%s"%(query)},limit=0):
    query_results.append(itemobj.to_dict())

lnk_files = list(query_results)
print(f"Found {len(lnk_files)} LNK files matching the criteria.")
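The query string in this snippet combines a date window on first submission (`fs:`), a minimum number of positives (`p:`) and a file type (`type:`). As a small sketch of how such a string can be assembled from typed values, assuming a hypothetical helper named `build_submission_query`:

```python
from datetime import date


def build_submission_query(start, end, min_positives, file_type):
    """Compose a search query from a date window, a detection floor and a type.

    `fs:<date>+` means "first submitted on or after", `fs:<date>-` means
    "first submitted on or before", and `p:<n>+` means "n or more positives".
    """
    if start > end:
        raise ValueError("start must not be after end")
    return (
        f"fs:{start.isoformat()}+ "
        f"fs:{end.isoformat()}- "
        f"p:{min_positives}+ "
        f"type:{file_type}"
    )


query = build_submission_query(date(2025, 1, 1), date(2025, 4, 1), 5, "lnk")
print(query)  # fs:2025-01-01+ fs:2025-04-01- p:5+ type:lnk
```

Building the window from `date` objects avoids typos in the date format and makes it simple to slide the window when a search returns too many results.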

2. Flatten the data and visualize it.

Next, you can ask Gemini to process the retrieved data and create visualizations:

Flatten the data in a dataframe and create a bar chart depicting the most common TargetFileDOSName values.

Code generated:

import pandas as pd
import plotly.express as px

# Extract the 'attributes' dictionary from each object in our results
attributes_list = [result.get('attributes', {}) for result in lnk_files]

# Use json_normalize to create a flat DataFrame
df_flat = pd.json_normalize(attributes_list)

# Check if 'exiftool.TargetFileDOSName' column exists and is not empty
if 'exiftool.TargetFileDOSName' in df_flat.columns and not df_flat['exiftool.TargetFileDOSName'].empty:
    # Count the occurrences of each TargetFileDOSName
    target_file_counts = df_flat['exiftool.TargetFileDOSName'].value_counts().reset_index()
    target_file_counts.columns = ['TargetFileDOSName', 'Count']

    # Get the top 10 most common TargetFileDOSName values for visualization
    top_target_files = target_file_counts.head(10)

    # Create an interactive bar chart
    fig = px.bar(top_target_files,
                 x='Count',
                 y='TargetFileDOSName',
                 orientation='h',
                 title='Top 10 Most Common TargetFileDOSName in LNK files',
                 labels={'Count': 'Number of Occurrences', 'TargetFileDOSName': 'Target File DOS Name'})
    fig.show()
else:
    print("The 'exiftool.TargetFileDOSName' column was not found or is empty in the flattened data.")

print("DataFrame shape:", df_flat.shape)
df_flat.head()

This will generate a code snippet using pandas and Plotly to give you a clear view of the most frequent target files in the malicious LNKs.
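Column names such as `exiftool.TargetFileDOSName` come from flattening nested dictionaries into dotted keys. The following standard-library sketch illustrates that idea and the counting step on made-up records; it is not how pandas implements `json_normalize`, only a simplified equivalent for nested dictionaries.

```python
from collections import Counter


def flatten(record, prefix=""):
    """Flatten nested dicts into one dict whose keys are dotted paths."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Made-up records shaped like the 'attributes' of a file object.
sample_attributes = [
    {"exiftool": {"TargetFileDOSName": "cmd.exe"}, "size": 2048},
    {"exiftool": {"TargetFileDOSName": "powershell.exe"}, "size": 1024},
    {"exiftool": {"TargetFileDOSName": "cmd.exe"}, "size": 4096},
    {"size": 512},  # a file with no exiftool data at all
]

rows = [flatten(attrs) for attrs in sample_attributes]
counts = Counter(
    row["exiftool.TargetFileDOSName"]
    for row in rows
    if "exiftool.TargetFileDOSName" in row
)
print(counts.most_common(10))  # [('cmd.exe', 2), ('powershell.exe', 1)]
```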

3. Visualize relationships with a Sankey diagram.

To dig deeper and understand the relationships between the infected machines and the malware, you can ask for a more advanced visualization:

Create a Sankey diagram connecting MachineID (source) to TargetFileDOSName (target), with link widths representing the number of occurrences. Include only MachineID values with 10 or more occurrences, grouping the rest as "Other", and make the visualization interactive with hover details.

Code generated:

import plotly.graph_objects as go
import pandas as pd

# Ensure df_flat is available from previous steps
# If not, you might need to re-run the cell that creates df_flat.

# Check if necessary columns exist in df_flat
if 'lnk_info.machine_id' in df_flat.columns and 'exiftool.TargetFileDOSName' in df_flat.columns:

    # Calculate the frequency of each machine ID and target file name pair
    pair_frequencies = df_flat.groupby(['lnk_info.machine_id', 'exiftool.TargetFileDOSName']).size().reset_index(name='frequency')

    # Identify MachineIDs with less than 10 occurrences
    machine_id_counts = pair_frequencies.groupby('lnk_info.machine_id')['frequency'].sum().reset_index(name='count')
    other_machine_ids = machine_id_counts[machine_id_counts['count'] < 10]['lnk_info.machine_id'].tolist()

    # Replace infrequent MachineIDs with 'Other'
    pair_frequencies['source_node'] = pair_frequencies['lnk_info.machine_id'].apply(lambda x: 'Other' if x in other_machine_ids else x)

    # Aggregate frequencies for the 'Other' node and other valid nodes
    sankey_data = pair_frequencies.groupby(['source_node', 'exiftool.TargetFileDOSName'])['frequency'].sum().reset_index()

    # Create unique labels for the Sankey nodes
    all_nodes = pd.concat([sankey_data['source_node'], sankey_data['exiftool.TargetFileDOSName']]).unique()
    label_to_index = {label: i for i, label in enumerate(all_nodes)}

    # Create the Sankey diagram
    fig = go.Figure(data=[go.Sankey(
        node=dict(
            pad=15,
            thickness=20,
            line=dict(color="black", width=0.5),
            label=all_nodes,
            color="blue"
        ),
        link=dict(
            source=sankey_data['source_node'].map(label_to_index),
            target=sankey_data['exiftool.TargetFileDOSName'].map(label_to_index),
            value=sankey_data['frequency'],
            label=sankey_data.apply(lambda row: f"{row['source_node']} -> {row['exiftool.TargetFileDOSName']}: {row['frequency']}", axis=1)
        )
    )])

    fig.update_layout(title_text="MachineID to TargetFileDOSName Sankey Diagram (Infrequent MachineIDs grouped as 'Other')", font_size=10)
    fig.show()
else:
    print("Required columns 'lnk_info.machine_id' or 'exiftool.TargetFileDOSName' not found in the DataFrame.")

This will create a powerful visualization that shows how different machines are being targeted with specific files, allowing us to identify patterns for each MachineID.

Sankey diagram
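The core of that diagram is the aggregation step: count each (MachineID, target) pair, and fold rarely seen MachineIDs into a single "Other" source. Here is a standard-library sketch of that logic on made-up pairs; `sankey_links` is a hypothetical helper, and the threshold is lowered to 2 only to keep the sample small.

```python
from collections import Counter


def sankey_links(pairs, min_count=10, other_label="Other"):
    """Aggregate (machine_id, target) pairs into Sankey link weights.

    Machine IDs seen fewer than `min_count` times are merged into
    `other_label` so the diagram stays readable.
    """
    pairs = list(pairs)
    per_machine = Counter(machine for machine, _ in pairs)
    links = Counter()
    for machine, target in pairs:
        source = machine if per_machine[machine] >= min_count else other_label
        links[(source, target)] += 1
    return links


# Made-up data for illustration only.
sample_pairs = [("desktop-a", "cmd.exe")] * 3 + [
    ("desktop-b", "cmd.exe"),
    ("desktop-c", "powershell.exe"),
]

links = sankey_links(sample_pairs, min_count=2)
for (source, target), weight in sorted(links.items()):
    print(f"{source} -> {target}: {weight}")
```

Each resulting key is one link in the diagram, and its count becomes the link width.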

4. Analyze the behavior of the LNK files.

Finally, you can investigate the behavior of the LNK files themselves, for example, by looking at how they are executed:

Create a pie chart depicting the distribution of window behaviors for LNK target processes (hidden, minimized,...).

Code generated:

import plotly.express as px
import pandas as pd

# Check if the relevant column exists
if 'lnk_info.header.show_window_str' in df_flat.columns and not df_flat['lnk_info.header.show_window_str'].empty:
    # Count the occurrences of each window behavior string
    window_behavior_counts = df_flat['lnk_info.header.show_window_str'].value_counts().reset_index()
    window_behavior_counts.columns = ['Window Behavior', 'Count']

    # Create an interactive pie chart
    fig = px.pie(window_behavior_counts,
                 names='Window Behavior',
                 values='Count',
                 title='Distribution of LNK Target Process Window Behaviors',
                 hover_data={'Count': True})
    fig.show()
else:
    print("The 'lnk_info.header.show_window_str' column was not found or is empty in the flattened data. Cannot create the pie chart.")

Pie chart of window behaviors

This allows you to quickly identify whether the LNK files are trying to hide their execution, a common tactic for malware.

5. Geographic Threat Mapping of Submissions

Understanding the geographical distribution of malware submissions can provide crucial insights into whether a campaign is widespread or targeting a specific region. By analyzing the submission metadata for our set of LNK files, an analyst can quickly identify potential hotspots, which could indicate a geographically focused attack. This is achieved by fetching the submissions relationship for each file, extracting the country of origin for each submission, and plotting this data on a world map (note that a submission country is not necessarily a targeted country, although in some cases the two may be related).

Using the LNK files identified previously, create a code snippet that retrieves the submissions relationship for each file. Then, count the number of submissions from each country and display the results on a choropleth world map to visualize the submission hotspots.

In this case, Gemini uses step-by-step execution, which is essentially a sequence of consecutive executions. The code generated was:

Step 1
query_results_with_submissions = []

# Define the search query string for LNK files
query = "fs:2025-01-01+ fs:2025-04-01- p:5+ type:lnk"

# Define the relationship to request
RELATIONS = "submissions"

# Define the relationship attributes to retrieve (all attributes)
ATTRIBUTES_RELATIONS = "*"

# Use cli.iterator to perform an intelligence search with the specified query, relationships, and relationship attributes.
async for itemobj in cli.iterator('/intelligence/search', params={'query': query, 'relationships': RELATIONS, 'relationship_attributes[%s]'%(RELATIONS): ATTRIBUTES_RELATIONS}, limit=0):
    query_results_with_submissions.append(itemobj.to_dict())

# Convert query_results_with_submissions to a list called lnk_files_with_submissions
lnk_files_with_submissions = list(query_results_with_submissions)

# Print the number of LNK files found with submission data.
print(f"Found {len(lnk_files_with_submissions)} LNK files with submission data matching the criteria.")
Step 2
import pandas as pd

# Initialize an empty list to store submission countries
submission_countries = []

# Iterate through each item (LNK file) in the lnk_files_with_submissions list
for item in lnk_files_with_submissions:
    # Check if 'relationships' and 'submissions' exist and are not empty
    if 'relationships' in item and 'submissions' in item['relationships'] and item['relationships']['submissions']['data']:
        # Iterate through each submission object
        for submission in item['relationships']['submissions']['data']:
            # Extract the 'country' from 'submission['attributes']' and append it to the list
            if 'country' in submission['attributes']:
                submission_countries.append(submission['attributes']['country'])

# Convert the submission_countries list into a pandas Series
submission_countries_series = pd.Series(submission_countries)

# Count the occurrences of each country and store in a DataFrame
country_counts = submission_countries_series.value_counts().reset_index()
country_counts.columns = ['country', 'count']

print(f"Total unique countries found: {len(country_counts)}")
print("Top 5 submission countries:")
print(country_counts.head())
Step 3
!pip install pycountry
import plotly.express as px
import pandas as pd
import pycountry

# Function to get ISO-3 code from country name
def get_iso3(country_name):
    try:
        return pycountry.countries.search_fuzzy(country_name)[0].alpha_3
    except LookupError:  # raised by search_fuzzy when no country matches
        return None

# Apply the function to get ISO-3 codes
country_counts['iso_alpha'] = country_counts['country'].apply(get_iso3)

# Create a world heatmap using Plotly Express with ISO-3 codes
fig = px.choropleth(country_counts.dropna(subset=['iso_alpha']), # Drop rows where ISO code couldn't be found
                      locations="iso_alpha", # Use ISO-3 codes for locations
                      locationmode='ISO-3',   # Specify ISO-3 location mode
                      color="count",
                      hover_name="country",
                      color_continuous_scale="Viridis",
                      title="Submission Country Heatmap")

# Update layout to increase figure size
fig.update_layout(
    autosize=False,
    width=1000, # Adjust width as needed
    height=600 # Adjust height as needed
)

fig.show()

World heatmap of submissions

More Examples

Here are some other examples of how you can use natural language to query the VirusTotal API:

  • Find malicious Chrome extensions (based on this blog):

This question is a good example of how Gemini in Google Colab runs the step-by-step process. Because we asked several questions in the same prompt, Gemini splits the work into separate steps.

Retrieve all the samples from the /intelligence/search endpoint with the following query: fs:2025-08-01+ fs:2025-08-14- p:5+ tag:crx have:codeinsight.

Get the bundled_files and create a new dataframe with them too, but store only those having 5 or more malicious value.

Then identify the top 5 magika of those bundled files obtained

Code generated:

Step 1
query_results_with_bundled_files = []

# Define the search query string
query = "fs:2025-08-01+ fs:2025-08-14- p:5+ tag:crx have:codeinsight"

# Define the relationship to request
RELATIONS = "bundled_files"

# Define the relationship attributes to retrieve (all attributes)
ATTRIBUTES_RELATIONS = "*"

# Use cli.iterator to perform an intelligence search with the specified query, relationships, and relationship attributes.
async for itemobj in cli.iterator('/intelligence/search', params={'query': query, 'relationships': RELATIONS, 'relationship_attributes[%s]'%(RELATIONS): ATTRIBUTES_RELATIONS}, limit=0):
    query_results_with_bundled_files.append(itemobj.to_dict())

# Convert query_results_with_bundled_files to a list called crx_files_with_bundled_files
crx_files_with_bundled_files = list(query_results_with_bundled_files)

# Print the number of found CRX files with bundled file data.
print(f"Found {len(crx_files_with_bundled_files)} CRX files with bundled file data matching the criteria.")
Step 2
import pandas as pd

# List to store all bundled file attributes
all_bundled_files_data = []

# Iterate through each CRX file with bundled file data
for crx_file in crx_files_with_bundled_files:
    if 'relationships' in crx_file and 'bundled_files' in crx_file['relationships']:
        for bundled_file in crx_file['relationships']['bundled_files']['data']:
            all_bundled_files_data.append(bundled_file.get('attributes', {}))

# Create a DataFrame from the extracted bundled file attributes
df_bundled_files = pd.json_normalize(all_bundled_files_data)

# Filter for bundled files with 5 or more malicious detections
df_malicious_bundled_files = df_bundled_files[df_bundled_files['last_analysis_stats.malicious'] >= 5]

print(f"Total bundled files extracted: {len(df_bundled_files)}")
print(f"Bundled files with 5 or more malicious detections: {len(df_malicious_bundled_files)}")

df_malicious_bundled_files.head()
Step 3
import pandas as pd
import plotly.express as px

# Check if the 'magika' column exists and is not empty in the filtered DataFrame
if 'magika' in df_malicious_bundled_files.columns and not df_malicious_bundled_files['magika'].empty:
    # Count the occurrences of each magika type
    magika_counts = df_malicious_bundled_files['magika'].value_counts().reset_index()
    magika_counts.columns = ['Magika Type', 'Count']

    # Get the top 5 most frequent magika types
    top_5_magika = magika_counts.head(5)

    print("Top 5 Magika Types in malicious bundled files:")
    print(top_5_magika)

    # Visualize the top 5 magika types
    fig = px.bar(top_5_magika,
                 x='Count',
                 y='Magika Type',
                 orientation='h',
                 title='Top 5 Magika Types in Malicious Bundled Files',
                 labels={'Count': 'Number of Occurrences', 'Magika Type': 'Magika Type'}) 
    fig.update_layout(yaxis={'categoryorder':'total ascending'}) # Order bars by count
    fig.show()
else:
    print("The 'magika' column was not found or is empty in the filtered malicious bundled files DataFrame. Cannot identify top magika types.")

  • Retrieve threat actors:

Retrieve threat actors targeting the United Kingdom with an espionage motivation. Sort the results in descending order of relevance. Display the total number of threat actors and their names.

  • Investigate campaigns:

Retrieve information about threat actors and malware involved in campaigns targeting Pakistan. For each threat actor, retrieve its country of origin, motivations, and targeted industries. For each malware, retrieve its name.

What’s next

This workshop, co-authored with Aleksandar from Sentinel LABS, will be presented at future conferences to show the community how to get the most out of the VirusTotal API. We'll be updating the content of our meta Colab regularly and will share more information soon about how to access it.

In the meantime, if you have any feedback or ideas to contribute, we are open to suggestions.

Friday, September 05, 2025

Supercharging Your Threat Hunts: Join VirusTotal at LABScon for a Workshop on Automation and LLMs

We are excited to announce that our colleague Joseliyo Sánchez will be at LABScon to present our workshop: Advanced Threat Hunting: Automating Large-Scale Operations with LLMs. This workshop is a joint effort with SentinelOne and their researcher, Aleksandar Milenkoski.

In today's rapidly evolving threat landscape, security professionals face an overwhelming tide of data and increasingly sophisticated adversaries. This hands-on workshop is designed to empower you to move beyond the traditional web interface and harness the full potential of the VirusTotal Enterprise API for large-scale, automated threat intelligence and hunting. 

We will dive deep into how you can use the VirusTotal Enterprise API with Python and Google Colab notebooks to automate the consumption of massive datasets. You'll learn how to track the behaviors of advanced persistent threat (APT) actors and cybercrime groups through practical, real-time exercises. 

A key part of our workshop will focus on leveraging Large Language Models (LLMs) to supercharge your analysis. We'll show how you can use AI to help understand complex data, build better queries, and create insightful visualizations to enrich your information for a deeper understanding of threats. 

This session is ideal for cyber threat intelligence analysts, threat hunters, incident responders, SOC analysts, and security researchers looking to automate and scale up their threat hunting workflows. 

After the workshop, we will publish a follow-up blog post that will delve deeper into some of the exercises and examples presented, providing a valuable resource for further learning and implementation. 

We look forward to seeing you at LABScon!

(All of the scenarios are compatible with Google Threat Intelligence)

 ---- 
Conference website: https://www.labscon.io/ 
Date: September 17-20, 2025 
Registration: Invite-Only 
Place: Scottsdale, Arizona 
Duration: 3-5h

Thursday, September 04, 2025

Uncovering a Colombian Malware Campaign with AI Code Analysis

VirusTotal Code Insight keeps adding new file formats. This time, we’re looking at two vector-based formats from very different eras: SWF and SVG. Curiously, right after we rolled out this update in production, one of the very first submitted files gave us a perfect, and unexpected, example of Code Insight in action: it uncovered an undetected malware campaign using SVG files that impersonated the Colombian justice system.

Audio version of this post, created with NotebookLM Deep Dive

SWF: a blast from the past

Flash is dead: Adobe killed it in 2020, and browsers stopped supporting it shortly after. But surprisingly, SWF files still show up on VirusTotal. Whether it’s old malware resurging, retro hunting, or long-tail campaigns, they haven’t disappeared completely.

In fact, VirusTotal received 47,812 unique SWF files in the last 30 days that had never been seen before, and 466 of them were flagged as malicious by at least one antivirus engine.

SWF files are binary and compiled. That means Code Insight needs to:

  • Unpack and decompress the container (often zlib or LZMA)
  • Parse the internal tag structure
  • Extract embedded scripts, either ActionScript 2 (AVM1) or ActionScript 3 (AVM2 bytecode + decompiling/disassembling)

Once we lift those scripts into something closer to pseudocode or readable disassembly, the LLM steps in to summarize what the file is doing and why it might be suspicious.
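To make the first of those steps concrete, here is a minimal sketch of reading the SWF container header and undoing zlib compression. It is an illustration only, not Code Insight's implementation: `unpack_swf` is a hypothetical helper, it assumes the standard 8-byte header (3-byte signature, version byte, little-endian uncompressed length), and LZMA containers are deliberately left out.

```python
import struct
import zlib


def unpack_swf(data):
    """Return (version, body, length_ok) for a SWF byte string.

    `body` is everything after the 8-byte header, decompressed if needed.
    `length_ok` tells whether the length declared in the header matches
    the unpacked size, which is worth checking on suspicious samples.
    """
    if len(data) < 8:
        raise ValueError("too short to be a SWF file")
    signature, version, declared_length = struct.unpack("<3sBI", data[:8])
    if signature == b"FWS":        # uncompressed container
        body = data[8:]
    elif signature == b"CWS":      # zlib-compressed after the header
        body = zlib.decompress(data[8:])
    elif signature == b"ZWS":      # LZMA-compressed, not handled in this sketch
        raise NotImplementedError("LZMA (ZWS) containers are not handled here")
    else:
        raise ValueError(f"not a SWF signature: {signature!r}")
    return version, body, len(body) + 8 == declared_length


# Build a tiny synthetic CWS container to exercise the function.
fake_body = b"\x00" * 20
fake_swf = b"CWS" + bytes([10]) + struct.pack("<I", len(fake_body) + 8) + zlib.compress(fake_body)
print(unpack_swf(fake_swf)[0])  # 10
```

Parsing the tag structure and lifting ActionScript out of the body are separate, considerably larger steps that this sketch does not attempt.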

SVG: modern, open, and still abusable

SVG, on the other hand, is very much alive. It’s a standard web format, open, text-based, and everywhere: websites, design tools, build systems. And that’s also why attackers like it.

In the last 30 days alone, VirusTotal received 140,803 unique SVG files that had never been seen before, and 1,442 of them were flagged as malicious by at least one antivirus engine. Curiously, that is roughly 1% showing up with detections, just like SWF.

SVG is just XML with <svg> at the root. If it’s a .svgz, we decompress it first. From there, Code Insight looks for:

  • Embedded JavaScript in <script> tags or event handlers (onload, onclick…)
  • Redirects using javascript: URLs or location.href
  • Obfuscation tricks (CDATA, character entities, base64 payloads, etc.)

Because SVG is plain text, the challenge isn’t unpacking, it’s spotting the malicious logic hiding in plain sight.
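As a rough illustration of the kind of checks listed above, the sketch below walks an SVG with Python's standard XML parser and reports script elements, event-handler attributes, `javascript:` URLs and `location.href` redirects. These are simplified heuristics chosen for the example, not Code Insight's actual logic, and `scan_svg` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def local_name(name):
    """Strip the '{namespace}' prefix ElementTree adds to tags and attributes."""
    return name.rsplit("}", 1)[-1]


def scan_svg(svg_text):
    """Return a list of (kind, detail) findings for script-capable constructs.

    Note: ElementTree is not hardened against malicious XML; for untrusted
    samples consider a hardened parser such as defusedxml.
    """
    findings = []
    root = ET.fromstring(svg_text)
    for element in root.iter():
        tag = local_name(element.tag).lower()
        if tag == "script":
            code = (element.text or "").strip()  # CDATA content arrives as text
            findings.append(("script", code[:60]))
            if "location.href" in code:
                findings.append(("redirect", "location.href"))
        for attr, value in element.attrib.items():
            name = local_name(attr).lower()
            if name.startswith("on"):  # onload, onclick, ...
                findings.append(("event-handler", f"{tag}@{name}"))
            if value.strip().lower().startswith("javascript:"):
                findings.append(("javascript-url", f"{tag}@{name}"))
    return findings


# Harmless, made-up sample for illustration.
sample_svg = """<svg xmlns="http://www.w3.org/2000/svg" onload="init()">
  <script><![CDATA[ window.location.href = "https://example.invalid/"; ]]></script>
  <a xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="javascript:alert(1)"><text>click</text></a>
</svg>"""

for kind, detail in scan_svg(sample_svg):
    print(kind, "|", detail)
```

Checks like these only surface where script can run; deciding whether the script is malicious still requires reading what it does, which is the part the LLM analysis addresses.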

Let’s see a couple of examples:

When a SWF is flagged, but isn’t malicious

One common challenge in threat analysis is dealing with files that trigger detections in just a few antivirus engines. They’re not clean, but they’re not clearly malicious either. These gray areas force analysts to dig deeper, often wasting time chasing false positives.

The SWF file in the screenshot is a perfect example.

350422c3915a8a1a1336147f89061b25c8354af58db0050e2f9ef2b384e59f62

It was flagged by 3 out of 63 engines. Enough to raise doubts, but not conclusive. The detections mention known SWF heuristics and an old CVE.

Thanks to Code Insight, we can quickly understand what’s going on. It identifies the SWF as a complex ActionScript-based game, including 3D rendering, sound management, and a full level editor. The analysis also explains why the file might look suspicious: it uses obfuscated classes and cryptographic functions (like RC4 and AES), and gathers system details, techniques often associated with malware, but also common in Flash games to enforce DRM or prevent tampering.

The verdict? No malicious behavior was observed, and now we know why it looked suspicious in the first place.

This kind of context is exactly what Code Insight is designed for: saving time, reducing uncertainty, and helping you focus on real threats.


When AV misses, but Code Insight doesn’t

This second example shows the other side of the coin: a malicious SVG file that evaded all antivirus engines, going completely undetected on VirusTotal. On the surface, it looks clean, but a quick look with Code Insight tells a very different story.

1527ef7ac7f79bb1a61747652fd6015942a6c5b18b4d7ac0829dd39842ad735d

According to Code Insight: “This SVG file executes an embedded JavaScript payload upon rendering. The script decodes and injects a Base64-encoded HTML phishing page impersonating a Colombian government judicial system portal. To deceive the user, it simulates a file download with a progress bar, while in the background, it decodes a second, large Base64 string, which is a malicious ZIP archive, and forces its download.”

We validated this behavior by opening the sample in a controlled environment. As shown in the screenshots below, the fake portal is rendered exactly as described, simulating an official government document download process. The phishing site includes case numbers, security tokens, and visual cues to build trust, all of it crafted within an SVG file.


Despite its zero detections, this SVG hides two layers of abuse:

  • A convincing phishing lure, injected via inline JavaScript and decoded on-the-fly
  • A malware dropper, silently extracting and triggering the download of a ZIP file in the background

This is exactly the kind of threat Code Insight is meant to catch: well-crafted, script-based attacks that fly under the radar.

A deeper look: from one SVG to a full campaign

Curiously, the malicious SVG we highlighted earlier wasn’t just any random sample, it was one of the very first files submitted right after we deployed SVG support in Code Insight. A coincidence? Or were we seeing the tip of something bigger?

Thanks to VirusTotal Intelligence, we can search through our massive sample collection using hundreds of parameters, including queries that look inside Code Insight reports. So we ran:

type:svg AND codeinsight:"Colombian"


And voilà: 44 unique SVG files surfaced, all undetected by antivirus engines, but all flagged by Code Insight as part of the same phishing and malware campaign.
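The same search can also be issued programmatically. The sketch below only builds the HTTP request for the VirusTotal API v3 intelligence search endpoint and does not send it. The helper name and the placeholder API key are assumptions, and running the query for real requires a key with VirusTotal Intelligence access.

```python
import urllib.parse
import urllib.request

SEARCH_URL = "https://www.virustotal.com/api/v3/intelligence/search"


def build_search_request(query: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for an Intelligence search."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        headers={"x-apikey": api_key, "Accept": "application/json"},
    )


# "<your-api-key>" is a placeholder, not a working key.
request = build_search_request('type:svg AND codeinsight:"Colombian"', "<your-api-key>")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the response format is not covered here.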

Diving into the source code of these SVGs, we found:

  • Code obfuscation techniques
  • Use of polymorphism, with slight changes in every file
  • And large amounts of dummy (garbage) code to increase entropy and evade static detection.

But Code Insight had no problem cutting through the noise.

One thing stood out: the attackers left Spanish-language comments in their scripts, with phrases like "POLIFORMISMO_MASIVO_SEGURO" and "Funciones dummy MASIVAS". While most of the code changed from sample to sample, those comments stayed exactly the same, a clear weakness, and a perfect signature for a simple YARA rule.


So we wrote a very basic one:


Running a retrohunt over the last year with this basic rule returned 523 matches.
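The matching logic behind such a signature is plain substring search. The following Python sketch illustrates it with the two comment strings quoted above. It is not the YARA rule used for the retrohunt, and whether that rule required one string or both is not stated, so the `require_all` switch is an assumption.

```python
# Comment strings the attackers left unchanged across samples.
MARKERS = (b"POLIFORMISMO_MASIVO_SEGURO", b"Funciones dummy MASIVAS")


def matches_campaign(data: bytes, require_all: bool = False) -> bool:
    """Return True if the file content contains the campaign's comment strings."""
    hits = [marker in data for marker in MARKERS]
    return all(hits) if require_all else any(hits)


# Hypothetical file content, for illustration only.
sample = b"<svg><script>// POLIFORMISMO_MASIVO_SEGURO\nvar a = 1;</script></svg>"
is_match = matches_campaign(sample)
```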


Sorting by submission time, the first sample dates back to August 14, 2025, also submitted from Colombia, and also with 0 antivirus detections at the time.


We reanalyzed that first sample with the current version of Code Insight, and again, it produced an accurate description of the phishing page and malware dropper, impersonating the Colombian Fiscalía General de la Nación.

Looking deeper, we saw that the earliest samples were larger, around 25MB, and the size decreased over time, suggesting the attackers were evolving their payloads. Most importantly, the distribution vector was email, allowing us to pivot into delivery metadata: senders, subjects, attachment names, and more.


Final thoughts

SWF and SVG are very different formats from very different eras, but both can still cause headaches for analysts.

In the first case, Code Insight helped explain why a SWF file looked suspicious without actually being malicious. In the second, it uncovered malicious behavior in an SVG that had gone completely undetected.

This is where Code Insight helps most: giving context, saving time, and helping focus on what really matters. It’s not magic, and it won’t replace expert analysis, but it’s one more tool to cut through the noise and get to the point faster. And when Code Insight and VirusTotal Intelligence work together, one suspicious sample can become the key to revealing an entire campaign.

Thursday, August 28, 2025


Integrating Code Insight into Reverse Engineering Workflows

More than two years have passed since we announced the launch of Code Insight at RSA 2023. From that time on, we have been applying this technology in different scenarios, expanding its use in new file formats (12).

As we advance in the automated analysis of new files with Code Insight, we want to offer an alternative that enables the integration of this type of technology into the analysis of disassembled or decompiled code.



To that end, we have created a new endpoint that receives code requests and returns a description of its functionality, highlighting the most relevant aspects for malware analysts. This endpoint can be used to query code blocks, chaining previous analyses with modifications or corrections made by the analyst. This significantly reduces the reverse engineering workload by providing the analyst with an assistant that pre-analyzes functions deemed interesting, acquiring knowledge as the analysis proceeds.

This endpoint can be integrated into any reverse engineering tool that processes disassembled or decompiled code. As an implementation example, the VirusTotal plugin for IDA Pro has been updated to support its use from the IDA interface. This offers a simple way to integrate relevant analyses into a notebook, allowing the analyst to keep responses that play a direct role in understanding how the code works.

Endpoint for reversed code queries

Using this new endpoint is quite simple—just make a request to the API as shown in the following example:

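import base64
import requests

API_URL = 'https://www.virustotal.com'
endpoint = 'api/v3/codeinsights/analyse-binary'
API_KEY = '<your-api-key>'  # placeholder: replace with your own key

headers_apiv3 = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'x-apikey': API_KEY,
}

# The code to analyze is sent Base64-encoded (placeholder snippet).
code_base64 = base64.b64encode(b'push ebp\nmov ebp, esp').decode('ascii')

payload = {
    'code': code_base64,
    'code_type': 'disassembled',  # or 'decompiled'
}

response = requests.post(f'{API_URL}/{endpoint}',
                         json={'data': payload},
                         headers=headers_apiv3)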


This Python code sends a request to the endpoint at ‘https://www.virustotal.com/api/v3/codeinsights/analyse-binary’. The code to be analyzed goes in the ‘payload’ variable, which can also carry the history of previous analyses:

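# code_base64 and text are placeholders for Base64-encoded code and plain text.
payload = {
    'code': code_base64,
    'code_type': 'disassembled',  # or 'decompiled'
    # Previous requests and their (possibly analyst-reviewed) responses.
    'history': [
        {
            'request': code_base64,
            'response': {
                'summary': text,
                'description': text,
            },
        },
        {
            'request': code_base64,
            'response': {
                'summary': text,
                'description': text,
            },
        },
    ],
}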


The request is divided into two parts: the first includes the code being analyzed (‘code’ and ‘code_type’), and the second includes previous requests—potentially reviewed by the analyst—that provide context for analyzing the queried code.

The response contains a general description of what the submitted code snippet does ("summary") and a second text describing in more detail how that functionality is carried out ("description"). This lets the analyst quickly check whether the function contains any behavior of interest, and then either review the execution steps or discard the function as irrelevant.
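Building the request body by hand is error-prone, so a small helper can take care of the Base64 encoding and of validating ‘code_type’. The sketch below is illustrative only: the function name is hypothetical, UTF-8 is assumed as the text encoding before Base64, and leaving out ‘history’ when it is empty is an assumption based on the first example, which omits it.

```python
import base64

VALID_CODE_TYPES = ("disassembled", "decompiled")


def build_request_body(code: str, code_type: str, history=None) -> dict:
    """Build the JSON body for the endpoint from plain-text code."""
    if code_type not in VALID_CODE_TYPES:
        raise ValueError(f"code_type must be one of {VALID_CODE_TYPES}")
    payload = {
        "code": base64.b64encode(code.encode("utf-8")).decode("ascii"),
        "code_type": code_type,
    }
    if history:  # previous request/response pairs, already in API layout
        payload["history"] = list(history)
    return {"data": payload}


body = build_request_body("push ebp\nmov ebp, esp\nret", "disassembled")
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, as in the request example above. The exact layout of the JSON response is not described here, so the sketch stops at building the request.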

New version of the VT-IDA Plugin for IDA Pro

Along with this new endpoint, we have updated the VirusTotal plugin to show how this new functionality can be integrated into the analyst's workflow.

This new functionality can be used as follows:
  1. The analyst selects a function from the disassembled or decompiled code to be analyzed.
  2. If the response provided by the endpoint is satisfactory and reveals an interesting function, they can click ‘Accept’ to include it in a list of selected functions, which we call the ‘CodeInsight Notebook’. They can also make modifications to the ‘Summary’ and ‘Description’ fields to correct errors or add information that helps put the code in context.
  3. With each new request sent to the endpoint, all previously stored functions are included—along with any modifications made by the analyst. This allows for more accurate analyses based on previously obtained and reviewed results.
Here’s how the new version of the plugin would look after a few iterations on a malware sample:



A practical example

Let's illustrate the benefits of the new plugin with a practical example. Imagine an analyst needs to analyze a malicious binary file to understand its function. This is typically a time-consuming and complex process, but with the help of Code Insight, their workflow becomes significantly more efficient:

  1. Targeted Analysis: The analyst selects a code block they suspect might be malicious and uses the endpoint to get an automated analysis.

    The code shown below implements an anti-disassembly technique aimed at generating disassembled code that hides malicious functionality through a hidden jump to a memory address. Essentially, the resulting disassembled code is unreliable, as it doesn’t accurately represent the code that will actually be executed.



  2. Review and Refinement: At this point, a request is made to obtain an initial analysis of the code. The analyst reviews the response and can modify both the ‘Summary’ and ‘Description’ fields with their own notes or corrections.

    In this case, the obtained code analysis correctly identifies an anti-disassembly technique that modifies the return address. However, it does not provide information about a possible return address that would help the analyst locate the hidden code.

    At this point, the analyst can modify the output provided by the endpoint to explain how this technique works. This way, the acquired knowledge can be used in the analysis of other code blocks within the sample. To do so, the analyst simply needs to include the (reviewed) analysis in the list of analyzed functions by clicking the ‘Accept’ button.



  3. Iterative Analysis and Improved Results: The file analysis continues in such a way that, with each new request, the list of analyzed functions is sent, effectively representing the knowledge acquired from analyzing the code selected by the analyst.


And as shown in the previous image, this knowledge is used in other function queries that employ a technique similar to the one previously discussed—this time providing more details about how it works and alerting the analyst to the possibility of jumping to an address containing hidden code.
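The accept, refine and resend loop described in this example can be modeled with a small data structure. The classes below are a hypothetical sketch of that workflow, not the plugin's actual code. The entries mirror the ‘history’ layout of the request payload shown earlier, and the strings are made-up placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    request: str      # Base64-encoded code that was queried
    summary: str      # short description, possibly edited by the analyst
    description: str  # detailed description, possibly edited by the analyst


@dataclass
class Notebook:
    entries: list = field(default_factory=list)

    def accept(self, request, response, summary=None, description=None):
        """Store a reviewed analysis; analyst edits override the endpoint's text."""
        self.entries.append(Analysis(
            request=request,
            summary=summary if summary is not None else response["summary"],
            description=description if description is not None else response["description"],
        ))

    def history(self):
        """Return the accepted analyses in the 'history' layout of the payload."""
        return [
            {"request": e.request,
             "response": {"summary": e.summary, "description": e.description}}
            for e in self.entries
        ]


notebook = Notebook()
endpoint_answer = {"summary": "Modifies its own return address.",
                   "description": "Anti-disassembly trick; target not identified."}
# The analyst corrects the description before accepting the analysis.
notebook.accept("<base64 of function A>", endpoint_answer,
                description="Anti-disassembly trick; the overwritten return address points to hidden code.")
next_payload = {"code": "<base64 of function B>", "code_type": "disassembled",
                "history": notebook.history()}
```

Every later query then carries the analyst's corrected description, which is what lets a similar function be explained in more detail the second time around.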

Quick Tips

The endpoint offers some interesting features for the analyst. For example, as shown in the following figure, the presence of strings written in languages other than English has been detected, providing a translation and pinpointing their location in memory.



On the other hand, while analyzing assembly code has its own pros and cons compared to decompiled code, we can gain additional benefits by analyzing a decompiled function whose disassembled code has been previously analyzed and stored in the CodeInsight Notebook.

For example, let's look at the decompiled code of a function previously analyzed in its disassembled version:


The image below illustrates how analyzing a decompiled function becomes richer with the help of the previously stored analysis of its disassembled code. This happens because certain features, like text strings, are visible in the disassembled code but often missing from the decompiled version.

As a result, Code Insight can provide a more concise and direct explanation by leveraging the decompiled view, which is supported by the disassembled code.



It is important to highlight that both the endpoint and this new feature of the plugin for IDA Pro are offered in trial mode, with the aim of involving the community in the progress we are making in its application to the field of reverse engineering. Although the results produced by this new functionality have been very positive during the testing phase, it is possible that the output generated by the endpoint may not be 100% accurate and could contain errors or omit some relevant details of the analysis.

We are confident that this new integration will be a great help to analysts who are gradually incorporating LLM capabilities into their workflow. As we continue to harness the power of AI, your feedback is incredibly valuable to us. Stay connected for future updates, and thank you for your continued support.


Monday, August 25, 2025

Applying AI Analysis to PDF Threats

In our previous post we extended VirusTotal Code Insight to browser extensions and supply-chain artifacts. A key finding from that analysis was how our AI could apply contextual knowledge to its evaluation. It wasn’t just analyzing code in isolation; it was correlating a package's stated purpose (its name and description) with its actual behavior, flagging malicious logic that contradicted its public description. We’re now applying the same idea to one of the most common file formats in the world, the PDF.



PDFs are multi-layered. There’s the object tree (catalog, pages, objects, streams, actions, embedded files) and there’s the visible layer (text/images the user reads). Code Insight analyzes both, then correlates: do the document's content, claims, and branding make sense given its internal behaviors? That lets us surface not only classic PDF exploitation (e.g., auto-actions, JS, external launches) but also pure social engineering (phishing, vishing, QR-lures) even when the file has no executable logic. This dual approach allows the AI not only to detect malicious code but also to identify sophisticated scams.
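On the structural side, a first triage pass often just looks for the dictionary keys that make a PDF "active". The sketch below counts a few such keys in the raw bytes. It is a naive illustration and not how Code Insight inspects the object tree: the key selection is an assumption, and names that are hex-escaped or stored inside compressed object streams will be missed.

```python
import re

# PDF name keys commonly associated with active content.
RISKY_KEYS = {
    b"/OpenAction": "action run when the document is opened",
    b"/AA": "additional, event-triggered actions",
    b"/JavaScript": "JavaScript action or name tree",
    b"/JS": "JavaScript source",
    b"/Launch": "launch of an external application or file",
    b"/EmbeddedFile": "embedded file stream",
}


def find_risky_keys(pdf_bytes: bytes) -> dict:
    """Count occurrences of risky name keys in raw, uncompressed PDF bytes."""
    counts = {}
    for key in RISKY_KEYS:
        # Negative lookahead so that /AA does not match a longer name such as /AAPL.
        pattern = re.escape(key) + rb"(?![A-Za-z0-9])"
        hits = len(re.findall(pattern, pdf_bytes))
        if hits:
            counts[key.decode("ascii")] = hits
    return counts


# Tiny hand-written fragment, for illustration only.
sample = (b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
          b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n")
result = find_risky_keys(sample)
```

A hit only means the feature is present. The correlation with the visible content is what separates, say, a form with validation scripts from a lure.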

Let's look at real-world samples surfaced by Code Insight during its initial testing phase. We'll start with cases where the PDF contains no malicious code, which traditional engines often miss because there's no executable payload to detect. This is where Code Insight proves useful, identifying clear signs of fraud and social engineering that aim to manipulate the user, not the machine.


Case 1 - Fake debt collection targeting financial fraud

This PDF is a real-world sample sent to VirusTotal and captured by Code Insight during early testing. It was flagged as malicious based entirely on its visible content, without relying on any embedded code or execution logic. The file was marked as clean by all other engines, likely because it contains no scripts, exploits, or embedded payloads.

d92a1a7460c580f8bf6af3cbd39c7840cfe6a146ee15ede8e23c50c2a85becb9

The document pretends to be a debt collection notice from a German agency acting on behalf of Amazon. It includes a formal layout, legal threats, payment instructions, and multiple references to German addresses and regulations. Visually, it looks legitimate.


However, the AI flagged it as fraudulent based on several critical inconsistencies, the most important one being the destination bank account. The payment is requested to an IBAN starting with BG, indicating a Bulgarian account. This contradicts the sender's claimed German identity and would be highly unusual for a legitimate German debt agency. This mismatch alone was enough for Code Insight to classify the file as fraudulent. Additional content cues (urgent tone, fee breakdown, legal pressure) support the assessment.

As described in the Code Insight analysis:

“The visual and textual content confirms the document is a sophisticated phishing attack. It masquerades as an urgent payment demand from a German debt collection agency, supposedly on behalf of Amazon. The document employs high-pressure tactics, including threats of legal action, additional fees, and credit score damage, to compel the recipient to act quickly. The primary and most conclusive indicator of fraud is the demand for payment to a Bulgarian bank account, which is a stark and highly irregular contradiction to the agency's purported German location and registration.”

This is a case where AI adds value by reasoning over the semantics of the content, not the file structure.
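The IBAN inconsistency is easy to express as a rule once the text has been extracted: the first two letters of an IBAN are the ISO country code of the account. The sketch below is a simplified illustration of that single check and does not reproduce how the AI reached its verdict. The regular expression is deliberately loose, no checksum is verified, and the IBAN in the example is a generic published example value, not the one from the sample.

```python
import re

# Country code, two check digits, then 11 to 30 alphanumerics (spaces allowed).
IBAN_RE = re.compile(r"\b([A-Z]{2})\d{2}(?: ?[A-Z0-9]){11,30}\b")


def mismatched_ibans(text: str, claimed_country: str) -> list:
    """Return IBANs whose country code differs from the sender's claimed country."""
    return [
        match.group(0)
        for match in IBAN_RE.finditer(text)
        if match.group(1) != claimed_country
    ]


notice = "Please transfer the amount to IBAN: BG80 BNBG 9661 1020 3456 78."
suspicious = mismatched_ibans(notice, claimed_country="DE")
```

On its own such a rule would be noisy, since cross-border accounts are not always fraudulent. Here it is the combination with the claimed German identity and the pressure tactics that makes it conclusive.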


Case 2 - QR-based phishing (quishing) campaign

This is another real-world PDF captured during early testing of Code Insight. At the time of analysis, no antivirus or malware detection engines flagged the file as malicious. The PDF has no embedded scripts, exploits, or execution logic. From a technical perspective, it looks benign.

259e202847d04866acd76427f53bfd9a15372ed6ed56a9e54ba1c62442c945ee

The visible content, however, impersonates an HR notification about a salary increase. It includes multiple social engineering red flags: awkward grammar, lack of personalization, and an irrelevant privacy disclaimer. The only call to action is a QR code, encouraging the recipient to scan it for more details.


Code Insight analyzed and decoded the QR code, extracting the hidden URL. The domain is non-corporate and clearly unrelated to HR or payroll systems. The combination of deceptive HR messaging with a QR code that conceals a phishing URL confirms the document is a credential harvesting fraud delivered via PDF.
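Once the URL has been decoded from the QR code, the "non-corporate domain" observation boils down to a hostname comparison. The sketch below assumes the URL is already available as a string (decoding the QR image itself needs an image library and is left out) and uses hypothetical domains throughout.

```python
from urllib.parse import urlsplit


def is_expected_host(url: str, expected_domains: set) -> bool:
    """Return True if the URL's host is one of the expected domains or a subdomain."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(host == domain or host.endswith("." + domain)
               for domain in expected_domains)


corporate = {"example.com"}  # hypothetical corporate domain
legit = is_expected_host("https://hr.example.com/salary", corporate)
lookalike = is_expected_host("https://example.com.login.test/salary", corporate)
```

Matching on the full host or a dot-separated suffix avoids being fooled by lookalikes that merely contain the corporate name.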


Case 3 - Vishing via fake PayPal alert

This is another real-world PDF flagged by Code Insight during early evaluation. No antivirus or malware detection engines classified the file as malicious. Structurally, it’s simple and inert: there are no scripts, automatic actions, or embedded links. Minor stream decoding errors are present but considered low-risk anomalies.

d0bedc70085efff5218b901cdaba95d565df867495181544041ba4b8a6019cea


The threat lies entirely in the content. The document impersonates PayPal and trusted brands like Visa to deliver a fake security alert about a high-value unauthorized purchase. The language is urgent and designed to induce panic.

According to Code Insight:

“[...]the visual content of the document is a clear social engineering lure designed for a voice phishing (vishing) attack. [...] The document's sole purpose is to persuade the user to call a specific phone number under the pretense of canceling the fraudulent order. The malicious nature is confirmed by several red flags, including an awkwardly phrased greeting and a phone number with a geographic area code (808) that is deceptively labeled as "Toll-Free." This tactic aims to route the victim to a scammer for social engineering and potential fraud.”
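The "Toll-Free" red flag can be checked mechanically: North American toll-free numbers use a small set of dedicated area codes, and 808 (Hawaii) is not one of them. The sketch below illustrates that single check. The function names are hypothetical and the phone numbers are fictional examples, not the one in the sample.

```python
import re

# Area codes reserved for toll-free service in the North American Numbering Plan.
TOLL_FREE_CODES = {"800", "833", "844", "855", "866", "877", "888"}


def area_code(number: str):
    """Extract the 3-digit area code from a North American number, or None."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits[:3] if len(digits) == 10 else None


def is_mislabeled_toll_free(number: str, label: str) -> bool:
    """True when a number is labeled toll-free but has a geographic area code."""
    code = area_code(number)
    return (code is not None
            and "toll-free" in label.lower()
            and code not in TOLL_FREE_CODES)


flagged = is_mislabeled_toll_free("+1 (808) 555-0100", "Toll-Free")
```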


Case 4 - Fake Tax Refund from the Australian Taxation Office

As with previous cases, this PDF wasn’t flagged by any antivirus engine in VirusTotal, but Code Insight identified it as a phishing lure that impersonates the Australian Taxation Office.

b9b763e4b091bc59e9b9f355617622dbabdc1ff2de6707a94ccb26aa7682300e


As described by Code Insight:

“This document is a phishing lure designed to impersonate the Australian Taxation Office (ATO). The visual layer uses an authentic-looking government logo and the promise of a tax refund to entice the recipient into clicking an "Access Document" button. The purpose is to have the user provide an electronic signature for a supposed refund authorization, creating a sense of urgency and financial incentive. The document exhibits multiple red flags common to phishing attacks. These include a generic greeting, a suspicious reference to a .doc file (a common malware vector), instructions that discourage direct replies, and a complete lack of legitimate contact information or alternative methods for verification. The entire premise relies on tricking the user into clicking the button, which likely leads to a malicious website for credential theft or malware download.”


Case 5 - Auto-executing PDF Posing as a Movie Download

Unlike previous examples, this PDF was flagged by 13 antivirus engines in VirusTotal. In this case, the attack is embedded both in the internal structure of the file and its visual appearance. Code Insight correlates these two layers, the technical and the social, to expose the malicious intent.

44e653fe79d1ab160c784c06f4d99def6419e379ef3f802af9f48d595976d2c7


As described by Code Insight:

“The document presents a social engineering lure, masquerading as a download page for pirated movies […] to entice users into clicking links. This theme, centered on illegal content distribution, is a common tactic for malware delivery. Technical analysis of the PDF's internal structure corroborates the malicious intent. The file is configured with an /OpenAction command, a high-risk feature designed to automatically execute an action upon the document being opened […] The combination of a deceptive, high-risk theme with an automatic execution function indicates that the document’s purpose is to compromise the user's system.”

We are actively improving Code Insight based on what we learn from these early cases. PDF is the 6th most common file type submitted to VirusTotal, with around 100,000 new samples uploaded every day. That volume requires us to be strategic: for now, only a selected percentage of PDF files submitted via the public web interface are processed by Code Insight, as we test, tune, and scale the system.

These first results are helping us refine both effectiveness and performance. We’ll continue expanding coverage as we improve detection of threats.