Last week, we were fortunate enough to attend the fantastic LABScon conference, organized by the SentinelOne Labs team. While there, we presented a workshop titled 'Advanced Threat Hunting: Automating Large-Scale Operations with LLMs.' The main goal of this workshop was to show attendees how they could automate their research using the VirusTotal API and Gemini. Specifically, we demonstrated how to integrate the power of Google Colab to quickly and efficiently generate Jupyter notebooks using natural language.
LLMs have become an indispensable part of every analyst's toolkit. For this reason, we also want to make life easier for everyone who uses the VirusTotal API for research.
The Power of the VirusTotal API and vt-py
The VirusTotal API is the programmatic gateway to our massive repository of threat intelligence data. While the VirusTotal GUI is great for agile querying, the API unlocks the ability to conduct large-scale, automated investigations and access raw data with more pivoting opportunities.
To make interacting with the API even easier, we recommend using the vt-py library. It simplifies much of the complexity of HTTP requests, JSON parsing, and rate limit management, making it the go-to choice for Python users.
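As a quick, minimal sketch: the cli variable used by the code snippets throughout this post refers to a vt-py client created along these lines (the API key placeholder is yours to fill in; the hash below belongs to the EICAR test file):

import vt

# Create the vt-py client; the snippets in this post assume it is
# available as "cli". Replace the placeholder with your own API key.
cli = vt.Client("<YOUR_API_KEY>")

# Quick sanity check: fetch the report for the EICAR test file by hash.
# get_object returns a vt.Object whose attributes mirror the API's JSON.
eicar = cli.get_object("/files/44d88612fea8a8f36de82e1278abb02f")
print(eicar.last_analysis_stats)

# Keep the client open for the rest of the notebook; call cli.close()
# (or use the client as a context manager) when you are done.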
From Natural Language to Actionable Intelligence with Gemini
To bridge the gap between human questions and API queries, we can leverage Gemini, which is integrated into Google Colab. We have created a "meta Colab" notebook, pre-populated with real, working code snippets for interacting with the VirusTotal API to retrieve different kinds of information, such as campaigns, threat actors, malware families, samples, and URLs (we will share this notebook soon). This provides Gemini with the necessary context to understand your natural language requests and generate accurate Python code to query the VirusTotal API. Gemini doesn't call the API directly; it creates the code snippet for you to execute.
For Gemini to generate accurate and relevant code, it needs context. Our meta Colab notebook is filled with examples that act as a guide. For complex questions, it also helps to provide the exact field names you want to work with. This context generally falls into two categories:
- Reference Documentation: We include detailed documentation directly in the Colab. For example, we provide a comprehensive list of all available file search modifiers for the VirusTotal Intelligence search endpoint. This gives Gemini the "vocabulary" it needs to construct precise queries.
- Working Code Examples: The notebook is pre-populated with dozens of working vt-py code snippets for common tasks like retrieving file information, performing an intelligence search, or getting relationships. This gives Gemini the "grammar" and correct patterns for interacting with our API.
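For illustration, an abbreviated and simplified excerpt of that modifier reference could look like the following (all five of these modifiers appear in the queries later in this post):

- fs: first submission date (e.g., fs:2025-01-01+)
- p: number of antivirus detections, i.e. positives (e.g., p:5+)
- type: file type (e.g., type:lnk)
- tag: tags assigned to the file (e.g., tag:overlay)
- have: presence of a given field or artifact (e.g., have:yara)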
Here is an example of a code snippet included as context in our meta Colab:
query_results_with_behaviors = []
query = "have:sigma have:yara have:ids have:malware_config fs:1d+ have:bundled_file tag:overlay"
RELATIONS = "behaviours"
async for itemobj in cli.iterator('/intelligence/search',
                                  params={'query': query,
                                          'relationships': RELATIONS,
                                          'relationship_attributes[%s]' % (RELATIONS): '*'},
                                  limit=0):
    query_results_with_behaviors.append(itemobj.to_dict())
res_with_behaviors = list(query_results_with_behaviors)
Practical Example: Hunting for Malicious LNK Files
One of the exercises in our workshop focused on analyzing malicious LNK files submitted to VirusTotal within a specific time window (to keep the result set manageable). The goal was to discover common execution techniques and interesting command-line patterns.
Here’s how you could approach this with natural language queries in our meta Colab:
1. Retrieve all LNK file objects matching the criteria.
You can start by asking Gemini:
Create a code snippet to retrieve all LNK files submitted from January 1st, 2025, until April 1st, 2025, with 5 or more positives.
Code generated:
query_results = []

# Query for LNK files submitted between January 1st, 2025, and April 1st, 2025,
# with 5 or more positives.
query = "fs:2025-01-01+ fs:2025-04-01- p:5+ type:lnk"

async for itemobj in cli.iterator('/intelligence/search', params={'query': "%s" % (query)}, limit=0):
    query_results.append(itemobj.to_dict())

lnk_files = list(query_results)
print(f"Found {len(lnk_files)} LNK files matching the criteria.")
2. Flatten the data and visualize it.
Next, you can ask Gemini to process the retrieved data and create visualizations:
Flatten the data in a dataframe and create a bar chart depicting the most common TargetFileDOSName values.
Code generated:
import pandas as pd
import plotly.express as px

# Extract the 'attributes' dictionary from each object in our results
attributes_list = [result.get('attributes', {}) for result in lnk_files]

# Use json_normalize to create a flat DataFrame
df_flat = pd.json_normalize(attributes_list)

# Check if 'exiftool.TargetFileDOSName' column exists and is not empty
if 'exiftool.TargetFileDOSName' in df_flat.columns and not df_flat['exiftool.TargetFileDOSName'].empty:
    # Count the occurrences of each TargetFileDOSName
    target_file_counts = df_flat['exiftool.TargetFileDOSName'].value_counts().reset_index()
    target_file_counts.columns = ['TargetFileDOSName', 'Count']

    # Get the top 10 most common TargetFileDOSName values for visualization
    top_target_files = target_file_counts.head(10)

    # Create an interactive bar chart
    fig = px.bar(top_target_files,
                 x='Count',
                 y='TargetFileDOSName',
                 orientation='h',
                 title='Top 10 Most Common TargetFileDOSName in LNK files',
                 labels={'Count': 'Number of Occurrences', 'TargetFileDOSName': 'Target File DOS Name'})
    fig.show()
else:
    print("The 'exiftool.TargetFileDOSName' column was not found or is empty in the flattened data.")
    print("DataFrame shape:", df_flat.shape)
    df_flat.head()
This will generate a code snippet using pandas and Plotly to give you a clear view of the most frequent target files in the malicious LNKs.
3. Visualize relationships with a Sankey diagram.
To dig deeper and understand the relationships between the infected machines and the malware, you can ask for a more advanced visualization:
Create a Sankey diagram connecting MachineID (source) to TargetFileDOSName (target), with link widths representing the number of occurrences. Include only MachineID values with 10 or more occurrences, grouping the rest as "Other", and make the visualization interactive with hover details.
Code generated:
import plotly.graph_objects as go
import pandas as pd

# Ensure df_flat is available from previous steps.
# If not, you might need to re-run the cell that creates df_flat.

# Check if the necessary columns exist in df_flat
if 'lnk_info.machine_id' in df_flat.columns and 'exiftool.TargetFileDOSName' in df_flat.columns:
    # Calculate the frequency of each machine ID and target file name pair
    pair_frequencies = df_flat.groupby(['lnk_info.machine_id', 'exiftool.TargetFileDOSName']).size().reset_index(name='frequency')

    # Identify MachineIDs with less than 10 occurrences
    machine_id_counts = pair_frequencies.groupby('lnk_info.machine_id')['frequency'].sum().reset_index(name='count')
    other_machine_ids = machine_id_counts[machine_id_counts['count'] < 10]['lnk_info.machine_id'].tolist()

    # Replace infrequent MachineIDs with 'Other'
    pair_frequencies['source_node'] = pair_frequencies['lnk_info.machine_id'].apply(lambda x: 'Other' if x in other_machine_ids else x)

    # Aggregate frequencies for the 'Other' node and other valid nodes
    sankey_data = pair_frequencies.groupby(['source_node', 'exiftool.TargetFileDOSName'])['frequency'].sum().reset_index()

    # Create unique labels for the Sankey nodes
    all_nodes = pd.concat([sankey_data['source_node'], sankey_data['exiftool.TargetFileDOSName']]).unique()
    label_to_index = {label: i for i, label in enumerate(all_nodes)}

    # Create the Sankey diagram
    fig = go.Figure(data=[go.Sankey(
        node=dict(
            pad=15,
            thickness=20,
            line=dict(color="black", width=0.5),
            label=all_nodes,
            color="blue"
        ),
        link=dict(
            source=sankey_data['source_node'].map(label_to_index),
            target=sankey_data['exiftool.TargetFileDOSName'].map(label_to_index),
            value=sankey_data['frequency'],
            label=sankey_data.apply(lambda row: f"{row['source_node']} -> {row['exiftool.TargetFileDOSName']}: {row['frequency']}", axis=1)
        )
    )])

    fig.update_layout(title_text="MachineID to TargetFileDOSName Sankey Diagram (Infrequent MachineIDs grouped as 'Other')",
                      font_size=10)
    fig.show()
else:
    print("Required columns 'lnk_info.machine_id' or 'exiftool.TargetFileDOSName' not found in the DataFrame.")
This will create a powerful visualization that shows how different machines are being targeted with specific files, allowing us to identify patterns in the MachineID metadata.
4. Analyze the behavior of the LNK files.
Finally, you can investigate the behavior of the LNK files themselves, for example, by looking at how they are executed:
Create a pie chart depicting the distribution of window behaviors for LNK target processes (hidden, minimized,...).
Code generated:
import plotly.express as px
import pandas as pd

# Check if the relevant column exists
if 'lnk_info.header.show_window_str' in df_flat.columns and not df_flat['lnk_info.header.show_window_str'].empty:
    # Count the occurrences of each window behavior string
    window_behavior_counts = df_flat['lnk_info.header.show_window_str'].value_counts().reset_index()
    window_behavior_counts.columns = ['Window Behavior', 'Count']

    # Create an interactive pie chart
    fig = px.pie(window_behavior_counts,
                 names='Window Behavior',
                 values='Count',
                 title='Distribution of LNK Target Process Window Behaviors',
                 hover_data={'Count': True})
    fig.show()
else:
    print("The 'lnk_info.header.show_window_str' column was not found or is empty in the flattened data. Cannot create the pie chart.")
This allows you to quickly identify whether the LNK files are trying to hide their execution, a common malware tactic.
5. Geographic Threat Mapping of Submissions
Understanding the geographical distribution of malware submissions can provide crucial insight into whether a campaign is widespread or targets a specific region. By analyzing the submission metadata for our set of LNK files, an analyst can quickly identify potential hotspots, which could indicate a geographically focused attack. This is achieved by fetching the submissions relationship for each file, extracting the country of origin for each submission, and plotting the data on a world map (note that the submission origin does not necessarily indicate a targeted country, although in some cases the two can be related).
Using the LNK files identified previously, create a code snippet that retrieves the submissions relationship for each file. Then, count the number of submissions from each country and display the results on a choropleth world map to visualize the submission hotspots.
In this case, Gemini uses step-by-step execution: it breaks the request into a set of consecutive code cells, each building on the previous one. The code generated was:
Step 1
query_results_with_submissions = []

# Define the search query string for LNK files
query = "fs:2025-01-01+ fs:2025-04-01- p:5+ type:lnk"

# Define the relationship to request
RELATIONS = "submissions"

# Define the relationship attributes to retrieve (all attributes)
ATTRIBUTES_RELATIONS = "*"

# Use cli.iterator to perform an intelligence search with the specified
# query, relationships, and relationship attributes.
async for itemobj in cli.iterator('/intelligence/search',
                                  params={'query': query,
                                          'relationships': RELATIONS,
                                          'relationship_attributes[%s]' % (RELATIONS): ATTRIBUTES_RELATIONS},
                                  limit=0):
    query_results_with_submissions.append(itemobj.to_dict())

# Convert query_results_with_submissions to a list called lnk_files_with_submissions
lnk_files_with_submissions = list(query_results_with_submissions)

# Print the number of LNK files found with submission data.
print(f"Found {len(lnk_files_with_submissions)} LNK files with submission data matching the criteria.")
Step 2
import pandas as pd

# Initialize an empty list to store submission countries
submission_countries = []

# Iterate through each item (LNK file) in the lnk_files_with_submissions list
for item in lnk_files_with_submissions:
    # Check if 'relationships' and 'submissions' exist and are not empty
    if 'relationships' in item and 'submissions' in item['relationships'] and item['relationships']['submissions']['data']:
        # Iterate through each submission object
        for submission in item['relationships']['submissions']['data']:
            # Extract the 'country' from the submission attributes and append it to the list
            if 'country' in submission['attributes']:
                submission_countries.append(submission['attributes']['country'])

# Convert the submission_countries list into a pandas Series
submission_countries_series = pd.Series(submission_countries)

# Count the occurrences of each country and store them in a DataFrame
country_counts = submission_countries_series.value_counts().reset_index()
country_counts.columns = ['country', 'count']

print(f"Total unique countries found: {len(country_counts)}")
print("Top 5 submission countries:")
print(country_counts.head())
Step 3
!pip install pycountry

import plotly.express as px
import pandas as pd
import pycountry

# Function to get the ISO-3 code from a country name
def get_iso3(country_name):
    try:
        return pycountry.countries.search_fuzzy(country_name)[0].alpha_3
    except:
        return None

# Apply the function to get ISO-3 codes
country_counts['iso_alpha'] = country_counts['country'].apply(get_iso3)

# Create a world heatmap using Plotly Express with ISO-3 codes
fig = px.choropleth(country_counts.dropna(subset=['iso_alpha']),  # Drop rows where no ISO code was found
                    locations="iso_alpha",   # Use ISO-3 codes for locations
                    locationmode='ISO-3',    # Specify ISO-3 location mode
                    color="count",
                    hover_name="country",
                    color_continuous_scale="Viridis",
                    title="Submission Country Heatmap")

# Update the layout to increase the figure size
fig.update_layout(
    autosize=False,
    width=1000,  # Adjust width as needed
    height=600   # Adjust height as needed
)

fig.show()
More Examples
Here are some other examples of how you can use natural language to query the VirusTotal API:
The next prompt is a good example of how Google Colab runs the step-by-step process: because we ask several questions in a single prompt, Gemini splits the work into separate steps.
Retrieve all the samples from the /intelligence/search endpoint with the following query: fs:2025-08-01+ fs:2025-08-14- p:5+ tag:crx have:codeinsight. Get the bundled_files and create a new dataframe with them too, but store only those having 5 or more malicious value. Then identify the top 5 magika of those bundled files obtained
Code generated:
Step 1
query_results_with_bundled_files = []

# Define the search query string
query = "fs:2025-08-01+ fs:2025-08-14- p:5+ tag:crx have:codeinsight"

# Define the relationship to request
RELATIONS = "bundled_files"

# Define the relationship attributes to retrieve (all attributes)
ATTRIBUTES_RELATIONS = "*"

# Use cli.iterator to perform an intelligence search with the specified
# query, relationships, and relationship attributes.
async for itemobj in cli.iterator('/intelligence/search',
                                  params={'query': query,
                                          'relationships': RELATIONS,
                                          'relationship_attributes[%s]' % (RELATIONS): ATTRIBUTES_RELATIONS},
                                  limit=0):
    query_results_with_bundled_files.append(itemobj.to_dict())

# Convert query_results_with_bundled_files to a list called crx_files_with_bundled_files
crx_files_with_bundled_files = list(query_results_with_bundled_files)

# Print the number of CRX files found with bundled file data.
print(f"Found {len(crx_files_with_bundled_files)} CRX files with bundled file data matching the criteria.")
Step 2
import pandas as pd

# List to store all bundled file attributes
all_bundled_files_data = []

# Iterate through each CRX file with bundled file data
for crx_file in crx_files_with_bundled_files:
    if 'relationships' in crx_file and 'bundled_files' in crx_file['relationships']:
        for bundled_file in crx_file['relationships']['bundled_files']['data']:
            all_bundled_files_data.append(bundled_file.get('attributes', {}))

# Create a DataFrame from the extracted bundled file attributes
df_bundled_files = pd.json_normalize(all_bundled_files_data)

# Filter for bundled files with 5 or more malicious detections
df_malicious_bundled_files = df_bundled_files[df_bundled_files['last_analysis_stats.malicious'] >= 5]

print(f"Total bundled files extracted: {len(df_bundled_files)}")
print(f"Bundled files with 5 or more malicious detections: {len(df_malicious_bundled_files)}")
df_malicious_bundled_files.head()
Step 3
import pandas as pd
import plotly.express as px

# Check if the 'magika' column exists and is not empty in the filtered DataFrame
if 'magika' in df_malicious_bundled_files.columns and not df_malicious_bundled_files['magika'].empty:
    # Count the occurrences of each magika type
    magika_counts = df_malicious_bundled_files['magika'].value_counts().reset_index()
    magika_counts.columns = ['Magika Type', 'Count']

    # Get the top 5 most frequent magika types
    top_5_magika = magika_counts.head(5)

    print("Top 5 Magika Types in malicious bundled files:")
    print(top_5_magika)

    # Visualize the top 5 magika types
    fig = px.bar(top_5_magika,
                 x='Count',
                 y='Magika Type',
                 orientation='h',
                 title='Top 5 Magika Types in Malicious Bundled Files',
                 labels={'Count': 'Number of Occurrences', 'Magika Type': 'Magika Type'})
    fig.update_layout(yaxis={'categoryorder': 'total ascending'})  # Order bars by count
    fig.show()
else:
    print("The 'magika' column was not found or is empty in the filtered malicious bundled files DataFrame. Cannot identify top magika types.")
- Retrieve threat actors (a sketch of the kind of code this prompt could yield appears after this list):
Retrieve threat actors targeting the United Kingdom with an espionage motivation. Sort the results in descending order of relevance. Display the total number of threat actors and their names.
- Investigate campaigns:
Retrieve information about threat actors and malware involved in campaigns targeting Pakistan. For each threat actor, retrieve its country of origin, motivations, and targeted industries. For each malware, retrieve its name.
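To give a flavor of what the generated code for the threat-actor prompt above might look like, here is a hypothetical sketch. Threat actors are served through the collections endpoint, but the filter facets used below (collection_type, targeted_region, motivation) and the relevance ordering are illustrative assumptions rather than verified field names; the meta Colab's reference documentation provides the exact vocabulary:

threat_actors = []

# Hypothetical filter string: the facet names are assumptions for
# illustration, not verified API field names.
filter_str = 'collection_type:threat-actor targeted_region:GB motivation:espionage'

# Iterate over matching collections, sorted by descending relevance
# (the 'order' value is likewise an assumption).
async for itemobj in cli.iterator('/collections',
                                  params={'filter': filter_str, 'order': 'relevance-'},
                                  limit=0):
    threat_actors.append(itemobj.to_dict())

print(f"Total threat actors: {len(threat_actors)}")
for actor in threat_actors:
    print(actor.get('attributes', {}).get('name'))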
What’s next
This workshop, co-authored with Aleksandar from Sentinel LABS, will be presented at future conferences to show the community how to get the most out of the VirusTotal API. We'll be updating the content of our meta Colab regularly and will share more information soon about how to get access to it.
In the meantime, if you have any feedback or ideas to contribute, we are open to suggestions.