Thursday, October 23, 2025

Hugging Face and VirusTotal: Building Trust in AI Models

We’re happy to announce a collaboration with Hugging Face, an open platform that fosters collaboration and transparency in AI, to make security insights more accessible to the community. VirusTotal’s analysis results are now integrated directly into the Hugging Face platform, helping users understand potential risks in model files, datasets, and related artifacts before they download them.

Security context where you need it

When you browse a file on Hugging Face, you’ll now see security information coming from different scanners, including VirusTotal results. In the example below, VirusTotal detects the file as unsafe and links directly to its public report for full details.

Addressing new challenges

As AI adoption grows, we see familiar threats taking new forms, from tampered model files and unsafe dependencies to data poisoning and hidden backdoors. These risks are part of the broader AI supply chain challenge, where compromised models, scripts, or datasets can silently affect downstream applications.

At VirusTotal, we’re also evolving to meet the challenges of this new landscape. We’re developing AI-driven analysis tools such as Code Insight, which uses LLMs to understand and explain code behavior, and we’re adding support for specialized tools for model/serialization formats, including picklescan, safepickle, and ModelScan, to help surface risky patterns and unsafe deserialization flows.

Our collaboration with Hugging Face strengthens this effort. By connecting VirusTotal’s analysis with Hugging Face’s AI Hub, we can expand our research into threats targeting AI models and share that visibility across the industry, helping everyone build better defenses, tools, and approaches to improve global security.

Stronger together

This collaboration is part of our ongoing mission to make security intelligence simpler and more accessible. It follows our recent effort to streamline VirusTotal access and API usage for the community, and now extends that same spirit into the AI space.

We believe that openness, collaboration, and shared knowledge are the best defenses against evolving threats. Hugging Face and VirusTotal share that vision: empowering researchers, developers, and defenders worldwide to build safely and openly.

Tuesday, October 21, 2025

, , , , ,

VTPRACTITIONERS{SEQRITE}: Tracking UNG0002, Silent Lynx and DragonClone

Introduction

One of the best parts of being at VirusTotal (VT) is seeing all the amazing ways our community uses our tools to hunt down threats. We love hearing about your successes, and we think the rest of the community would too.
That's why we're so excited to start a new blog series where we'll be sharing success stories from some of our customers. They'll be giving us a behind-the-scenes look at how they pivot from an initial clue to uncover entire campaigns.
To kick things off, we're thrilled to have our friends from SEQRITE join us. Their APT-Team is full of incredible threat hunters, and they've got a great story to share about how they've used VT to track some sophisticated actors.

How VT plays a role in hunting for analysts

For a threat analyst, the hunt often begins with a single, seemingly isolated clue—a suspicious file, a strange domain, or an odd IP address. The challenge is to connect that one piece of the puzzle to the larger picture. This is where VT truly shines.
VT is more than just a tool for checking if a file is malicious. It's a massive, living database of digital artifacts (process activity, registry key activity, memory dumps, LLM verdicts, among others) and their relationships. It allows analysts to pivot from one indicator of compromise to another, uncovering hidden connections and mapping out entire attack campaigns. It's this ability to connect the dots—to see how a piece of malware communicates with a C2 server, what other files are associated with it, what processes were launched or files were used to set persistence or exfiltrate information, and who else has seen it—that transforms a simple file check into a full-blown investigation. The following story from SEQRITE is a perfect example of this process in action.

Seqrite - Success Story

[In the words of SEQRITE…]
We at SEQRITE APT-Team perform a lot of activities, including threat hunting and threat intelligence, using customer telemetry and multiple other data corpuses. Without an iota of doubt, apart from our customer telemetry, the VT corpus has aided us a decent amount in converting our research, which includes hunting unique campaigns and multiple pivots that have led us to an interesting set of campaigns, ranging across multiple spheres of Asian geography, including Central, South, and East Asia.

UNG0002

SEQRITE APT-Team have been tracking a south-east asian threat entity, which was termed as UNG0002, using certain behavioral artefacts, such using similar OPSEC mistakes across multiple campaigns and using similar set of decoys and post-exploitation toolkit across multiple operational campaigns ranging from May 2024 to May 2025.
During the initial phase of this campaign, the threat actor performed multiple targets across Hong Kong and Pakistan against sectors involving defence, electrotechnical, medical science, academia and much more.
VT corpus has helped us to pivot through Cobalt Strike oriented beacons, which were used by this threat actor to target various sectors. In our hunt for malicious activity, we discovered a series of Cobalt Strike beacons. These were all delivered through similar ZIP files, which acted as lures. Each ZIP archive contained the same set of file types: a malicious executable, along with LNK, VBS, and PDF decoy files. The beacons themselves were also similar, sharing configurations, filenames and compilation timestamps.
Using the timestamps from the malicious executables and the filenames previously mentioned, we discovered up to 14 different samples, all of them related to the campaign with this query
VirusTotal query: metadata:"2015:07:10 03:27:31+00:00" filename:"imebroker.exe"

based on the configuration extracted by VT, we could use the public key extracted to identify more samples using exactly the same with the following query
malware_config:30819f300d06092a864886f70d010101050003818d003081890281810096cc4e6ad9aee91ca69b7b44465e17412626a11c7855b7a69daad00f48c0ea98f0e389a0a1c4b74332bf0d603a6e53e05ee734c9a289ff172204bfc9430ed4d6041402d02b526e902b95f6f219598cb1b6391403fa627ab36dbe88646620369e7ec89bdc31f1a2b0bedba1852d5e7656d3b297f9d39f357816f0677563bc496b020301000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Besides these executables, we mentioned that there were also LNK files within the ZIP files. After analyzing them, a consistent LNK-ID metadata revealed the same identifiers across many samples. Querying VT for those LNK-IDs exposed we could identify new files related to the campaign.
VirusTotal query: metadata:"laptop-g5qalv96"

Decoy documents identified within the ZIP files mentioned above

We initially tracked several campaigns leveraging LNK-based device IDs and Cobalt Strike beacons. However, an intriguing shift began to emerge in the September-October activity. We observed a new set of campaigns that frequently used CV-themed decoys, often impersonating students from prominent Chinese research institutions.
While the spear-phishing tactics remained similar, the final execution changed. The threat actors dropped their Cobalt Strike beacons and pivoted toward DLL-Sideloading for their payloads, all while keeping the same decoy theme. This significant change in technique led us to identify a second major wave of this activity, which we're officially labeling Operation AmberMist.
Tracking this second wave of operations attributed to the UNG0002 cluster, we observed a recurring behavioral artifact: the use of academia-themed lures targeting victims in China and Hong Kong.
Across these campaigns, multiple queries were leveraged, but a consistent pattern emerged—heavy reliance on LOLBINS such as wscript.exe, cscript.exe, and VBScripts for persistence.
By developing a simple yet effective hunting query, we were able to uncover a previously unseen sample not publicly reported:
type:zip AND (metadata:"lnk" AND metadata:".vbs" AND metadata:".pdf") and submitter:HK
VirusTotal query: type:zip AND (metadata:"lnk" AND metadata:".vbs" AND metadata:".pdf") and submitter:HK

Silent Lynx

Another campaign tracked by the SEQRITE APT-team, named Silent Lynx, targeted multiple sectors including banking. As in the previous described case, thanks to VT we were able to pivot and identify new samples associated with this campaign.
Initial Discovery and Pivoting
During the initial phase of this campaign, we discovered a decoy-based SPECA-related archive file targeting Kyrgyzstan around December 2024 - January 2025. The decoy was designed to distract from the real payload: a malicious C++ implant.
Decoy document identified during our research

Second campaign of Silent Lynx @ Bank of Kyrgyz Republic
Email identified during our reserach

We performed multiple pivots focusing on the implant, starting by analyzing the sample’s metadata and network indicators and functionalities, we found that the threat actor had been using a similar C++ implant, which led us to another campaign targeting the banking sector of Kyrgyzstan related to Silent Lynx too.
Information obtained during the analysis of the C++ implants

Information obtained during the analysis of the C++ implants

We leveraged VT corpus for deploying multiple Livehunt rules on multiple junctures, some of the simpler examples are as follows:


  • Looking at the usage of encoded Telegram Bot based payload inside the C++ implant. Using either content or malware_config modifiers when extracted from the config could help us to identify new samples.

  • Spawning Powershell.exe LOLBIN.

  • VT search enablers for checking for malicious email files, if uploaded from Central Asian Geosphere.

  • ISO-oriented first-stagers.

  • Multiple behavioral overlaps between YoroTrooper & Silent Lynx and further hunting hypothesis developed by us. 

Leveraging VT corpus and using further pivots on the above metrics and many others included on the malicious spear-phishing email, we also tracked some further campaigns. Most importantly, we developed a new YARA rule and a new hypothesis every time to hunt for similar implants leveraging the Livehunt feature depending on the tailored specifications and the raw data we received during hunting keeping in mind the cases of false positives and false negatives.
Decoy document identified during our hunting activities

Submissions identified in the decoy document

The threat actor repeatedly used the same implant across multiple campaigns in Uzbekistan and Turkmenistan. Using hunting queries through VT along with submitter:UZ or submitter:TM helped us to identify these samples.
The most important pivot in our investigation was the malware sample itself as shown in the previous screenshots was the usage of encoded PowerShell blob spawning powershell.exe, which was used multiple times across different campaigns. This sample acted as a key indicator, allowing us to uncover other campaigns targeting critical sectors in the region, and confirmed the repetitive nature of the actor's operations.
Also, thanks to VT feature of collections, we further leveraged it to build an attribution of the threat entity.
Collections used during the attribution process

DragonClone

Finally, the last campaign that we wanted to illustrate how pivoting within the VT ecosystem enabled our team to uncover new samples was by a group we named DRAGONCLONE
The SEQRITE APT Team has been monitoring DRAGONCLONE as they actively target critical sectors across Asia and the globe. They utilize sophisticated methods for cyber-espionage, compromising strategic organizations in sectors like telecom and energy through the deployment of custom malware implants, the exploitation of unpatched vulnerabilities, and extensive spear-phishing.
Initial Discovery
Recently, on 13th May, our team discovered a malicious ZIP file that surfaced across various sources, including VT. The ZIP file was used as a preliminary infection vector and contained multiple EXE and DLL files inside the archive, like this one which contains the malicious payload.
Chinese-based threat actors have a well-known tendency to deliver DLL sideloading implants as part of their infection chains. Leveraging crowdsourced Sigma rules in VT, along with personal hunting techniques using static YARA signatures, we were able to track and hunt this malicious spear-phishing attachment effectively. In their public Sigma Rules list you can find different Sigma Rules that are created to identify DLL SideLoading.
Pivoting Certificates via VT Corpus
While exploring the network of related artifacts, we could not initially find any direct commonalities. However, a particular clean-looking executable named “2025 China Mobile Tietong Co., Ltd. Internal Training Program” raised our concern. Its naming and metadata suggested potential masquerading behavior, making it a critical pivot point that required deeper investigation.
Certificates are one of the most key indicators, while looking into malicious artefacts, we saw that it is a fresh and clean copy of WonderShare’s Repairit Software, a well known software for repairing corrupted files, whereas a suspicious concern is that it has been signed by ShenZhen Thunder NetWorking Technologies Ltd
VirusTotal query: signature:"ShenZhen Thunder Networking Technologies Ltd."

Using this hunch, we discovered and hunted for executables, which have been signed by similar and found there have been multiple malicious binaries, although, this has not been the only indicator or pivot, but a key one, to research for further ones.
Pivoting on Malware Configs via VT Corpus
We analyzed the loader and determined it's slightly advanced, performing complex tasks like anti-debugging. More significantly, it drops V-Shell, a post-exploitation toolkit. V-Shell was originally open-source but later taken down by its authors and has been observed in campaigns by Earth Lamia.
After extracting the V-Shell shellcode, we discovered an unusual malware configuration property: qwe123qwe. By leveraging the VT corpus to pivot on this finding, we were able to identify additional V-Shell implant samples potentially linked to this campaign.
VirusTotal query: malware_config:"qwe123qwe"

VT Tips (based on the success story)

[In the words of VirusTotal…]
Threat hunting is an art, and a good artist needs the right tools and techniques. In this section, we'll share some practical tips for pivoting and hunting within the VirusTotal ecosystem, inspired by the techniques used in the campaigns discussed in this blog post.

Hunt by Malware Configuration

Many malware families use configuration files to store C2 information, encryption keys, and other operational data. For some malware families, VirusTotal automatically extracts these configurations. You can use unique values from these configurations to find other samples from the same campaign.
For instance, in the DRAGONCLONE investigation, the V-Shell implant had an unusual malware configuration property: qwe123qwe. A simple query like malware_config:"qwe123qwe" in VT can reveal other samples using the same configuration. Similarly, the Cobalt Strike beacons used by UNG0002 had a unique public key in their configuration that could be used for pivoting. That's thanks to Backscatter. We've written blogs showing how to do advanced hunting using only the malware_config modifier. Remember that you can search for samples by family name like malware_config:"redline" up to Telegram tokens and even URLs configured in the malware configuration like malware_config:"https://steamcommunity.com/profiles/76561198780612393".

Don't Overlook LNK File Metadata

Threat actors often make operational security (OPSEC) mistakes. One common mistake is failing to remove metadata from files, including LNK (shortcut) files. This metadata can reveal information about the attacker's machine, such as the hostname.
In the UNG0002 campaign, the actor consistently used LNK files with the same metadata, specifically the machine identifier laptop-g5qalv96. We know that this information can be also modified by them to deceive security researchers, but often we observe good information that can be used to track them. This allowed the SEQRITE team to uncover a wider set of samples by querying VirusTotal for this metadata string.

Track Actors via Leaked Bot Tokens

Some malware, especially those using public platforms for command and control, will have hardcoded API tokens. As seen in the "Silent Lynx" campaign, a PowerShell script used a hardcoded Telegram bot token for C2 communication and data exfiltration.
These tokens can be extracted from memory dumps during sandbox execution or from the malware's code itself. Once you have a token, you may be able to track the threat actor's commands and even identify other victims, as was done in the Silent Lynx investigation. A concrete example of using Telegram bot tokens is the query malware_config:"bot7213845603:AAFFyxsyId9av6CCDVB1BCAM5hKLby41Dr8", which is associated with four infostealer samples uploaded between 2024 and 2025.

Leverage Code-Signing Certificates

Threat actors sometimes sign their malicious executables to make them appear legitimate. They may use stolen certificates or freshly created ones. These certificates can be a powerful pivot point.
In the DRAGONCLONE case, a suspicious executable was signed by "ShenZhen Thunder Networking Technologies Ltd.". By searching for other files signed with the same certificate (signature:"ShenZhen Thunder Networking Technologies Ltd."), you can uncover other tools in the attacker's arsenal.

Utilize YARA and Sigma Rules

For proactive hunting, you can develop your own YARA rules to find malware families based on unique strings, code patterns, or other characteristics. This was a key technique in the "Silent Lynx" campaign for hunting similar implants.
Additionally, you can leverage the power of the community by using crowdsourced Sigma rules in VirusTotal, even within your YARA rules. These rules can help you identify malicious behaviors, such as the DLL sideloading techniques used by DRAGONCLONE, directly from sandbox execution data.
For example, If you want to search for the Sigma rule "Potential DLL Sideloading Of MsCorSvc.DLL" in VT files, you can use the query sigma_rule:99b4e5347f2c92e8a7aeac6dc7a4175104a8ba3354e022684bd3780ea9224137 to do so. All the Sigma rules are updated from the public repo and can be consumed here.

Conclusion

The success stories of the SEQRITE APT-Team in tracking campaigns like UNG0002, Silent Lynx, and DRAGONCLONE demonstrate the power of VirusTotal as a collaborative and comprehensive threat intelligence platform. By leveraging a combination of malware configuration analysis, metadata pivoting, and community-driven tools like YARA and Sigma rules, security researchers can effectively uncover and track sophisticated threat actors.
These examples highlight that successful threat hunting is not just about having the right tools, but also about applying creative and persistent investigation techniques. The ability to pivot from one piece of evidence to another is crucial in connecting the dots and revealing the full scope of a campaign. The SEQRITE team has demonstrated a deep understanding of these pivoting techniques, and we appreciate that they have decided to share their valuable insights with the rest of the community.
We hope these tips and stories have been insightful and will help you in your own threat-hunting endeavors. The fight against cybercrime is a collective effort, and the more we share our knowledge and experiences, the stronger we become as a community.
If you have a success story of using VirusTotal that you would like to share with the community, we would be delighted to hear from you. Please reach out to us, and we will be happy to feature your story in a future blog post at practitioners@virustotal.com.
Together, we can make the digital world a safer place.

Wednesday, October 08, 2025

Simpler Access for a Stronger VirusTotal

VirusTotal (VT) was founded on a simple principle: we are all stronger when we work together. Every file shared, every engine integrated, and every rule contributed strengthens our collective defense against cyber threats.

In the spirit of that collaboration, and in light of recent community discussions, we want to share our vision for the future of the platform. We have heard your feedback on the need for simplicity and accessibility, and we are taking action. VT will continue to be broadly available with straightforward options, including a robust free tier for our contributors and community.

Our commitment is to ensure the long-term health and openness of the platform. To do that, we are focused on three key goals:

  • Preserve VT as an open, collaborative platform built for the common good.
  • Provide our contributors with a reliable, cost-effective, and long-term framework for partnership.
  • Improve access to advanced features for academics, researchers, and defenders dedicated to public service.

Today, Google Threat Intelligence offers new ways to access advanced and curated threat intelligence, powered by the combined intelligence of VT, Mandiant and Google. As part of this broader evolution, we’re making sure VT remains open and transparent, while offering flexible options that meet the needs of our diverse users, from security researchers and startups to MSSPs and other security vendors.

VT now offers simpler pricing with tiers optimized for our partner contributors and community. We’re also introducing a Contributor Tier, a dedicated model for our engine partners. It ensures continuous access to VT feeds, priority support, and early access to new features. This tier recognizes their essential role in keeping VirusTotal open, collaborative, and globally impactful.

Key Access Tiers
Tier For Who Key Features Annual Price
VT Community Individual researchers, academics, educators. File scanning, URL scanning, public API, community features. Free.
VT Contributor Technological partners contributing detection engines. Feed of blindspots for free and discounts based on contribution tiers. From free (feed of blindspots) upon program acceptance.
VT Lite Small teams, early-stage startups, small MSSPs, SMB. Non-commercial. Advanced search, YARA hunting, File downloading, Private API, Private Scanning. Low-moderate usage. From $5k for low API volumes.
VT Duet Large organizations. Full feature set, high API quota. Community Intelligence only. Based on number of affiliates covered and contribution level.

You’ll notice that security vendors who do not contribute detections are not included in these tiers, as we are reaffirming our long-standing 2016 commitment to a healthy community. We welcome any organization to become a contributor and join us in protecting the common good. If you want to contribute, please let us know.

While Google Threat Intelligence will continue to deliver advanced threat context for enterprise customers, VirusTotal will always remain the collaborative, transparent, and community-driven foundation.

Thank you for helping us make this possible. We’re here to build the next chapter with you, not just for you.

Bernardo Quintero
Founder of VirusTotal

Wednesday, October 01, 2025

Crowdsourced AI += Exodia Labs

We’re adding a new specialist to VirusTotal’s Crowdsourced AI lineup: Exodia Labs, with an AI engine focused on analyzing Chrome extension (.CRX) files. This complements our existing Code Insight and other AI contributors by helping users better understand this format and detect possible threats.

What you get in VirusTotal

  • Second opinion for .CRX: Exodia Labs adds another AI analysis stream alongside Code Insight. It gives a fresh, independent view on the same sample type. Like all Crowdsourced AI engines, it’s meant to complement (not replace) traditional detections and human analysis.
  • Clear verdict in the UI: Each Exodia report includes a simple verdict (benign, suspicious, or malicious) to help you quickly spot risky extensions.
  • Searchable results in VT Intelligence: You can now use new operators to search and pivot across Exodia Labs results:
    • exodialabs_ai_verdict:malicious | suspicious | benign
    • exodialabs_ai_analysis:<keywords>

See it in action

Here are a few Exodia Labs AI report examples you can explore in VT:

31da559ae4af91106e0a18740d6bb8916e2017f6a37a02ea2a8127f1da30ec77

69c926ea84536bdaba7e4f765bde65eb0199ac30be3a96729a21ea7efa48d721

You can also explore Exodia Labs verdicts at scale using VirusTotal Intelligence.

For example, the following query lists Chrome extensions flagged as malicious and related to financial activity: exodialabs_ai_verdict:malicious AND exodialabs_ai_analysis:financial


This search shows several .CRX files where Exodia Labs AI detected suspicious financial behavior.

Let’s look at two examples:

  • Westpac Extension: Exodia Labs flags it as malicious. The AI analysis shows the extension connects to a remote WebSocket server and exfiltrates cookies, one-time passwords, and payment tokens. It manipulates banking pages and forwards captured credentials to a C2, showing signs of credential theft and financial data tampering.
    34244257f633e104d06b0c4273caca96eb916d26540eeea68495707cbc920bdb

  • Spidy Extension: Also flagged as malicious. The analysis shows it requests and cookies permissions, executes remote crawling jobs, and collects user profile and bank account details. The extension behaves like a data-exfiltration client handling financial credentials not mentioned in its public description.
    718eab32b5597e479d63f1d4e6402b7844eb9a4ee01c9028e44eb202d5ebcb2f

About Exodia Labs

Exodia Labs builds AI-driven analysis for Chrome Web Store extensions, also exposing a browser add-on that lets users request an AI assessment directly from an extension’s store page and view a detailed report plus a verdict. For security teams, the same analysis powers the backend results we index in VirusTotal.

Join Crowdsourced AI

Crowdsourced AI is about aggregating independent AI solutions that explain behavior and provide judgments across many file types, helping you understand unfamiliar code faster and spot novel threats sooner. If you build AI solutions that can help the community, we want to hear from you.

Tuesday, September 30, 2025

, , ,

Advanced Threat Hunting: Automating Large-Scale Operations with LLMs

Last week, we were fortunate enough to attend the fantastic LABScon conference, organized by the SentinelOne Labs team. While there, we presented a workshop titled 'Advanced Threat Hunting: Automating Large-Scale Operations with LLMs.' The main goal of this workshop was to show attendees how they could automate their research using the VirusTotal API and Gemini. Specifically, we demonstrated how to integrate the power of Google Colab to quickly and efficiently generate Jupyter notebooks using natural language.

It goes without saying that the use of LLMs is a must for every analyst today. For this reason, we also want to make life easier for everyone who uses the VirusTotal API for research.

The Power of the VirusTotal API and vt-py

The VirusTotal API is the programmatic gateway to our massive repository of threat intelligence data. While the VirusTotal GUI is great for agile querying, the API unlocks the ability to conduct large-scale, automated investigations and access raw data with more pivoting opportunities.

To make interacting with the API even easier, we recommend using the vt-py library. It simplifies much of the complexity of HTTP requests, JSON parsing, and rate limit management, making it the go-to choice for Python users.

From Natural Language to Actionable Intelligence with Gemini

To bridge the gap between human questions and API queries, we can leverage the integrated Gemini in Google Colab. We have created a "meta Colab" notebook that is pre-populated with working real code snippets for interacting with the VirusTotal API to retrieve different information such as campaigns, threat actors, malware, samples, URLs among others (which we will share soon). This provides Gemini with the necessary context to understand your natural language requests and generate accurate Python code to query the VirusTotal API. Gemini doesn't call the API directly; it creates the code snippet for you to execute.

For Gemini to generate accurate and relevant code, it needs context. Our meta Colab notebook is filled with examples that act as a guide. For complex questions, it will be nice to provide the exact field names that you want to work with. This context generally falls into two categories:

  1. Reference Documentation: We include detailed documentation directly in the Colab. For example, we provide a comprehensive list of all available file search modifiers for the VirusTotal Intelligence search endpoint. This gives Gemini the "vocabulary" it needs to construct precise queries.
  2. Working Code Examples: The notebook is pre-populated with dozens of working vt-py code snippets for common tasks like retrieving file information, performing an intelligence search, or getting relationships. This gives Gemini the "grammar" and correct patterns for interacting with our API.

Example of code snippet context that we have included in our meta colab:

query_results_with_behaviors = []
query = "have:sigma have:yara have:ids have:malware_config fs:1d+ have:bundled_file tag:overlay"
RELATIONS = "behaviours"

async for itemobj in cli.iterator('/intelligence/search',params={'query': query, 'relationships': RELATIONS, 'relationship_attributes[%s]'%(RELATIONS): '\*'},limit=0):
    query_results_with_behaviors.append(itemobj.to_dict())

res_with_behaviors = list(query_results_with_behaviors)
Diagram showing the flow from User's Action to Actionable Intelligence via Gemini and VirusTotal API

Practical Example: Hunting for Malicious LNK Files

One of the exercises in our workshop focused on analyzing malicious LNK files submitted to VirusTotal in a specific period of time to limit results. The goal was to discover common execution techniques and interesting command-line patterns.

Here’s how you could approach this with natural language queries in our meta Colab:

1. Retrieve all LNK file objects matching the criteria.

You can start by asking Gemini:

Create a code snippet to retrieve all LNK files submitted from January 1st, 2025, until April 1st, 2025, with 5 or more positives.

Code generated:

query_results = []

# Query for LNK files submitted between January 1st, 2025, and April 1st, 2025, with 5 or more positives.
query = "fs:2025-01-01+ fs:2025-04-01- p:5+ type:lnk"

async for itemobj in cli.iterator('/intelligence/search',params={'query': "%s"%(query)},limit=0):
    query_results.append(itemobj.to_dict())

lnk_files = list(query_results)
print(f"Found {len(lnk_files)} LNK files matching the criteria.")

2. Flatten the data and visualize it.

Next, you can ask Gemini to process the retrieved data and create visualizations:

Flatten the data in a dataframe and create a bar chart depicting the most common TargetFileDOSName values.

Code generated:

import pandas as pd
import plotly.express as px

# Extract the 'attributes' dictionary from each object in our results
attributes_list = [result.get('attributes', {}) for result in lnk_files]

# Use json_normalize to create a flat DataFrame
df_flat = pd.json_normalize(attributes_list)

# Check if 'exiftool.TargetFileDOSName' column exists and is not empty
if 'exiftool.TargetFileDOSName' in df_flat.columns and not df_flat['exiftool.TargetFileDOSName'].empty:
    # Count the occurrences of each TargetFileDOSName
    target_file_counts = df_flat['exiftool.TargetFileDOSName'].value_counts().reset_index()
    target_file_counts.columns = ['TargetFileDOSName', 'Count']

    # Get the top 10 most common TargetFileDOSName values for visualization
    top_target_files = target_file_counts.head(10)

    # Create an interactive bar chart
    fig = px.bar(top_target_files,
                 x='Count',
                 y='TargetFileDOSName',
                 orientation='h',
                 title='Top 10 Most Common TargetFileDOSName in LNK files',
                 labels={'Count': 'Number of Occurrences', 'TargetFileDOSName': 'Target File DOS Name'})
    fig.show()
else:
    print("The 'exiftool.TargetFileDOSName' column was not found or is empty in the flattened data.")

print("DataFrame shape:", df_flat.shape)
df_flat.head()

This will generate a code snippet using pandas and Plotly to give you a clear view of the most frequent target files in the malicious LNKs.

3. Visualize relationships with a Sankey diagram.

To dig deeper and understand the relationships between the infected machines and the malware, you can ask for a more advanced visualization:

Create a Sankey diagram connecting MachineID (source) to TargetFileDOSName (target), with link widths representing the number of occurrences. Include only MachineID values with 10 or more occurrences, grouping the rest as "Other", and make the visualization interactive with hover details.

Code generated:

import plotly.graph_objects as go
import pandas as pd

# Ensure df_flat is available from previous steps
# If not, you might need to re-run the cell that creates df_flat.

# Check if necessary columns exist in df_flat
if 'lnk_info.machine_id' in df_flat.columns and 'exiftool.TargetFileDOSName' in df_flat.columns:

    # Calculate the frequency of each machine ID and target file name pair
    pair_frequencies = df_flat.groupby(['lnk_info.machine_id', 'exiftool.TargetFileDOSName']).size().reset_index(name='frequency')

    # Identify MachineIDs with less than 10 occurrences
    machine_id_counts = pair_frequencies.groupby('lnk_info.machine_id')['frequency'].sum().reset_index(name='count')
    other_machine_ids = machine_id_counts[machine_id_counts['count'] < 10]['lnk_info.machine_id'].tolist()

    # Replace infrequent MachineIDs with 'Other'
    pair_frequencies['source_node'] = pair_frequencies['lnk_info.machine_id'].apply(lambda x: 'Other' if x in other_machine_ids else x)

    # Aggregate frequencies for the 'Other' node and other valid nodes
    sankey_data = pair_frequencies.groupby(['source_node', 'exiftool.TargetFileDOSName'])['frequency'].sum().reset_index()

    # Create unique labels for the Sankey nodes
    all_nodes = pd.concat([sankey_data['source_node'], sankey_data['exiftool.TargetFileDOSName']]).unique()
    label_to_index = {label: i for i, label in enumerate(all_nodes)}

    # Create the Sankey diagram
    fig = go.Figure(data=[go.Sankey(
        node=dict(
            pad=15,
            thickness=20,
            line=dict(color="black", width=0.5),
            label=all_nodes,
            color="blue"
        ),
        link=dict(
            source=sankey_data['source_node'].map(label_to_index),
            target=sankey_data['exiftool.TargetFileDOSName'].map(label_to_index),
            value=sankey_data['frequency'],
            label=sankey_data.apply(lambda row: f"{row['source_node']} -> {row['exiftool.TargetFileDOSName']}: {row['frequency']}", axis=1)
        )
    )])

    fig.update_layout(title_text="MachineID to TargetFileDOSName Sankey Diagram (Infrequent MachineIDs grouped as 'Other')", font_size=10)
    fig.show()
else:
    print("Required columns 'lnk_info.machine_id' or 'exiftool.TargetFileDOSName' not found in the DataFrame.")

This will create a powerful visualization that shows how different machines are being targeted with specific files, allowing us identify patterns for each machineID metadata.

Sankey diagram

4. Analyze the behavior of the LNK files.

Finally, you can investigate the behavior of the LNK files themselves, for example, by looking at how they are executed:

Create a pie chart depicting the distribution of window behaviors for LNK target processes (hidden, minimized,...).

Code generated:

import plotly.express as px
import pandas as pd

# Check if the relevant column exists
if 'lnk_info.header.show_window_str' in df_flat.columns and not df_flat['lnk_info.header.show_window_str'].empty:
    # Count the occurrences of each window behavior string
    window_behavior_counts = df_flat['lnk_info.header.show_window_str'].value_counts().reset_index()
    window_behavior_counts.columns = ['Window Behavior', 'Count']

    # Create an interactive pie chart
    fig = px.pie(window_behavior_counts,
                 names='Window Behavior',
                 values='Count',
                 title='Distribution of LNK Target Process Window Behaviors',
                 hover_data={'Count': True})
    fig.show()
else:
    print("The 'lnk_info.header.show_window_str' column was not found or is empty in the flattened data. Cannot create the pie chart.")
Pie chart of window behaviors

This allows you to quickly identify if the LNK files are trying to hide their execution, a common use case for malware.

5. Geographic Threat Mapping of Submissions

Understanding the geographical distribution of malware submissions can provide crucial insights into whether a campaign is widespread or targeting a specific region. By analyzing the submission metadata for our set of LNK files, an analyst can quickly identify potential hotspots, which could indicate a geographically focused attack. This is achieved by fetching the submissions relationship for each file, extracting the country of origin for each submission, and plotting this data on a world map (please note that submissions doesn’t mean targeted countries, but in some cases could be related).

Using the LNK files identified previously, create a code snippet that retrieves the submissions relationship for each file. Then, count the number of submissions from each country and display the results on a choropleth world map to visualize the submission hotspots.

In this case, Gemini uses the step-by-step execution, which basically is a set of consecutive executions. The code generated was:

Step 1
query_results_with_submissions = []

# Define the search query string for LNK files
query = "fs:2025-01-01+ fs:2025-04-01- p:5+ type:lnk"

# Define the relationship to request
RELATIONS = "submissions"

# Define the relationship attributes to retrieve (all attributes)
ATTRIBUTES_RELATIONS = "*"

# Use cli.iterator to perform an intelligence search with the specified query, relationships, and relationship attributes.
async for itemobj in cli.iterator('/intelligence/search', params={'query': query, 'relationships': RELATIONS, 'relationship_attributes[%s]'%(RELATIONS): ATTRIBUTES_RELATIONS}, limit=0):
    query_results_with_submissions.append(itemobj.to_dict())

# Convert query_results_with_submissions to a list called lnk_files_with_submissions
lnk_files_with_submissions = list(query_results_with_submissions)

# Print the number of LNK files found with submission data.
print(f"Found {len(lnk_files_with_submissions)} LNK files with submission data matching the criteria.")
Step 2
import pandas as pd

# Initialize an empty list to store submission countries
submission_countries = []

# Iterate through each item (LNK file) in the lnk_files_with_submissions list
for item in lnk_files_with_submissions:
    # Check if 'relationships' and 'submissions' exist and are not empty
    if 'relationships' in item and 'submissions' in item['relationships'] and item['relationships']['submissions']['data']:
        # Iterate through each submission object
        for submission in item['relationships']['submissions']['data']:
            # Extract the 'country' from 'submission['attributes']' and append it to the list
            if 'country' in submission['attributes']:
                submission_countries.append(submission['attributes']['country'])

# Convert the submission_countries list into a pandas Series
submission_countries_series = pd.Series(submission_countries)

# Count the occurrences of each country and store in a DataFrame
country_counts = submission_countries_series.value_counts().reset_index()
country_counts.columns = ['country', 'count']

print(f"Total unique countries found: {len(country_counts)}")
print("Top 5 submission countries:")
print(country_counts.head())
Step 3
!pip install pycountry
import plotly.express as px
import pandas as pd
import pycountry

# Function to get ISO-3 code from country name
def get_iso3(country_name):
    try:
        return pycountry.countries.search_fuzzy(country_name)[0].alpha_3
    except:
        return None

# Apply the function to get ISO-3 codes
country_counts['iso_alpha'] = country_counts['country'].apply(get_iso3)

# Create a world heatmap using Plotly Express with ISO-3 codes
fig = px.choropleth(country_counts.dropna(subset=['iso_alpha']), # Drop rows where ISO code couldn't be found
                      locations="iso_alpha", # Use ISO-3 codes for locations
                      locationmode='ISO-3',   # Specify ISO-3 location mode
                      color="count",
                      hover_name="country",
                      color_continuous_scale="Viridis",
                      title="Submission Country Heatmap")

# Update layout to increase figure size
fig.update_layout(
    autosize=False,
    width=1000, # Adjust width as needed
    height=600 # Adjust height as needed
)

fig.show()
World heatmap of submissions

More Examples

Here are some other examples of how you can use natural language to query the VirusTotal API:

  • Find malicious Chrome extensions (based on this blog):

This specific question is actually a really good example of how Google Colab will run the step-by-step process. As you can see, we asked different questions in the same prompt, for that reason Gemini will do it in different steps.

Retrieve all the samples from the /intelligence/search endpoint with the following query: fs:2025-08-01+ fs:2025-08-14- p:5+ tag:crx have:codeinsight.

Get the bundled_files and create a new dataframe with them too, but store only those having 5 or more malicious value.

Then identify the top 5 magika of those bundled files obtained

Code generated:

Step 1
query_results_with_bundled_files = []

# Define the search query string
query = "fs:2025-08-01+ fs:2025-08-14- p:5+ tag:crx have:codeinsight"

# Define the relationship to request
RELATIONS = "bundled_files"

# Define the relationship attributes to retrieve (all attributes)
ATTRIBUTES_RELATIONS = "*"

# Use cli.iterator to perform an intelligence search with the specified query, relationships, and relationship attributes.
async for itemobj in cli.iterator('/intelligence/search', params={'query': query, 'relationships': RELATIONS, 'relationship_attributes[%s]'%(RELATIONS): ATTRIBUTES_RELATIONS}, limit=0):
    query_results_with_bundled_files.append(itemobj.to_dict())

# Convert query_results_with_bundled_files to a list called crx_files_with_bundled_files
crx_files_with_bundled_files = list(query_results_with_bundled_files)

# Print the number of found CRX files with bundled file data.
print(f"Found {len(crx_files_with_bundled_files)} CRX files with bundled file data matching the criteria.")
Step 2
import pandas as pd

# List to store all bundled file attributes
all_bundled_files_data = []

# Iterate through each CRX file with bundled file data
for crx_file in crx_files_with_bundled_files:
    if 'relationships' in crx_file and 'bundled_files' in crx_file['relationships']:
        for bundled_file in crx_file['relationships']['bundled_files']['data']:
            all_bundled_files_data.append(bundled_file.get('attributes', {}))

# Create a DataFrame from the extracted bundled file attributes
df_bundled_files = pd.json_normalize(all_bundled_files_data)

# Filter for bundled files with 5 or more malicious detections
df_malicious_bundled_files = df_bundled_files[df_bundled_files['last_analysis_stats.malicious'] >= 5]

print(f"Total bundled files extracted: {len(df_bundled_files)}")
print(f"Bundled files with 5 or more malicious detections: {len(df_malicious_bundled_files)}")

df_malicious_bundled_files.head()
Step 3
import pandas as pd
import plotly.express as px

# Check if the 'magika' column exists and is not empty in the filtered DataFrame
if 'magika' in df_malicious_bundled_files.columns and not df_malicious_bundled_files['magika'].empty:
    # Count the occurrences of each magika type
    magika_counts = df_malicious_bundled_files['magika'].value_counts().reset_index()
    magika_counts.columns = ['Magika Type', 'Count']

    # Get the top 5 most frequent magika types
    top_5_magika = magika_counts.head(5)

    print("Top 5 Magika Types in malicious bundled files:")
    print(top_5_magika)

    # Visualize the top 5 magika types
    fig = px.bar(top_5_magika,
                 x='Count',
                 y='Magika Type',
                 orientation='h',
                 title='Top 5 Magika Types in Malicious Bundled Files',
                 labels={'Count': 'Number of Occurrences', 'Magika Type': 'Magika Type'}) 
    fig.update_layout(yaxis={'categoryorder':'total ascending'}) # Order bars by count
    fig.show()
else:
    print("The 'magika' column was not found or is empty in the filtered malicious bundled files DataFrame. Cannot identify top magika types.")
  • Retrieve threat actors:
Retrieve threat actors targeting the United Kingdom with an espionage motivation. Sort the results in descending order of relevance. Display the total number of threat actors and their names.
  • Investigate campaigns:
Retrieve information about threat actors and malware involved in campaigns targeting Pakistan. For each threat actor, retrieve its country of origin, motivations, and targeted industries. For each malware, retrieve its name.

What’s next

This workshop, co-authored with Aleksandar from Sentinel LABS, will be presented at future conferences to show the community how to get the most out of the VirusTotal API. We'll be updating the content of our meta colab regularly and will share more information soon about how to get the Google Colab.

In the meantime, if you have any feedback or ideas to contribute, we are open to suggestions.