Wednesday, June 04, 2025

YARA-X 1.0.0: The Stable Release and Its Advantages

Audio version of this post, created with NotebookLM Deep Dive

Short note for everyone who already lives and breathes YARA:

Victor (aka plusvic) just launched YARA-X 1.0.0. Full details: https://virustotal.github.io/yara-x/blog/yara-x-is-stable/

What changes for you

Area	YARA 4.x	YARA-X
Engine	C/C++, manual memory	Rust, memory-safe
Rule compatibility	–	~99 % work as-is
Speed (regex / loops)	Can bottleneck scans	Often 5–10× faster
Error messages	Generic	Line-accurate, clearer
CLI	Plain text	Colour, JSON/YAML dump, shell completion
Future work	Bug-fix only	New features land here

Why move now

Performance – heavy rules (large regex, deep loops) finish seconds faster.
Safety – Rust core avoids the usual memory bugs and makes crashes rare.
Maintainability – parser and scanner are decoupled; easier to embed or extend.
Better tooling – built-in formatter (yara-x fmt), linter-friendly output.
Active roadmap – new language features will go to YARA-X only.

We already use YARA-X at VirusTotal for Livehunt and Retrohunt. Billions of files later, it behaves.

Give it a spin, report issues, and send feedback our way. Huge thanks to Victor for pushing the project this far. Let’s keep making pattern matching simpler and faster.

What 17,845 GitHub Repos Taught Us About Malicious MCP Servers

Audio version of this post, created with NotebookLM Deep Dive

Spoiler: VirusTotal Code Insight’s preliminary audit flagged nearly 8% of MCP (Model Context Protocol) servers on GitHub as potentially forged for evil, though the sad truth is, bad intentions aren’t required to follow bad practices and publish code with critical vulnerabilities.

Before we get started, a quick personal note. A couple of weeks ago, I announced at Google that I’m stepping away from my role as a manager of managers and getting back to my roots, focusing on the VirusTotal community. And I’m not doing it alone. I’m joined by some legendary names from the project’s early days, like Julio, the very first VirusTotal developer, and Víctor, creator of YARA and YARA-X. In this new chapter, we’re going deep into AI, not just evolving VT and using it to analyze typical threats but also to hunt down the new ones riding the AI wave, like malicious models and MCPs, among others.

As many of you already know, MCP (Model Context Protocol) is a simple but powerful standard that lets large language models interact with external tools and APIs via JSON-RPC. Think of it as a universal adapter: MCP turns scripts, services, and data sources into callable functions that models like Claude, GPT or Gemini can use to answer complex queries or automate tasks. In just a few months, MCP has gone from niche to near-standard, with native support across most major LLM platforms.

Before building and releasing our own MCP server for VirusTotal (which is coming very soon) we wanted to take a step back and understand how this protocol is being used in the wild. Specifically: are people already abusing it to build malicious plugins? And if so, how could we detect and classify these threats inside VT?

With that in mind, I set out to run a quick three-phase experiment (aka three humble Python scripts). First, a harvesting phase to collect as many GitHub projects as possible by querying the API for MCP-related keywords like “model-context-protocol”, “server_mcp” or “define_mcp_tool”, among others. Then came a filtering step to isolate the interesting repos. Not everything with "MCP" in the README is a real server implementation, so I built a scoring system to identify true servers based on dependency files, import statements, keywords in code, presence of mcp.json, and more. After applying that filter, we ended up with a focused dataset of 17,845 likely MCP server projects.
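To make the filtering step more concrete, here is a minimal sketch of what such a scoring function could look like. The signals come from the description above; the weights, the threshold, the file names and the import hints are all hypothetical, since the post does not say how the real script scored repositories.

```python
# Hypothetical weights and threshold: the post names the signals, not the scoring.
WEIGHTS = {"dependency": 3, "import": 3, "manifest": 2, "keyword": 1}
THRESHOLD = 4
DEPENDENCY_FILES = ("requirements.txt", "pyproject.toml", "package.json")
SOURCE_SUFFIXES = (".py", ".ts", ".js")
IMPORT_HINTS = ("import mcp", "from mcp", "@modelcontextprotocol/sdk")
KEYWORDS = ("model-context-protocol", "server_mcp", "define_mcp_tool")


def score_repo(files: dict[str, str]) -> int:
    """Score a repository given a mapping of relative path -> file content."""
    score = 0
    for path, text in files.items():
        name = path.rsplit("/", 1)[-1].lower()
        lowered = text.lower()
        # Signal 1: an MCP package mentioned in a dependency file.
        if name in DEPENDENCY_FILES and ("mcp" in lowered or "modelcontextprotocol" in lowered):
            score += WEIGHTS["dependency"]
        # Signal 2: an mcp.json file anywhere in the repository.
        if name == "mcp.json":
            score += WEIGHTS["manifest"]
        if name.endswith(SOURCE_SUFFIXES):
            # Signal 3: an MCP import statement in source code.
            if any(hint in lowered for hint in IMPORT_HINTS):
                score += WEIGHTS["import"]
            # Signal 4: one point per MCP-related keyword found in the file.
            score += WEIGHTS["keyword"] * sum(kw in lowered for kw in KEYWORDS)
    return score


def is_likely_server(files: dict[str, str]) -> bool:
    return score_repo(files) >= THRESHOLD


readme_only = {"README.md": "Notes about MCP and define_mcp_tool"}
server = {
    "requirements.txt": "mcp>=1.0\n",
    "src/server.py": "from mcp.server import Server\n",
}
print(is_likely_server(readme_only), is_likely_server(server))
```

A README that merely mentions MCP scores nothing here, while a dependency entry plus an import clears the assumed threshold.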

Finally, as the third phase, we ran a security review using VT Code Insight powered by Gemini 2.5 Flash, taking advantage of its 1-million token context window, speed, and code analysis skills to evaluate each project as a whole. We asked Code Insight for a basic verdict and to flag any High, Medium, or Low vulnerabilities. But after just a few hundred analyses we had to hit pause: Code Insight was surfacing so many issues that the results quickly became overwhelming. So we tightened things up with a second and more focused prompt, asking Code Insight to look specifically for signs of intentional malicious behavior along with reasoning that supported a conclusion of malice.

We let the new prompt run on the full dataset and Code Insight got to work. In the end, it marked 1,408 repositories as likely designed to be malicious. After checking some of these results by hand, two things were clear to me. First: there are many possible attack vectors that can be used through an MCP server. And second: Code Insight seems to trust human developers too much. It often assumes that some bad practices and the resulting critical bugs couldn’t be accidental.

“This pattern—creating a powerful, remotely triggerable code execution vulnerability and simultaneously preparing a collection of sensitive data (including data not needed for normal operation)—is characteristic of an intentional backdoor designed for data exfiltration and system compromise. The dynamic tool generation serves as a plausible cover for the unsafe use of `exec`.” Oh, Code Insight… if only you knew the kind of chaos vibe coding is causing. We’re going to be very busy in cybersecurity cleaning up after these accidental masterpieces

We’ve confirmed some of the flagged projects were just proof-of-concepts and security researcher demos, and many tiny “hello-world” examples were missing basic security features which Code Insight called out as “likely malicious”, because no sane developer would ship that to production. But even if you filter out the hobby projects, there’s still a scary amount of real attack vectors and critical vulnerabilities out there.

While we continue manually reviewing Code Insight’s reports to learn more about the issues and weak spots it uncovered, we also asked Gemini 2.5 Flash to help us categorize them. We provided it with the problem summaries from the 1,408 MCP-related repositories flagged as potentially problematic, and asked for a simple list, just a brief enumeration of the attack techniques involved. Gemini came back with the following list:

Attack vector	Example Indicators
Malicious-Server Supply Chain	Self-update scripts, install hooks from non-canonical URLs, latest tag pulls.
Rogue Server / Impersonation	Hard-coded IPs or typo-squatted domains, no TLS/mTLS verification.
Credential Harvesting	Code that reads ~/.aws, Keychain, or env vars and posts to external endpoint.
Tool-Based RCE & File Ops	subprocess, exec, or rm -rf paths built from LLM/user input.
Server-Side Command Injection	Server concatenates JSON-RPC params into shell/SQL without escaping.
Semantic-Gap Poisoning	Manifest says “read-only”; implementation writes files or opens sockets.
Over-broad Permissions	OAuth scopes * / “full_access”, multiple data silos bridged in one tool.
Indirect Prompt Injection	HTML comments, zero-width chars, or Base64 blobs returned to the host.
Context/Data Poisoning	Unvalidated web-scrape fed straight into context= parameter.
Sampling-Feature Abuse	Server requests giant completions before any other call; leaks system prompt.
Living-Off-The-Land	Malicious server does nothing but orchestrate trusted tools already installed.
Chained MCP Exploitation	Output from Server A becomes params for Server B within one loop.
Financial-Fraud Tools / DoS / Persistence	Payment APIs with LLM-supplied dest-IDs, infinite loops without rate limits, hot-swapped binaries.

If you're building or defending around MCPs, there are a few quick wins to keep things safer:

treat MCP servers like browser extensions (sign, hash, and pin specific versions)
isolate them in containers or WASM sandboxes with strict file and network limits
make permissions visible and revocable through a clear, zero-trust-style UI
and never let model outputs go unfiltered, strip out sneaky stuff like invisible characters, HTML comments, or rogue script tags before looping anything back into your LLM.
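
As an illustration of the last point, the sketch below strips HTML comments, script tags and invisible format characters from tool output before it is fed back to a model. The function name is hypothetical, and regex-based stripping is only a first line of defence, not a complete HTML sanitizer.

```python
import re
import unicodedata

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
SCRIPT_TAG = re.compile(r"<script\b.*?</script\s*>", re.DOTALL | re.IGNORECASE)


def sanitize_tool_output(text: str) -> str:
    """Remove common hiding places for injected instructions."""
    text = HTML_COMMENT.sub("", text)
    text = SCRIPT_TAG.sub("", text)
    # Unicode category "Cf" (format) covers zero-width spaces and joiners,
    # the byte order mark, and bidirectional control characters.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


raw = "ok\u200b<!-- ignore previous instructions -->!<script>x()</script>"
print(sanitize_tool_output(raw))
```

Dropping every format character is deliberately blunt: it also removes soft hyphens and the joiners used in some emoji sequences, which is usually an acceptable trade-off for text that is about to re-enter an LLM context.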

MCPs are growing fast (almost 18,000 servers already in the wild), and with that growth comes a mountain of security debt. The good news? We’ll soon be launching a dedicated feature in VirusTotal to analyze MCP servers.
Stay tuned… we’re just getting started.

Thursday, January 09, 2025

detection engineering, sigma, sysmon, threat hunting, threat intelligence

Research that builds detections

Note: You can view the full content of the blog here.

Introduction

Detection engineering is becoming increasingly important in surfacing new malicious activity. Threat actors might take advantage of previously unknown malware families - but a successful detection of certain methodologies or artifacts can help expose the entire infection chain.

In previous blog posts, we announced the integration of Sigma rules for macOS and Linux into VirusTotal, as well as ways in which Sigma rules can be converted to YARA to take advantage of VirusTotal Livehunt capabilities. In this post, we will show different approaches to hunt for interesting samples and derive new Sigma detection opportunities based on their behavior.

Tell me what role you have and I'll tell you how you use VirusTotal

VirusTotal is a really useful tool that can be used in many different ways. We have seen how people from SOCs and Incident Response teams use it (in fact, we have our VirusTotal Academy videos for SOC and IR teams), and we have also shown how those who hunt for threats or analyze those threats can use it too.

But there's another really cool way to use VirusTotal - for people who build detections and those who are doing research. We want to show everyone how we use VirusTotal in our work. Hopefully, this will be helpful and also give people ideas for new ways to use it themselves.

To explain our process, we used examples of Lummac and VenomRAT samples that we found in recent campaigns. These caught our attention due to some behaviors that had not been identified by public detection rules in the community. For that reason we have created two Sigma rules to share with the community, but if you want to get all the details about how we identified them and started our research, go to our Google Threat Intelligence community blog.

Our approach

As detection engineers, it is important to look for techniques that can be in use by multiple threat actors - as this makes tracking malicious activity more efficient. Prior to creating those detections, it is best to check existing research and rule collections, such as the Sigma rules repository. This can save time and effort, as well as provide insight into previously observed samples that can be further researched.

A different approach would be to instead look for malicious files that are not detected by existing Sigma rules, since they can uncover novel methodologies and provide new opportunities for detection creation.

One approach is to hunt for files that are flagged by at least five different AV vendors, were recently uploaded within the last month, have sandbox execution (in order to view their behavior), and which have not triggered any Crowdsourced Sigma rules.

p:5+ have:behavior fs:30d+ not have:sigma

This initial query can be adapted to incorporate additional filters that the researcher may find relevant. These could include modifiers to identify, for example, the presence of the PowerShell process in the list of executed processes (behavior_created_processes:powershell.exe), filtering results to only include documents (type:document), or identifying communication with services like Pastebin (behavior_network:pastebin.com).
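
If you run these hunts regularly, composing the query string in code keeps the variants consistent. The helper below is a small sketch: the function name and parameters are hypothetical, and it only builds the search string, using the modifiers shown in this post.

```python
def hunting_query(min_detections: int = 5, days: int = 30, extra: tuple[str, ...] = ()) -> str:
    """Build the base hunt: N+ detections, sandbox behavior, recent, no Sigma matches."""
    terms = [
        f"p:{min_detections}+",   # flagged by at least N AV engines
        "have:behavior",          # sandbox execution available
        f"fs:{days}d+",           # first seen within the last N days
        "not have:sigma",         # no Crowdsourced Sigma rule matched
    ]
    terms.extend(extra)           # optional researcher-specific filters
    return " ".join(terms)


print(hunting_query())
print(hunting_query(extra=("behavior_created_processes:powershell.exe", "type:document")))
```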

Another way to go is to look at files that have been flagged by at least five AV engines and were tested in either Zenbox or CAPE. These sandboxes often have great logs produced by Sysmon, which are really useful for figuring out how to spot these threats. Again, we'd want to focus on files uploaded in the last month that haven't triggered any Sigma rules. This gives us a good starting point for building new detection rules.

p:5+ (sandbox_name:"CAPE Sandbox" or sandbox_name:"Zenbox") fs:30d+ not have:sigma

Lastly, another idea is to look for files that have not triggered many high severity detections from the Sigma Crowdsourced rules, as these can be more evasive. Specifically, we will look for samples with zero critical, high or medium alerts - and no more than two low severity ones.

p:5+ have:behavior fs:30d+ sigma_critical:0 sigma_high:0 sigma_medium:0 sigma_low:2-

With these queries, we can start investigating samples that may be worth writing detection rules for.

Our detections for the community

Our approach helps us identify behaviors that seem interesting and worth focusing on. In our blog, where we explain this approach in detail, we highlighted two campaigns linked to Lummac and VenomRAT that exhibited interesting activity. Because of this, we decided to share the Sigma rules we developed for these campaigns. Both rules have been published in Sigma's official repository for the community.

Detect The Execution Of More.com And Vbc.exe Related to Lummac Stealer

Sigma rule on GitHub: https://github.com/SigmaHQ/sigma/blob/master/rules-emerging-threats/2024/Malware/Lummac-Stealer/proc_creation_win_malware_lummac_more_vbc.yml

title: Detect The Execution Of More.com And Vbc.exe Related to Lummac Stealer
id: 19b3806e-46f2-4b4c-9337-e3d8653245ea
status: experimental
description: Detects the execution of more.com and vbc.exe in the process tree. This behavior was observed in a set of samples related to Lummac Stealer. The Lummac payload is injected into the vbc.exe process.
references:
    - https://www.virustotal.com/gui/file/14d886517fff2cc8955844b252c985ab59f2f95b2849002778f03a8f07eb8aef
    - https://strontic.github.io/xcyclopedia/library/more.com-EDB3046610020EE614B5B81B0439895E.html
    - https://strontic.github.io/xcyclopedia/library/vbc.exe-A731372E6F6978CE25617AE01B143351.html
author: Joseliyo Sanchez, @Joseliyo_Jstnk
date: 2024-11-14
tags:
    - attack.defense-evasion
    - attack.t1055
logsource:
    category: process_creation
    product: windows
detection:
    # VT Query: behaviour_processes:"C:\\Windows\\SysWOW64\\more.com" behaviour_processes:"C:\\Windows\\Microsoft.NET\\Framework\\v4.0.30319\\vbc.exe"
    selection_parent:
        ParentImage|endswith: '\more.com'
    selection_child:
        - Image|endswith: '\vbc.exe'
        - OriginalFileName: 'vbc.exe'
    condition: all of selection_*
falsepositives:
    - Unknown
level: high

Sysmon event for: Detect The Execution Of More.com And Vbc.exe Related to Lummac Stealer

{
  "System": {
    "Provider": {
      "Guid": "{5770385F-C22A-43E0-BF4C-06F5698FFBD9}",
      "Name": "Microsoft-Windows-Sysmon"
    },
    "EventID": 1,
    "Version": 5,
    "Level": 4,
    "Task": 1,
    "Opcode": 0,
    "Keywords": "0x8000000000000000",
    "TimeCreated": {
      "SystemTime": "2024-11-26T16:23:05.132539500Z"
    },
    "EventRecordID": 692861,
    "Correlation": {},
    "Execution": {
      "ProcessID": 2396,
      "ThreadID": 3116
    },
    "Channel": "Microsoft-Windows-Sysmon/Operational",
    "Computer": "DESKTOP-B0T93D6",
    "Security": {
      "UserID": "S-1-5-18"
    }
  },
  "EventData": {
    "RuleName": "-",
    "UtcTime": "2024-11-26 16:23:05.064",
    "ProcessGuid": "{C784477D-F5E9-6745-6006-000000003F00}",
    "ProcessId": 4184,
    "Image": "C:\\Windows\\Microsoft.NET\\Framework\\v4.0.30319\\vbc.exe",
    "FileVersion": "14.8.3761.0",
    "Description": "Visual Basic Command Line Compiler",
    "Product": "Microsoft® .NET Framework",
    "Company": "Microsoft Corporation",
    "OriginalFileName": "vbc.exe",
    "CommandLine": "C:\\Windows\\Microsoft.NET\\Framework\\v4.0.30319\\vbc.exe",
    "CurrentDirectory": "C:\\Users\\george\\AppData\\Roaming\\comlocal\\RUYCLAXYVMFJ\\",
    "User": "DESKTOP-B0T93D6\\george",
    "LogonGuid": "{C784477D-9D9B-66FF-6E87-050000000000}",
    "LogonId": "0x5876e",
    "TerminalSessionId": 1,
    "IntegrityLevel": "High",
    "Hashes": {
      "SHA1": "61F4D9A9EE38DBC72E840B3624520CF31A3A8653",
      "MD5": "FCCB961AE76D9E600A558D2D0225ED43",
      "SHA256": "466876F453563A272ADB5D568670ECA98D805E7ECAA5A2E18C92B6D3C947DF93",
      "IMPHASH": "1460E2E6D7F8ECA4240B7C78FA619D15"
    },
    "ParentProcessGuid": "{C784477D-F5D4-6745-5E06-000000003F00}",
    "ParentProcessId": 6572,
    "ParentImage": "C:\\Windows\\SysWOW64\\more.com",
    "ParentCommandLine": "C:\\Windows\\SysWOW64\\more.com",
    "ParentUser": "DESKTOP-B0T93D6\\george"
  }
}
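
To see why this event triggers the rule, the sketch below hand-evaluates the rule's detection logic against the relevant EventData fields. It is a simplified stand-in for a real Sigma backend, with hypothetical function names: it assumes Sigma's default case-insensitive string matching, treats the list under selection_child as an OR, and treats "all of selection_*" as an AND across both selections.

```python
def endswith_ci(value: str, suffix: str) -> bool:
    """Case-insensitive suffix match, mirroring the |endswith modifier."""
    return value.lower().endswith(suffix.lower())


def matches_lummac_rule(event_data: dict) -> bool:
    # selection_parent: ParentImage|endswith '\more.com'
    parent = endswith_ci(event_data.get("ParentImage", ""), r"\more.com")
    # selection_child: a list of maps, so either item may match (OR)
    child = (
        endswith_ci(event_data.get("Image", ""), r"\vbc.exe")
        or event_data.get("OriginalFileName", "").lower() == "vbc.exe"
    )
    # condition: all of selection_*
    return parent and child


# Fields copied from the Sysmon event above.
event_data = {
    "Image": r"C:\Windows\Microsoft.NET\Framework\v4.0.30319\vbc.exe",
    "OriginalFileName": "vbc.exe",
    "ParentImage": r"C:\Windows\SysWOW64\more.com",
}
print(matches_lummac_rule(event_data))
```

Checking OriginalFileName as an alternative to Image means a renamed copy of vbc.exe would still satisfy the child selection.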

File Creation Related To RAT Clients

Sigma rule on GitHub: https://github.com/SigmaHQ/sigma/blob/fad4742996c55d8d4663e611f84877a2b741dc46/rules-emerging-threats/2024/Malware/Generic/file_event_win_malware_generic_creation_configuration_rats.yml

title: File Creation Related To RAT Clients
id: 2f3039c8-e8fe-43a9-b5cf-dcd424a2522d
status: experimental
description: File .conf created related to VenomRAT, AsyncRAT and Lummac samples observed in the wild.
references:
    - https://www.virustotal.com/gui/file/c9f9f193409217f73cc976ad078c6f8bf65d3aabcf5fad3e5a47536d47aa6761
    - https://www.virustotal.com/gui/file/e96a0c1bc5f720d7f0a53f72e5bb424163c943c24a437b1065957a79f5872675
author: Joseliyo Sanchez, @Joseliyo_Jstnk
date: 2024-11-15
tags:
    - attack.execution
logsource:
    category: file_event
    product: windows
detection:
    # VT Query: behaviour_files:"\\AppData\\Roaming\\DataLogs\\DataLogs.conf"
    # VT Query: behaviour_files:"DataLogs.conf" or behaviour_files:"hvnc.conf" or behaviour_files:"dcrat.conf"
    selection_required:
        TargetFilename|contains: '\AppData\Roaming\'
    selection_variants:
        TargetFilename|endswith:
            - '\datalogs.conf'
            - '\hvnc.conf'
            - '\dcrat.conf'
        TargetFilename|contains:
            - '\mydata\'
            - '\datalogs\'
            - '\hvnc\'
            - '\dcrat\'
    condition: all of selection_*
falsepositives:
    - Legitimate software creating a file with the same name
level: high

Sysmon event for: File Creation Related To RAT Clients

{
  "System": {
    "Provider": {
      "Guid": "{5770385F-C22A-43E0-BF4C-06F5698FFBD9}",
      "Name": "Microsoft-Windows-Sysmon"
    },
    "EventID": 11,
    "Version": 2,
    "Level": 4,
    "Task": 11,
    "Opcode": 0,
    "Keywords": "0x8000000000000000",
    "TimeCreated": {
      "SystemTime": "2024-12-02T00:52:23.072811600Z"
    },
    "EventRecordID": 1555690,
    "Correlation": {},
    "Execution": {
      "ProcessID": 2624,
      "ThreadID": 3112
    },
    "Channel": "Microsoft-Windows-Sysmon/Operational",
    "Computer": "DESKTOP-B0T93D6",
    "Security": {
      "UserID": "S-1-5-18"
    }
  },
  "EventData": {
    "RuleName": "-",
    "UtcTime": "2024-12-02 00:52:23.059",
    "ProcessGuid": "{C784477D-04C6-674D-5C06-000000004B00}",
    "ProcessId": 7592,
    "Image": "C:\\Users\\george\\Desktop\\ezzz.exe",
    "TargetFilename": "C:\\Users\\george\\AppData\\Roaming\\MyData\\DataLogs.conf",
    "CreationUtcTime": "2024-12-02 00:52:23.059",
    "User": "DESKTOP-B0T93D6\\george"
  }
}
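
The file-creation rule combines its fields differently, so here is a second hand-written sketch of its logic (again a simplification with hypothetical names, not a Sigma backend). Within selection_variants the two keys must both hold, while each list of values is an OR; matching is assumed to be case-insensitive, which is why "MyData\DataLogs.conf" in the event satisfies the lowercase values in the rule.

```python
REQUIRED = "\\appdata\\roaming\\"
FILE_NAMES = ("\\datalogs.conf", "\\hvnc.conf", "\\dcrat.conf")
FOLDERS = ("\\mydata\\", "\\datalogs\\", "\\hvnc\\", "\\dcrat\\")


def matches_rat_config_rule(target_filename: str) -> bool:
    path = target_filename.lower()
    # selection_required: TargetFilename|contains '\AppData\Roaming\'
    required = REQUIRED in path
    # selection_variants: endswith any file name AND contains any folder
    variants = path.endswith(FILE_NAMES) and any(folder in path for folder in FOLDERS)
    # condition: all of selection_*
    return required and variants


# TargetFilename copied from the Sysmon event above.
target = "C:\\Users\\george\\AppData\\Roaming\\MyData\\DataLogs.conf"
print(matches_rat_config_rule(target))
```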

Wrapping up

Detection engineering teams can proactively create new detections by hunting for samples that are being distributed and uploaded to our platform. Applying our approach helps develop detections for recent behaviors that existing rules do not yet cover, which in turn lets organizations build detections proactively from their threat hunting missions.

The Sigma rules created to detect Lummac activity have been used during threat hunting missions to identify new samples of this family in VirusTotal. Another use is translating them into the language of the SIEM or EDR available in the infrastructure, as they could help identify potential behaviors related to Lummac samples observed in late 2024. After passing quality controls and being published on Sigma's public GitHub, they have been integrated for use in VirusTotal, delivering the expected results. You can use them in the following way:

Lummac Stealer Activity - Execution Of More.com And Vbc.exe

sigma_rule:a1021d4086a92fd3782417a54fa5c5141d1e75c8afc9e73dc6e71ef9e1ae2e9c

File Creation Related To RAT Clients

sigma_rule:8f179585d5c1249ab1ef8cec45a16d112a53f91d143aa2b0b6713602b1d19252

We hope you found this blog interesting and useful, and as always we are happy to hear your feedback.

Important Update: IP Address Change for VirusTotal

We're making a change to the IP address for www.virustotal.com. If you're currently whitelisting our IP address in your firewall or proxy, you'll need to update your rules to maintain access to VirusTotal.

Starting November 25th, we'll be gradually transitioning the resolution of www.virustotal.com to a new IP address: 34.54.88.138. If you have hardcoded the previous IP address (74.125.34.46) in your firewall or proxy, you'll need to update your configuration to include the new IP address. This will ensure continued access to VirusTotal.
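
A quick way to audit an existing allowlist is to check whether it already covers the new address. The sketch below uses only the two IP addresses given above; the function name and the three status strings are hypothetical, and the allowlist entries may be single addresses or CIDR ranges.

```python
import ipaddress

OLD_IP = ipaddress.ip_address("74.125.34.46")
NEW_IP = ipaddress.ip_address("34.54.88.138")


def allowlist_status(entries: list[str]) -> str:
    """Classify an allowlist of IPs/CIDRs with respect to the address change."""
    networks = [ipaddress.ip_network(entry, strict=False) for entry in entries]

    def covered(ip) -> bool:
        return any(ip in network for network in networks)

    if covered(NEW_IP):
        return "ok"
    if covered(OLD_IP):
        return "needs update"   # the old address is pinned, the new one is missing
    return "not pinned"         # neither address is listed explicitly


print(allowlist_status(["74.125.34.46"]))
print(allowlist_status(["74.125.34.46", "34.54.88.138"]))
```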

TLS Certificate provider change:

We're also updating our TLS certificate provider, moving from a DigiCert wildcard certificate to a Google Trust Services single-host certificate. While this change should be seamless for most users, you'll need to update your configuration if you validate the certificate's signer or subject.

Note for Big Files API Users:

If you use the Big Files endpoint (https://docs.virustotal.com/reference/files-upload-url) for submitting files larger than 32MB, remember that it provides a URL pointing to the bigfiles.virustotal.com domain.

This domain is managed by a ghs.googlehosted.com load balancer, which uses dynamic IP address resolution. Please ensure your firewall rules can accommodate this.

We'll be implementing this change gradually starting on November 25th to minimize any potential disruption.

We understand that this change may require adjustments to your systems, and we appreciate your prompt attention to this matter. If you have any questions or concerns, please don't hesitate to contact us.

Friday, October 18, 2024

behavior, https, hunting, infrastructure, ja3, ja4, malware behavior, network capture, pivot, sandbox, threat hunting, tls, yara

Unveiling Hidden Connections: JA4 Client Fingerprinting on VirusTotal

VirusTotal has incorporated a powerful new tool to fight against malware: JA4 client fingerprinting. This feature allows security researchers to track and identify malicious files based on the unique characteristics of their TLS client communications.

JA4: A More Robust Successor to JA3

JA4, developed by FoxIO, represents a significant advancement over the older JA3 fingerprinting method. JA3's effectiveness had been hampered by the increasing use of TLS extension randomization in HTTPS clients, which made fingerprints less consistent. JA4 was specifically designed to be resilient to this randomization, resulting in more stable and reliable fingerprints.

Unveiling the Secrets of the Client Hello

JA4 fingerprinting focuses on analyzing the TLS Client Hello packet, which is sent unencrypted from the client to the server at the start of a TLS connection. This packet contains a treasure trove of information that can uniquely identify the client application or its underlying TLS library. Some of the key elements extracted by JA4 include:

TLS Version: The version of TLS supported by the client.
Cipher Suites: The list of cryptographic algorithms the client can use.
TLS Extensions: Additional features and capabilities supported by the client.
ALPN (Application-Layer Protocol Negotiation): The application-level protocol, such as HTTP/2 or HTTP/3, that the client wants to use after the TLS handshake.
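
These elements are visible in the fingerprint string itself. The sketch below splits a JA4 value into its three sections and decodes the readable first section. The field layout is my reading of FoxIO's published JA4 format and is not spelled out in this post, so treat it as an assumption; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ja4:
    protocol: str         # "t" for TLS over TCP, "q" for QUIC
    tls_version: str      # e.g. "13" for TLS 1.3, "10" for TLS 1.0
    sni: str              # "d" if SNI (a domain) is present, "i" otherwise
    cipher_count: int     # number of cipher suites offered
    extension_count: int  # number of extensions offered
    alpn: str             # first and last character of the first ALPN value, "00" if none
    cipher_hash: str      # JA4_B: truncated hash of the cipher suite list
    extension_hash: str   # JA4_C: truncated hash of the extensions


def parse_ja4(fingerprint: str) -> Ja4:
    a, b, c = fingerprint.split("_")
    if len(a) != 10:
        raise ValueError(f"unexpected JA4_A section: {a!r}")
    return Ja4(a[0], a[1:3], a[3], int(a[4:6]), int(a[6:8]), a[8:10], b, c)


print(parse_ja4("t12d5908h1_7bd0586cbef7_046e095b7c4a"))
```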

JA4 in Action: Pivoting and Hunting on VirusTotal

VirusTotal has integrated JA4 fingerprinting into its platform through the behavior_network file search modifier. This allows analysts to quickly discover relationships between files based on their JA4 fingerprints.

To find the JA4 value, navigate to the "behavior" section of the desired sample and locate the TLS subsection. In addition to JA4, you might also find JA3 or JA3S there.

Example Search: Let's say you've encountered a suspicious file that exhibits the JA4 fingerprint "t10d070600_c50f5591e341_1a3805c3aa63" during VirusTotal's behavioral analysis.

You can click on this JA4 to pivot using the search query behavior_network:t10d070600_c50f5591e341_1a3805c3aa63. This search returns additional samples that share the same JA4 fingerprint, suggesting they might be related: they may be part of the same malware family, share a common developer, or simply share a common TLS library.

Wildcard Searches

To broaden your search, you can use wildcards within the JA4 hash. For instance, the following search returns files that match the JA4_A and JA4_C components of the JA4 hash while allowing for variations in the middle section (JA4_B), which corresponds to the cipher suites:

behaviour_network:t13d190900_*_97f8aa674fd9

This technique is useful for identifying files that might use different ciphers but share other JA4 characteristics.
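
The same wildcard pattern can be derived from any full fingerprint. The sketch below builds the query and, as a local approximation of what the wildcard does, tests candidate fingerprints with Python's fnmatch; the function names are hypothetical and the local check is not how VirusTotal evaluates the search.

```python
from fnmatch import fnmatchcase


def ja4_wildcard_query(ja4: str, modifier: str = "behaviour_network") -> str:
    """Keep JA4_A and JA4_C, replace JA4_B (cipher suites) with a wildcard."""
    a, _b, c = ja4.split("_")
    return f"{modifier}:{a}_*_{c}"


def same_a_and_c(query: str, candidate: str) -> bool:
    """Local check: does a candidate fingerprint fit the query's pattern?"""
    pattern = query.split(":", 1)[1]
    return fnmatchcase(candidate, pattern)


query = ja4_wildcard_query("t13d190900_9dc949149365_97f8aa674fd9")
print(query)
print(same_a_and_c(query, "t10d070600_c50f5591e341_1a3805c3aa63"))
```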

YARA Hunting Rules: Automating JA4-Based Detection

YARA hunting rules using the "vt" module can be written to automatically detect files based on their JA4 fingerprints. Here's an example of a YARA rule that targets a specific JA4 fingerprint:

These rules will flag any file submitted to VirusTotal that exhibits the matching JA4 fingerprint during behavioral analysis. The first example only matches "t12d190800_d83cc789557e_7af1ed941c26". The second rule matches the regular expression /t10d070600_.*_1a3805c3aa63/, so it only constrains the JA4_A and JA4_C components and ignores the JA4_B cipher suite component. These fingerprints could be linked to known malware, a suspicious application, or any TLS client behavior that is considered risky by security analysts.

A few Interesting JA4 Hash examples

Description	JA4	Example SHA256
Linux miner, trojan	t12d5908h1_7bd0586cbef7_046e095b7c4a	caed9b2d91f5802da4b1844068e7df971d50a11411ff2a792aedce96554539f9
GoLang	t13d190900_9dc949149365_97f8aa674fd9	00b001f5d30e7a51bf9eced4e41267912353153dcc52605a737a6778aaecfbfb
SnakeLogger / Redline	t10d070600_c50f5591e341_1a3805c3aa63	03461c2a07431aed5ff68bbcf42d7ef82f32190b44ba140befd3f474614b5f3d

JA4: Elevating Threat Hunting on VirusTotal

VirusTotal's adoption of JA4 client fingerprinting will provide users with an invaluable tool for dissecting and tracking TLS client behaviors, leading to enhanced threat hunting, pivoting, and more robust malware identification.

Happy Hunting.

VirusTotal AI-Generated Conversations: Threat Intel Made Easy

At VirusTotal, we're constantly exploring new ways to make threat intelligence more digestible and available to a wider audience. Our latest effort leverages the power of AI to create easily understood audio discussions from technical information.

Using Google NotebookLM's innovative Audio Overview feature, we're transforming technical content into accessible audio experiences to make threat intelligence engaging and understandable for everyone. Instead of just reading, now you can listen and learn about complex cybersecurity concepts, no matter your level of expertise.

Our first AI-generated conversation is based on two blog posts (1, 2) where we previously discussed how Google's Gemini AI is being utilized for analyzing suspicious binaries. It's a meta experience, to be sure: Gemini, discussing its own role in analyzing malware. Listen to the full conversation below:

This initial experiment highlights how AI can make complex technical content more engaging and accessible. We are exploring the potential of incorporating similar AI-generated conversations into all future blog posts, providing another way to unpack and discuss the information we share. Stay tuned!

Thursday, August 29, 2024

research, threat hunting, vt intelligence

Exploring the VirusTotal Dataset | An Analyst's Guide to Effective Threat Research

By Aleksandar Milenkoski (SentinelOne) and Jose Luis Sánchez Martínez

VirusTotal stores a vast collection of files, URLs, domains, and IPs submitted by users worldwide. It features a variety of functionalities and integrates third-party detection engines and tools to analyze the maliciousness of submitted artifacts and gather relevant related information, such as file properties, domain registrars, and execution behaviors.

The VirusTotal dataset, the backbone of the platform, structures artifact-related information into objects and represents relevant relationships between them, providing contextual links between various artifacts. This makes VirusTotal a valuable resource for threat research, enabling users to perform activities such as clustering artifacts related to specific threat actors or campaigns, tracking malicious activities, and analyzing trends in the threat landscape.

In this post, part of a collaborative effort between VirusTotal and SentinelLabs, we explore how to effectively use VirusTotal’s wide range of querying capabilities, highlight scenarios in which these capabilities return informative results, and discuss factors that may impact the completeness or relevance of the data.

The content is aimed at VirusTotal users seeking to better understand the fundamental inner workings of the platform and how to effectively use it as part of their investigations. This contribution complements the comprehensive VirusTotal documentation by discussing certain aspects in greater detail along with a summary of relevant context and usage information, and demonstrating how VirusTotal capabilities are applied in real-world cases.

Overview

The VirusTotal platform analyzes files and network-related artifacts (URLs, domains, and IPs) submitted to the platform to detect maliciousness. The platform aggregates results from third-party detection engines, web scanners, and other tools to provide thorough analysis overviews.

VirusTotal stores submitted artifacts as well as information related to each artifact in a dataset, which we refer to as the VirusTotal dataset. The artifact-related information is extensive and diverse, including, for example, file properties such as filename, file type, digital signatures, and hashes, as well as URL components such as domains, URL paths, and URL query parameters.

VirusTotal provides interfaces for users to interact with the platform and search filters for querying it. The search filters allow for retrieving and pivoting through artifact-related information. Additionally, they enable clustering multiple artifacts and identifying newly submitted ones based on user-defined queries aimed at capturing overlaps in content or related information. This has many use cases in threat intelligence, such as identifying trends in the threat landscape, commonalities between different threats, and tracking specific threat groups or campaigns.

Below, we first provide an overview of the VirusTotal dataset structure and the different interfaces available for interacting with it, with a focus on querying the dataset using search filters. We then delve into search modifiers, a specific type of VirusTotal search filter, highlighting modifiers for querying data generated by artificial intelligence (AI) technology. Next, we discuss factors that may impact the relevance or completeness of results when querying using search modifiers, including an example of using search modifiers in an actual threat research investigation.

The VirusTotal Dataset

The VirusTotal dataset is the backbone of the VirusTotal platform. Current records indicate that it stores a vast amount of submitted artifacts and related information, including over 50 billion files, 6 billion URLs, and 4 billion domains. The data is stored in a structured and hierarchical manner.

The top-level structure of the VirusTotal dataset

Artifact-related information is structured into objects, which have an ID, a type, and attributes. Optionally, objects may also have one or multiple relationships.

VirusTotal objects

The object ID uniquely identifies an object. An object can be directly related to a submitted artifact: a file, URL, domain, or IP address. In this case, the object's ID is derived from the artifact itself: the SHA-256 hash for files, the IP address or domain itself for IPs or domains, and the SHA-256 hash or the Base64-encoded form for URLs.
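As a rough illustration of how such identifiers can be derived, the sketch below computes a SHA-256 file identifier and a Base64 URL identifier. It assumes that the URL form is URL-safe Base64 with the padding removed, which is our reading of the VirusTotal API documentation. The function names are our own and are not part of any VirusTotal library.

```python
import base64
import hashlib


def file_object_id(content: bytes) -> str:
    """Return the SHA-256 hex digest that identifies a file object."""
    return hashlib.sha256(content).hexdigest()


def url_object_id(url: str) -> str:
    """Return a URL identifier as unpadded URL-safe Base64 (assumed form)."""
    encoded = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
    return encoded.rstrip("=")


if __name__ == "__main__":
    print(file_object_id(b"example file content"))
    print(url_object_id("http://www.example.com/"))
```

The SHA-256 form of a URL identifier is not shown here, because it depends on how the platform normalizes the URL before hashing.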

An object type indicates the type of information stored by an object. For example, an object of type file stores file-specific information about submitted files, such as filenames, file hashes, and the file extension, whereas an object of type domain stores domain-specific information about submitted domains, such as the domain’s registrar. The objects of type file, url, ip, or domain are directly related to submitted artifacts, while the rest, such as threat_actor or reference, are not.

An attribute is a data item that stores information related to an object and can be of a primitive or a complex data type, such as an array or a structure. For example, the file object (an object of type file) includes the string attribute sha256, which stores the SHA-256 hash of a submitted file. The file object may also contain the attribute lnk_info, which is present only for Windows Shortcut (LNK) files. lnk_info is a structure containing LNK-specific information extracted from the submitted file, such as the date on which the shortcut was created.

Relationships signify connections between objects of the same or different types, making them particularly useful for describing scenarios involving multiple artifacts. For instance, the malicious file config.lnk (SHA-256 hash 85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61) communicates with the IP address 149.51.230[.]198 to download a payload.

In the context of VirusTotal, this relationship is represented through the communicating_files relationship of the ip_address object, which is directly related to 149.51.230[.]198. communicating_files stores information about all files, in the form of file objects, observed to communicate with the IP address during sandbox execution. VirusTotal executes submitted executable files in sandboxes to capture behaviors and artifacts visible only while the file is executing, such as started processes, network communications, changes to the file system, or strings present in process memory.

The communicating_files relationship
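To make the relationship concrete, here is a minimal sketch that builds, but does not send, an API request for the communicating_files relationship of an IP address. It assumes the API v3 relationship path pattern /ip_addresses/{ip}/{relationship}, the x-apikey authentication header, and the limit parameter; the helper name and the placeholder key are ours.

```python
import urllib.parse
import urllib.request

API_BASE = "https://www.virustotal.com/api/v3"


def communicating_files_request(ip: str, api_key: str, limit: int = 10) -> urllib.request.Request:
    """Build a GET request for the files observed communicating with an IP."""
    path = f"/ip_addresses/{urllib.parse.quote(ip, safe='')}/communicating_files"
    query = urllib.parse.urlencode({"limit": limit})
    return urllib.request.Request(
        f"{API_BASE}{path}?{query}",
        headers={"x-apikey": api_key},  # assumed authentication header
        method="GET",
    )


if __name__ == "__main__":
    req = communicating_files_request("149.51.230.198", "YOUR_API_KEY")
    print(req.full_url)
    # Sending the request needs a valid key: urllib.request.urlopen(req)
```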

Top-level collections group all objects of the same type. The top-level collections that VirusTotal currently implements are files (a set of all objects of type file), urls (a set of all objects of type url), ips (a set of all objects of type ip), domains (a set of all objects of type domain), collections (a set of all objects of type collection), threat actors (a set of all objects of type threat_actor), and references (a set of all objects of type reference).

The object of type collection is not to be confused with top-level collections. This object groups multiple objects of the same or different types given a user-specified context, such as a threat actor, a malicious campaign, or a malware family.

The top-level collections enable operations that relate to the set of all objects of a given type. Such an operation is submitting a new file for analysis, which adds an object of type file to the files collection.

Querying VirusTotal

VirusTotal exposes two interfaces for interacting with its dataset: the platform’s graphical user interface (GUI) for manual interaction and the application program interface (API) for programmatic interaction.

The VirusTotal GUI

The VirusTotal GUI is the web interface of the platform. To query VirusTotal using the GUI, users enter a search query into the search field. A search query is composed of one or multiple search filters.

A search filter can be a value uniquely identifying a submitted artifact – a URL, domain, file hash (MD5, SHA-1, or SHA-256), or an IP address. This filter cannot be combined with other filters and is used for retrieving information related to only a single submitted artifact.

When a user inputs a value, VirusTotal retrieves the corresponding artifact and related information from its dataset and displays this data to the user as a web analysis report. To retrieve the artifact and its related information, VirusTotal searches its dataset for an object that has an ID or an attribute (for MD5 or SHA-1 hashes) that matches the user-provided value. The platform also explores relationships to and from this object.

Web analysis report (search query: 85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61)

Querying using a value uniquely identifying a submitted artifact

A search filter can also be a search modifier in the format modifier:value, where value may be a predefined or a user-specified search criterion.

Each modifier is mapped to one or more of the top-level collections files, urls, ips, domains, and collections, forming sets of file-, URL-, IP-, domain-, and collection-specific search modifiers. Further information on these modifiers can be found in the official VirusTotal documentation.

Multiple modifiers can be combined into more complex search queries using the logical operators AND, OR, and NOT. Parentheses can be used to group modifiers and logical operators, allowing for more precise queries by controlling the order of operations. All modifiers within a search query must be mapped to a single top-level collection.

A special modifier is entity, which defines the top-level collection to which the search query is applied. For example, entity:file applies the search query to the files collection, and entity:url applies the query to the urls collection. If a user does not explicitly specify the entity modifier, the platform defaults to entity:file.

Structured overview of commonly used file-specific search modifiers (entity:file)
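The composition rules described above (modifier:value filters, the logical operators AND, OR, and NOT, parentheses for grouping, and the entity modifier for scoping) can be sketched as a few small string helpers. This is only an illustration: the helper names are hypothetical, and the code does no validation of modifier names or values.

```python
def modifier(name: str, value: str, quote: bool = False) -> str:
    """Format a single search modifier as modifier:value."""
    return f'{name}:"{value}"' if quote else f"{name}:{value}"


def all_of(*filters: str) -> str:
    """Group filters so that all of them must match (AND)."""
    return "(" + " AND ".join(filters) + ")"


def any_of(*filters: str) -> str:
    """Group filters so that at least one must match (OR)."""
    return "(" + " OR ".join(filters) + ")"


def negate(expression: str) -> str:
    """Exclude objects matching the expression (NOT)."""
    return f"(NOT {expression})"


def scoped(entity: str, expression: str) -> str:
    """Apply the query to one top-level collection via the entity modifier."""
    return f"entity:{entity} AND {expression}"


if __name__ == "__main__":
    print(scoped("domain", modifier("domain", "test")))
    print(scoped("file", all_of(modifier("type", "peexe"), negate(modifier("magic", ".NET", quote=True)))))
```

Keeping every group parenthesized makes the order of operations explicit, at the cost of slightly more verbose queries.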

For each modifier, VirusTotal searches for and retrieves objects within the collection that the modifier is mapped to. These objects have an ID or attribute, or a relationship to another object with an ID or attribute, that meets the search criterion. For example, entity:domain AND domain:test instructs VirusTotal to retrieve all objects of type domain whose IDs (the domains themselves) contain the string test. An exception is the content modifier, which instructs the platform to search through the content of submitted files.

In the case of a complex search query combining multiple modifiers using AND, OR, and/or NOT, VirusTotal combines the retrieved objects for each modifier into a resulting set that meets the combined criteria.
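One way to picture this combination step is with ordinary set operations: AND behaves like an intersection, OR like a union, and AND NOT like a set difference. The sketch below models that idea with made-up object IDs. It is a mental model only and makes no claim about how VirusTotal implements query evaluation internally.

```python
# Hypothetical object IDs retrieved for three individual modifiers.
signed_with_cert = {"file-a", "file-b", "file-c", "file-d"}
is_pe_executable = {"file-a", "file-b", "file-e"}
has_pdb_path = {"file-b"}

# signature AND type: objects present in both result sets.
and_result = signed_with_cert & is_pe_executable

# signature OR type: objects present in either result set.
or_result = signed_with_cert | is_pe_executable

# signature AND type AND (NOT metadata): remove objects matching the negated modifier.
final_result = and_result - has_pdb_path

if __name__ == "__main__":
    print(sorted(and_result))
    print(sorted(or_result))
    print(sorted(final_result))
```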

The platform then displays an overview of the resulting set of objects to the user in the form of a list of web analysis results. When a user clicks on an item in the list, the platform generates a web analysis report based on the corresponding object’s ID, as described previously.

List of web analysis results (search query: entity:domain AND domain:test)

Querying using search modifiers

In addition to scoping searches to a specific top-level collection, VirusTotal also uses the entity modifier to disambiguate between search modifiers mapped to more than one collection, such as fs (first submission date). For example, entity:file AND fs:2024-07-15 instructs VirusTotal to search the files collection for file objects where the first_submission_date attribute is set to July 15, 2024. In contrast, entity:url AND fs:2024-07-15 directs VirusTotal to search the urls collection for url objects where first_submission_date is set to the same date.

The VirusTotal API

A basic way to query VirusTotal using the API is to issue HTTP GET requests to API endpoints exposed by the platform and specify search filters as part of the request URLs. VirusTotal implements multiple endpoints, such as the following:

/api/v3/intelligence/search?query={query}: This endpoint allows querying VirusTotal in the same manner as the GUI, using a value that uniquely identifies a submitted artifact (a URL, domain, IP address, or file hash) or search modifiers. Example request URLs are https://www.virustotal.com/api/v3/intelligence/search?query=test.com and https://www.virustotal.com/api/v3/intelligence/search?query=entity:domain+and+domain:test.
/api/v3/files/{hash}: This endpoint retrieves the file object whose md5, sha1, or sha256 attribute matches the user-provided value. An example request URL is https://www.virustotal.com/api/v3/files/e6adf40a959308ea9de69699c58d2f25.

Querying using the /api/v3/files/{hash} API endpoint

VirusTotal returns JSON-formatted data in response to API requests. Users can parse this data and use it for additional actions, such as further querying and pivoting through the VirusTotal dataset.
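The following sketch shows one way to build a request for the /api/v3/intelligence/search endpoint and to pull object IDs out of a JSON response, using only the Python standard library. The x-apikey header and the top-level data list in the response are assumptions based on our understanding of API v3, and the sample response is invented for illustration.

```python
import json
import urllib.parse
import urllib.request

SEARCH_ENDPOINT = "https://www.virustotal.com/api/v3/intelligence/search"


def build_search_request(query: str, api_key: str) -> urllib.request.Request:
    """Build a GET request that passes the search query as a URL parameter."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{SEARCH_ENDPOINT}?{params}",
        headers={"x-apikey": api_key},  # assumed authentication header
    )


def object_ids(response_text: str) -> list:
    """Extract object IDs from a JSON response body (assumed 'data' list layout)."""
    payload = json.loads(response_text)
    return [obj["id"] for obj in payload.get("data", [])]


if __name__ == "__main__":
    req = build_search_request("entity:domain AND domain:test", "YOUR_API_KEY")
    print(req.full_url)
    # Invented response body, shaped like the assumed layout.
    sample = '{"data": [{"id": "test.com", "type": "domain", "attributes": {}}]}'
    print(object_ids(sample))
```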

Web requests can be issued using various methods, such as HTTP client libraries, command-line tools, or custom scripts. vt-py, the official Python library for the VirusTotal API, simplifies the process of sending web requests to endpoints and handling the responses, enabling users to perform various tasks programmatically.

The API vs. The GUI

There are several key differences between the VirusTotal GUI and API, particularly regarding scalability and the scope of available information.

The programmatic use of the API enables users to conduct large-scale querying of VirusTotal, which is not achievable through manual use of the GUI. For example, retrieving the names of the processes started by all Windows Shortcut files submitted to VirusTotal during 2024 is a task that is practically feasible only using the API.

Further, not all data stored in the VirusTotal dataset can be used as part of search queries using the GUI, which limits its querying capacity. For example, there are no search modifiers allowing users to query for URLs constructed in process memory during sandbox execution that submitted files have not contacted, such as secondary C2 URLs, which are contacted if communication with the primary C2 server fails. Although such search modifiers can be useful during investigations, the VirusTotal GUI displays these URLs in the Memory pattern URLs section of web analysis reports without providing a method to query for them directly.

The API endpoint /api/v3/files/{hash}/behaviours retrieves sandbox-generated data for files specified by their hash value. URLs discovered in process memory are stored in the memory_pattern_urls field returned by the endpoint. Users can therefore first retrieve files that meet other search criteria, then programmatically extract these URLs for each file and retain only the files for which a URL meets a specific search requirement.

memory_pattern_urls values
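A minimal sketch of this client-side filtering is shown below. It works on an already parsed JSON response and assumes that the endpoint returns a data list of sandbox reports, each with an attributes dictionary that may contain memory_pattern_urls. The sample data and helper names are hypothetical.

```python
from urllib.parse import urlparse


def memory_pattern_urls(behaviours_response: dict) -> set:
    """Collect memory pattern URLs across all sandbox reports in a response."""
    urls = set()
    for report in behaviours_response.get("data", []):
        urls.update(report.get("attributes", {}).get("memory_pattern_urls", []))
    return urls


def any_url_on_host(urls: set, host_suffix: str) -> bool:
    """Return True if any URL points to a host ending with host_suffix."""
    return any((urlparse(url).hostname or "").endswith(host_suffix) for url in urls)


if __name__ == "__main__":
    # Invented response fragment, shaped like the assumed layout.
    sample = {
        "data": [
            {"attributes": {"memory_pattern_urls": ["http://c2-backup.example/gate.php"]}},
            {"attributes": {}},
        ]
    }
    urls = memory_pattern_urls(sample)
    print(urls, any_url_on_host(urls, ".example"))
```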

The API may provide more information than what is visible to users in the GUI. In particular, the sandbox-generated data available through the two interfaces can differ. For example, the /api/v3/files/{hash}/behaviours endpoint retrieves all data generated by the CAPE sandbox for the file with the user-specified hash, including details on the suspicious behavior rules triggered by the file during execution. This information is not provided to users in the GUI.

CAPE suspicious behavior rules retrieved by /api/v3/files/{hash}/behaviours

AI Search Modifiers

VirusTotal leverages artificial intelligence (AI) to generate natural language summaries of the functionalities of code in executable files submitted to the platform, such as scripts, Microsoft Office documents, or binary files. This feature is particularly beneficial for malware analysis, assisting analysts in understanding the capabilities of malware under investigation.

VirusTotal integrates AI engines into its pipeline for analyzing submitted files. These engines use large language models (LLMs) trained on programming languages, which enable them to analyze and translate code into natural language summaries of its functionalities. Some AI engines also generate verdicts, which are labels categorizing analyzed code as benign, suspicious, or malicious. VirusTotal supports two types of engines: Code Insight and Crowdsourced AI.

Code Insight is VirusTotal's in-house AI engine implementation, based on Google’s Gemini. Crowdsourced AI is a collection of third-party AI engines contributed by the community and is continuously enriched with new additions.

Depending on their training and design, the AI engines specialize in analyzing specific file types. For example, the ByteDefend AI engine is designed to analyze macro code in Microsoft Office files, including Word, Excel, and PowerPoint documents.

The Code Insight engine focuses on script files, such as PowerShell, Python, and Ruby scripts. It excludes from analysis any script files that exceed a set file size or similarity threshold (these values are currently undocumented). The VirusTotal platform compares each script's code with scripts previously analyzed by the AI engine and calculates a similarity value. This value is then evaluated against the similarity threshold.

While we were writing this post, Google announced that Code Insight will also support Windows Portable Executable (PE) binary files. This feature is enabled by three interconnected phases:

Unpacking binaries using the malware analysis service Mandiant Backscatter: This step reveals the underlying code of potentially obfuscated (packed) malicious binaries submitted to VirusTotal, which is the intended subject of analysis.
Decompilation of the unpacked binaries using Hex-Rays IDA Pro decompilers: This step translates the assembly code of the unpacked binaries into decompiled code written in a higher-level programming language (pseudocode). Gemini analyzes decompiled code more efficiently than assembly code because decompiled code is more concise.
Analysis of the decompiled code with Gemini: This step generates a natural language summary of the decompiled code’s functionalities.

The Code Insight analysis workflow

The summaries generated by Code Insight and Crowdsourced AI engines for a given file are stored in the analysis fields of the crowdsourced_ai_results attribute of the corresponding object of type file. The optional verdict fields of this attribute store the verdicts, while the source fields indicate the AI engines that have analyzed the code within the file.

The crowdsourced_ai_results attribute
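To illustrate how these fields fit together, the sketch below filters the entries of a crowdsourced_ai_results attribute by engine and by text in the summary. It assumes the attribute is a list of dictionaries with analysis, source, and an optional verdict key, as described above. The engine names and summaries in the sample are invented.

```python
from typing import Optional


def summaries_mentioning(file_attributes: dict, text: str, source: Optional[str] = None) -> list:
    """Return AI results whose summary contains text, optionally for one engine."""
    matches = []
    for result in file_attributes.get("crowdsourced_ai_results", []):
        if source is not None and result.get("source") != source:
            continue
        if text.lower() in result.get("analysis", "").lower():
            matches.append(result)
    return matches


if __name__ == "__main__":
    # Invented attribute content, shaped like the assumed layout.
    attributes = {
        "crowdsourced_ai_results": [
            {"source": "engine_a", "analysis": "The script injects code into explorer.exe."},
            {"source": "engine_b", "analysis": "Downloads a payload.", "verdict": "malicious"},
        ]
    }
    for hit in summaries_mentioning(attributes, "explorer.exe"):
        print(hit["source"], hit.get("verdict", "no verdict"))
```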

Users can query VirusTotal for specific verdicts or content in AI-generated summaries using search modifiers designed for that purpose. These modifiers are mapped to the files top-level collection.

For each AI search modifier, VirusTotal searches the platform’s dataset for file objects whose analysis and/or verdict fields of the crowdsourced_ai_results attribute meet the user-specified criterion. In addition, based on the source field, each AI search modifier focuses the search on summaries or verdicts generated by either Code Insight, specific Crowdsourced AI engines, or all available AI engines.

Search modifier	Usage and search scope
codeinsight	codeinsight:[text] Searches for text in summaries generated by Code Insight.
crowdsourced_ai_analysis	crowdsourced_ai_analysis:[text] Searches for text in summaries generated by Code Insight and all Crowdsourced AI engines.
crowdsourced_ai_verdict	crowdsourced_ai_verdict:[benign|suspicious|malicious] Searches for benign, suspicious, or malicious verdicts generated by Code Insight and all Crowdsourced AI engines.
[ENGINE]_ai_analysis	[ENGINE]_ai_analysis:[content] Searches for text in summaries generated by a single Crowdsourced AI engine. [ENGINE] is the identifier for a specific engine, such as hispasec (hispasec_ai_analysis).
[ENGINE]_ai_verdict	[ENGINE]_ai_verdict:[benign|suspicious|malicious] Searches for benign, suspicious, or malicious verdicts generated by a single Crowdsourced AI engine.

VirusTotal introduces new engine-specific search modifiers ([ENGINE]_ai_analysis and [ENGINE]_ai_verdict) as new engines are incorporated into Crowdsourced AI. For example, with the addition of the ByteDefend engine, the platform released two new search modifiers: bytedefend_ai_analysis and bytedefend_ai_verdict.
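The naming pattern for engine-specific modifiers can be sketched as follows. The helpers are hypothetical and simply apply the [ENGINE]_ai_analysis and [ENGINE]_ai_verdict pattern from the table above. They do not check whether a given engine actually exists on the platform.

```python
ALLOWED_VERDICTS = {"benign", "suspicious", "malicious"}


def engine_modifiers(engine: str) -> tuple:
    """Return the (analysis, verdict) modifier names for one engine."""
    name = engine.lower()
    return (f"{name}_ai_analysis", f"{name}_ai_verdict")


def engine_verdict_query(engine: str, verdict: str) -> str:
    """Build a verdict query for one engine, rejecting unknown verdict labels."""
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unsupported verdict: {verdict}")
    return f"{engine_modifiers(engine)[1]}:{verdict}"


if __name__ == "__main__":
    print(engine_modifiers("ByteDefend"))
    print(engine_verdict_query("hispasec", "malicious"))
```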

The AI search modifiers can be combined with other AI search modifiers or with any other modifiers supported by VirusTotal using the logical operators AND, OR, and NOT. For example, the search query crowdsourced_ai_analysis:"inject" AND crowdsourced_ai_analysis:"explorer.exe" can be used to identify files that perform injection involving the explorer.exe process. The results returned from VirusTotal include the PowerShell script da.ps1, which injects code from an external file into this process. This functionality of the script is documented in the summary generated by the Code Insight AI engine.

da.ps1 injects code into explorer.exe

Code Insight analysis of da.ps1

Another example is the search query crowdsourced_ai_analysis:"Shell.Run" AND behavior_created_processes:"powershell.exe". This query can be used to identify files that invoke the Run function of the Windows Script Host Shell object to execute the PowerShell process powershell.exe for conducting further activities. The results returned from VirusTotal include the Visual Basic script 297641663, which executes a PowerShell command using the Run function to download a payload from a remote server.

297641663 executes powershell.exe

Code Insight analysis of 297641663

Although the AI engines integrated into VirusTotal provide valuable insights, they should be used as tools to assist in malware analysis efforts, as part of a broader analysis strategy. AI engines are designed and trained to analyze code based on historical data, and therefore may not always accurately interpret novel techniques or highly obfuscated code in malware implementations. As a result, the summaries they generate may sometimes lack sufficient or useful information for analysts.

Clustering With Search Modifiers

The extensive number of VirusTotal search modifiers enables analysts to query the platform in a practical and precise way. This allows for retrieving submitted artifacts and related information that are relevant to specific threats under investigation. However, false positives (where retrieved data is not related to the investigated threat) and false negatives (where relevant data is missing) can impact the relevance and completeness of search results.

The way in which queries are formulated is important for addressing or alleviating the impact of these challenges. Combining search modifiers using the logical operators AND, OR, and NOT and refining search queries helps reduce the likelihood of false positives and false negatives. This is an iterative process where analysts may integrate information obtained from multiple sources into their query formulations.

For example, malware analysis may provide characteristics suspected to be unique to the investigated activity cluster, such as specific file names, hashes, registry keys, network indicators, code signatures, strings or functions used by the malware, or distinct patterns of behavior. Additionally, information from previous reports documenting activities potentially related to the current investigation can also be beneficial. Upon reviewing the accuracy and completeness of query results, analysts may adjust the queries to further improve their relevance and precision. To illustrate these concepts, we provide an example from an actual threat research investigation.

Clustering Scenario

In 2023, SentinelLabs conducted an investigation into suspected China-nexus actors targeting Southeast Asian gambling companies. The investigation led to the AdventureQuest.exe Windows PE executable, which had been submitted to VirusTotal on May 11, 2023. Analysis of the file revealed it to be a malware loader implemented using the .NET framework, deploying further executables on compromised systems. These executables download archive files from attacker-controlled servers. The archives contain components for sideloading, including malicious DLLs that are sideloaded by legitimate executables to deploy the Cobalt Strike backdoor.

AdventureQuest.exe is signed with a certificate issued to the Ivacy VPN vendor PMG PTE LTD (certificate serial number: 0E3E037C57A5447295669A3DB1A28B8A). It is probable that the PMG PTE LTD signing key has been stolen, a tactic often used by suspected Chinese threat actors to sign their malware. Based on overlaps in code and functionalities with malware observed in Operation ChattyGoblin, AdventureQuest.exe is likely part of the same activity cluster.

The certificate’s serial number provides a starting point for identifying any other malware loaders submitted to VirusTotal that are signed with the same certificate and share implementation characteristics with AdventureQuest.exe, suggesting a potential link to the same threat actor or campaign.

VirusTotal uses the Sigcheck tool to extract digital signature information from submitted Windows PE files, including the serial numbers of code signing certificates. After extraction, this information is stored in the signature_info attribute of the corresponding file objects. Users can query VirusTotal for specific signature information using the signature search modifier.

The signature_info attribute (in the file object for AdventureQuest.exe)

The signature_info attribute (in the web analysis report for AdventureQuest.exe)

The query signature:"0E3E037C57A5447295669A3DB1A28B8A" searches for submitted files that have the serial number 0E3E037C57A5447295669A3DB1A28B8A in their signature information. The query returns 94 results, including both Windows PE executables like AdventureQuest.exe and other file types, such as Windows DLLs.

VirusTotal attempts to determine the type of each submitted file using third-party tools that search for magic numbers (byte sequences) and other types of signatures that identify specific file types, such as 0x4D 0x5A for Windows PE executables. These tools include the Unix utility file, Detect-it-Easy, and AI engines. The platform then inserts keywords indicating the file type in the type_tags attribute of the corresponding file object. Users can query VirusTotal for specific keywords stored in type_tags using the type search modifier.

The type_tags attribute (in the file object for AdventureQuest.exe)

The type_tags attribute (in the web analysis report for AdventureQuest.exe)
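The magic number check mentioned above can be illustrated in a few lines. This sketch only tests for the 0x4D 0x5A ("MZ") bytes at the start of the data. Real file type identification tools inspect far more than two bytes, so a match here is a hint rather than proof that the data is a valid Windows PE file.

```python
PE_MAGIC = bytes([0x4D, 0x5A])  # the ASCII characters "MZ"


def starts_with_pe_magic(data: bytes) -> bool:
    """Return True if the data begins with the MZ magic number."""
    return data[:2] == PE_MAGIC


if __name__ == "__main__":
    print(starts_with_pe_magic(b"MZ\x90\x00\x03\x00"))
    print(starts_with_pe_magic(b"%PDF-1.7"))
```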

Building on the previous search, the query signature:"0E3E037C57A5447295669A3DB1A28B8A" AND type:"peexe" narrows the results to submitted Windows PE executables. The query returns 31 results, some of which, like AdventureQuest.exe, are implemented using the .NET framework, while others are not.

The file and Detect-it-Easy tools may provide VirusTotal with information about the environments in which submitted executables are built. The platform stores the output from these tools in the magic and detectiteasy attributes of the corresponding file objects. Users can query VirusTotal for specific content in these attributes using the magic and detectiteasy search modifiers.

The magic attribute (in the file object for AdventureQuest.exe)

The magic attribute (in the web analysis report for AdventureQuest.exe)

The detectiteasy attribute (in the file object for AdventureQuest.exe)

The detectiteasy attribute (in the web analysis report for AdventureQuest.exe)

Building on the previous search, the query signature:"0E3E037C57A5447295669A3DB1A28B8A" AND type:"peexe" AND magic:".NET" further narrows the results to submitted executables built using the .NET framework. The query returns 13 results.

Closer examination of the resulting files shows that most have PDB paths, such as Ivacy.pdb. However, AdventureQuest.exe does not have a PDB path, which is typical for malware, as malware authors often strip executables of debug information. This suggests that the files with PDB paths may not be associated with the investigated activity.

VirusTotal extracts information from the header of each submitted Windows PE executable and stores this information in the pe_info attribute of the corresponding file object. Users can query VirusTotal for specific content in this attribute using the metadata search modifier.

The pe_info attribute (in the file object for AdventureQuest.exe)

Further, the resulting files with PDB paths resided at file paths in the Ivacy VPN installation directory, such as C:\Program Files (x86)\Ivacy\IvacyService.exe. For each submitted file, VirusTotal records the names under which the file has been submitted, which may be full file paths rather than just filenames. The platform stores this information in the names attribute of the corresponding file object. Users can query VirusTotal for specific content in this attribute using the name search modifier.

The names attribute (in the file object for AdventureQuest.exe)

The names attribute (in the web analysis report for AdventureQuest.exe)

Based on our insights and previous research on Operation ChattyGoblin, we know that the threat actors do not disguise their malware as Ivacy VPN components. This suggests that the files that had been located in the Ivacy VPN installation directory before submission to VirusTotal may be false positives. An analysis of some of these files using a .NET decompiler revealed that they are indeed legitimate Ivacy VPN components.

Building on the previous search, the query signature:"0E3E037C57A5447295669A3DB1A28B8A" AND type:"peexe" AND magic:".NET" AND (NOT metadata:".pdb") AND (NOT name:"Program Files (x86)\ivacy") further narrows the results to submitted executables that do not have PDB paths and had not been located in the Ivacy VPN installation directory before submission. The query returns one result, AdventureQuest.exe.
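The iterative narrowing used in this scenario can be sketched as a list of filters that are joined cumulatively. The snippet only builds query strings and uses the type modifier for the PE executable filter, as in the earlier steps. It produces one query per added filter, whereas the last step above adds the two NOT filters together.

```python
SERIAL = "0E3E037C57A5447295669A3DB1A28B8A"

# Filters in the order in which the investigation introduces them.
FILTERS = [
    f'signature:"{SERIAL}"',
    'type:"peexe"',
    'magic:".NET"',
    '(NOT metadata:".pdb")',
    r'(NOT name:"Program Files (x86)\ivacy")',
]


def cumulative_queries(filters: list) -> list:
    """Return one query per step, each adding the next filter with AND."""
    return [" AND ".join(filters[:count]) for count in range(1, len(filters) + 1)]


if __name__ == "__main__":
    for step, query in enumerate(cumulative_queries(FILTERS), start=1):
        print(step, query)
```

Reviewing the results after each step, rather than running only the final query, makes it easier to see which filter removes which files.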

This suggests that VirusTotal does not host other malware loaders that are signed with the same certificate as AdventureQuest.exe and are likely linked to the investigated threat cluster. However, the extensive number of VirusTotal search modifiers allows for the identification of such loaders based on characteristics beyond the used code signing certificate. For example, querying VirusTotal for a code segment specific to AdventureQuest.exe using the content modifier leads to further malware that is likely part of the same activity cluster. We leave this as an exercise for the reader.

Search queries and results

Clustering With Search Modifiers | Limitations

Certain aspects of how VirusTotal collects information on submitted artifacts, which users can query using search modifiers, may increase the likelihood of missing relevant findings in some search scenarios. This is particularly relevant given the third-party tools and functionalities that VirusTotal uses for collecting this information, such as sandboxes and detection engines. Each of these tools has specific limitations, which affect the quality and quantity of information VirusTotal collects and stores in its dataset. In this section, we highlight some of these limitations to help users understand how they impact querying VirusTotal with search modifiers.

As mentioned earlier, VirusTotal executes submitted executable files (executables and scripts) in sandboxes to capture behaviors and artifacts visible only during execution. Additionally, most of the sandboxes VirusTotal integrates can identify MITRE ATT&CK techniques exhibited during execution. This is accomplished through a set of rules that map observed behaviors to MITRE ATT&CK techniques.

For each submitted file, the sandboxes generate a report documenting captured activities, which are accessible to VirusTotal users. To facilitate systematic searching of sandbox-generated data, VirusTotal stores this data in an object of type file_behaviour. This object has a relationship to the file object that is directly related to the submitted file. Users can query sandbox-generated data using a variety of search modifiers, such as behavior_created_processes (searches for a name of a created process), behavior_files (searches for a name or path of an opened, written, deleted, or dropped file), or attack_technique (searches for a MITRE ATT&CK technique ID).

Sandbox-generated data in a file_behaviour object

VirusTotal’s sandboxes may not always capture relevant behaviors of executable files, for example, because of execution conditions that must be met or techniques that malware authors intentionally implement to evade sandbox analysis. Such execution conditions include required command-line parameters, library or platform dependencies, and external configuration or data files. In contrast to private submissions, VirusTotal automatically executes the vast volume of executable files continuously submitted to the platform's public corpus in sandboxes, without customizing their execution or execution environments. As a result, search queries with modifiers applied to sandbox-generated data, like behavior_files or attack_technique, may return incomplete results.

For example, the BlackCat ransomware requires operators to provide an execution password as a command-line parameter (referred to as an 'access token') for the malware to initiate encryption. For the BlackCat sample veros3.exe, the CAPE sandbox has not captured any file system activities, such as deleting, creating, or modifying files. When the correct access token is provided, this sample enumerates the files and folders on the filesystem of a compromised system and encrypts files as specified in embedded configuration data.

The CAPE sandbox report for veros3.exe

In addition to running executable files in sandboxes to capture their behaviors, VirusTotal is capable of extracting malware configurations from these files. To achieve this, the platform uses the Mandiant Backscatter malware analysis service, which implements configuration extraction modules to automatically extract configurations based on known implementation patterns. Users can search through extracted configuration data using the malware_config search modifier. However, the automated extraction might not work when the analyzed malware uses new or changed methods for storing its configuration data, which are not covered by the existing modules. As a result, search queries involving malware_config may return incomplete results.

It is also important to note that some information related to submitted artifacts may change over time. For example, the last_analysis_stats attribute of the file object stores the number of third-party detection engines that have labeled the corresponding submitted file as malicious. Users can use the positives search modifier to narrow searches based on whether this number is less than, greater than, or equal to a user-specified value. For example, positives:20+ narrows searches to files that have been labeled as malicious by more than 20 engines.

last_analysis_stats attribute
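For results that have already been retrieved, a comparable check can be applied on the client side. The sketch below assumes that last_analysis_stats is a dictionary with a malicious counter, which is how we understand the attribute to be structured. The helper names are ours, and the sample counts reuse the detection figures for UpdateCheck.dll discussed in this section.

```python
def malicious_count(file_attributes: dict) -> int:
    """Return the number of engines that labeled the file as malicious."""
    return int(file_attributes.get("last_analysis_stats", {}).get("malicious", 0))


def detected_by_more_than(file_attributes: dict, threshold: int) -> bool:
    """Return True if more than threshold engines labeled the file as malicious."""
    return malicious_count(file_attributes) > threshold


if __name__ == "__main__":
    before_report = {"last_analysis_stats": {"malicious": 7}}
    after_report = {"last_analysis_stats": {"malicious": 53}}
    print(detected_by_more_than(before_report, 20))
    print(detected_by_more_than(after_report, 20))
```

Because the count changes as vendors add detections, the same file can fail such a check at one point in time and pass it later.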

Setting positives to a relatively high number is a way to focus searches on files that are likely to be malicious. However, the returned results may not include malicious files for which an insufficient number of third-party detection engines have developed detections. The development of detections is fully in control of the engines' vendors and may depend on a variety of factors.

A common factor prompting vendors to develop a detection for a specific malware implementation is the public release of a threat research report listing files that implement the malware. For example, on September 21, 2023, SentinelLabs released a report on the Sandman APT group, identifying the UpdateCheck.dll file as malware used by this group. Prior to this date, on March 15, 22, and 29, 2023, the number of engines detecting the file as malware was 5, 6, and 7, respectively. Shortly after the release of the report, this number spiked to 17 and reached 53 by September 29, 2023.

Number of engines detecting the UpdateCheck.dll malware
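To make the timing effect concrete, the following sketch replays the dated detection counts quoted above against a threshold of 20. Only the observations with explicit dates in the text are included; the intermediate value of 17 is left out because no exact date is given for it. The variable and function names are illustrative.

```python
from datetime import date

# Detection counts for UpdateCheck.dll as quoted in the text (dated observations only).
detections = [
    (date(2023, 3, 15), 5),
    (date(2023, 3, 22), 6),
    (date(2023, 3, 29), 7),
    (date(2023, 9, 29), 53),
]
REPORT_DATE = date(2023, 9, 21)  # SentinelLabs report on Sandman
THRESHOLD = 20


def first_match(observations, threshold):
    """Return the first observation date whose count meets the threshold, or None."""
    for day, count in observations:
        if count >= threshold:
            return day
    return None


# Observations made before the report on which the file fell below the threshold.
missed_before_report = [
    day for day, count in detections if day < REPORT_DATE and count < THRESHOLD
]

print(first_match(detections, THRESHOLD))
print(len(missed_before_report))
```

Among these observations, the file first meets the threshold only after the report was published, so a search with this threshold run in March 2023 would not have surfaced it.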

Conclusions

Effectively using VirusTotal for threat research requires a good understanding of the platform’s wide range of querying capabilities, the scenarios in which these capabilities return informative results beneficial to investigations, and the factors that may impact the completeness or relevance of the data returned.

While the GUI provides an agile and user-friendly way to query VirusTotal, the API enables large-scale querying, offers expanded querying capabilities, and allows for retrieving more extensive information. Additionally, the AI engines that VirusTotal integrates can significantly speed up malware analysis efforts; however, their outputs should be considered as part of a broader analysis strategy as they may lack sufficient or useful information due to limitations in design or training data. Moreover, the extensive set of search modifiers provides flexible search capabilities, but the relevance and completeness of results can be impacted by false positives and false negatives.

SentinelLabs and VirusTotal are committed to sharing information and insights that help new users gain a solid understanding of the platform’s capabilities, enabling them to make full use of the available VirusTotal features and conduct thorough investigations.
