Monday, June 12, 2023

AI boosts Code Language and File Format identification on VirusTotal

We are pleased to announce that VirusTotal has improved the identification of programming languages and file formats through the implementation of Generative AI (artificial intelligence). Historically, automating these tasks has been quite challenging, especially when it comes to certain scripting and plain text file formats. However, with the aid of Generative AI, we have expanded our programming language coverage and file format identification capabilities, enabling us to directly overcome these hurdles.

Understanding File Formats: Overcoming Challenges in Identification
File identification utilizes a diverse set of techniques. The most straightforward and prevalent of these is file extension analysis. However, this method isn't foolproof, as it can be easily deceived by simply renaming the file. To circumvent this weakness and boost the precision and reliability of file identification, several techniques have historically been employed. These time-tested strategies encompass the use of magic numbers, signature-based scanning, and structural analysis.

"Magic numbers" are distinctive byte sequences, usually at the very beginning of a file, that act as a signature for its format. For instance, the magic number "MZ" (hexadecimal "4D 5A") signifies executable files in the DOS family and Microsoft Windows PE format. Similarly, the magic number "0x25 0x50 0x44 0x46" (the ASCII string "%PDF") represents PDF files.
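As a minimal sketch of the idea, and not a description of VirusTotal's actual implementation, a magic-number check can be as simple as comparing the first bytes of a file against a small table. The table and the function name `identify_by_magic` are illustrative assumptions; only the two signatures mentioned above are included.

```python
from typing import Optional

# Illustrative table (an assumption for this sketch): only the two
# magic numbers named in the text are listed.
MAGIC_NUMBERS = {
    b"MZ": "DOS/Windows executable",          # hex 4D 5A
    bytes([0x25, 0x50, 0x44, 0x46]): "PDF",   # ASCII "%PDF"
}


def identify_by_magic(data: bytes) -> Optional[str]:
    """Return a format label if the data starts with a known magic number."""
    for magic, label in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return label
    return None  # no known magic number at the start of the data
```

A Perl script, for example, starts with ordinary text and matches nothing in such a table, which is exactly the gap described later in this post.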

In addition to magic numbers, signature-based scanning plays a crucial role in detecting file formats. It matches byte patterns within files, particularly when magic numbers are absent or when dealing with complex formats. Signature-based scanning enables the identification of various file types, such as MP3 audio files, MPEG video files, SQLite databases, and more.
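The following is a hedged sketch of signature-based scanning, in which each signature is a byte pattern expected at a given offset. The signature list and the name `identify_by_signature` are assumptions made for illustration; the tar entry is not mentioned in the text above and is added only to show a pattern that does not sit at the start of the file.

```python
from typing import Optional

# (offset, pattern, label) triples. This short list is illustrative only;
# real scanners use far larger and more detailed signature sets.
SIGNATURES = [
    (0, b"SQLite format 3\x00", "SQLite database"),
    (0, b"ID3", "MP3 audio (ID3v2 tag)"),
    (0, b"\x00\x00\x01\xba", "MPEG program stream"),
    (257, b"ustar", "tar archive"),  # signature located inside the file
]


def identify_by_signature(data: bytes) -> Optional[str]:
    """Return the label of the first signature found at its expected offset."""
    for offset, pattern, label in SIGNATURES:
        if data[offset:offset + len(pattern)] == pattern:
            return label
    return None
```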

Structural analysis is another approach used to identify file formats, focusing on the file's structure, patterns, and other distinctive characteristics. This comprehensive approach allows for the accurate detection and classification of a wide range of file formats.
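To make the idea concrete, here is a small sketch of a structural check for the Windows PE format: beyond the "MZ" magic number, the DOS header stores at offset 0x3C a pointer to a second header that must begin with the bytes "PE\0\0". The function name `looks_like_pe` is a hypothetical helper, and a real parser would validate many more fields.

```python
import struct


def looks_like_pe(data: bytes) -> bool:
    """Check the MZ header and follow its pointer to the PE signature."""
    # The DOS header is 64 bytes long and starts with "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: 4-byte little-endian offset of the PE header, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    # The PE header must start with the signature "PE\0\0".
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"
```

A file that merely starts with "MZ" but has no valid PE signature at the referenced offset fails this check, which is what distinguishes structural analysis from a plain magic-number match.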

Despite the success of these techniques, challenges arise when dealing with plain text files, especially programming and scripting languages. Unlike other file types, plain text files lack clear magic signatures or well-defined structures, making it more difficult to accurately determine their content.

Advantages of AI-based Code language and File format identification
To overcome those challenges, VirusTotal has embraced the power of Generative AI based on Sec-PaLM, an LLM (large language model) extensively trained on a vast corpus of data to learn the underlying patterns, grammar, and vocabulary of programming and scripting languages. This new capability empowers the analysis of clear text code snippets within files, resulting in improved programming language identification and bringing significant advantages:

  • Increased Accuracy: The LLM's training on extensive code repositories equips it with an in-depth understanding of programming and scripting languages. This enables VirusTotal to achieve higher accuracy in language identification, reducing false positives and ensuring more reliable results.
  • Real-Time Adaptability: The LLM-based approach facilitates dynamic learning, allowing VirusTotal to adapt and evolve as new programming languages and frameworks emerge. This ensures that VirusTotal remains at the forefront of language identification, even in the face of rapidly changing technology landscapes.
  • Expanded Language and File Format Coverage: With the aid of generative AI, VirusTotal can identify a wider range of programming and scripting languages, including both mainstream and niche languages. This enhanced coverage empowers developers, researchers, and security professionals to detect potential vulnerabilities and threats across diverse codebases.

This new feature refines the existing capabilities of our system. Specifically, when dealing with previously supported file types, this new method directly changes the 'File type' attribute. To illustrate this, consider the following case, in which Magic recognized a file as 'text'. However, when the AI system analyzed the same file, it concluded that it was 'Perl'. Consequently, the file type was adjusted to 'Perl':

Perl sample

In this other example, we observe how Magic categorizes the file as "text", while AI identifies it as "AutoIt". Since "AutoIt" is not a commonly supported file type, it is added as a secondary or additional label in the File type section:

AutoIt sample

By utilizing multiple tags, we prevent disruptions to established processes, canned searches, and YARA rules reliant on the "text" tag, while enabling the development of new use cases with more detailed labels. Our approach focuses on improving the platform while respecting backward compatibility, avoiding interruptions, and ensuring seamless integration with existing user workflows.
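The two cases above can be summarized in a short sketch. This is only an interpretation of the behavior described in this post, not the platform's actual code: the function `merge_labels` and the contents of `SUPPORTED_TYPES` are hypothetical.

```python
from typing import List, Optional

# Hypothetical set of already-supported file types (an assumption for
# this sketch; the real list is not given in this post).
SUPPORTED_TYPES = {"Perl", "PHP", "JavaScript"}


def merge_labels(magic_type: str, ai_type: Optional[str]) -> List[str]:
    """Combine the Magic result with the AI result as described in the text."""
    if ai_type is None or ai_type == magic_type:
        return [magic_type]            # nothing new to add
    if ai_type in SUPPORTED_TYPES:
        return [ai_type]               # supported type: replace 'File type'
    return [magic_type, ai_type]       # otherwise keep "text", add a second label


print(merge_labels("text", "Perl"))    # ['Perl']
print(merge_labels("text", "AutoIt"))  # ['text', 'AutoIt']
```

Keeping "text" in the second case is what lets existing searches and YARA rules that rely on that tag continue to match.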

In either scenario, the crucial aspect is that these labels form part of the file identification. They are indexed and can be used to conduct searches in VirusTotal Intelligence using the "type:" modifier. For example, a query such as "type:AutoIt positives:1+ fs:2023-06-01+" will return files labeled as AutoIt, first seen from June 1, 2023 onwards, with at least one positive detection.
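For readers who generate such searches programmatically, the query string can be assembled as in the sketch below. The helper `build_type_query` is a hypothetical name, and the URL parameter name `query` used in the encoding step is an assumption; only the modifiers shown in the example above ("type:", "positives:", "fs:") are used.

```python
from datetime import date
from urllib.parse import urlencode


def build_type_query(file_type: str, min_positives: int, first_seen_from: date) -> str:
    """Build a search string using the type:, positives: and fs: modifiers."""
    return (
        f"type:{file_type} "
        f"positives:{min_positives}+ "
        f"fs:{first_seen_from.isoformat()}+"
    )


query = build_type_query("AutoIt", 1, date(2023, 6, 1))
print(query)  # type:AutoIt positives:1+ fs:2023-06-01+

# URL-encode the query so that spaces, ':' and '+' survive transport.
# The parameter name "query" is an assumption for this sketch.
encoded = urlencode({"query": query})
```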

The range of programming languages and file formats that this new AI-based feature can identify is very extensive. In just the first few hours of processing "text" files from VirusTotal, it identified over 100 different formats. Here are some examples: 6502 Assembly, AutoHotKey, Bash, Batch, C#, Delphi, EDS, INI, Inno Setup, JavaScript, Go, LESS, Lua, Mathematica, PHP, Solidity, QML, SQL, R, Ren’Py, Rust, and Vimscript.

We are confident that this new capability will empower security analysts to hunt threats hiding in rare code languages or unique file formats. As we leverage the power of AI, we greatly value your feedback. Stay tuned for updates and thank you for your ongoing support.

