Wednesday, September 15, 2021

Introducing VT Alerts

 360-degrees insights into your assets


Many VirusTotal’s users deploy rules to monitor that their assets, including domains, IP ranges and intellectual property are not being abused by any attacker. Today we are proud to introduce VT Alerts, a solution designed to help detect any abuse in (almost) real time.

Before going into more details, we invite you to join us next September 28th at 17:00 CEST for our VT Alerts webinar if you want to know everything about it.


What is VT Alerts?


VT Alerts is designed to provide a feed-like experience providing different severity notifications on anything we detected in VirusTotal that matches your infrastructure and intellectual property. It offers you an easy way to create watchlists where you can include all the network-related assets of your company, as well as terms related to your brands and intellectual property. 

VT Alerts will take it from here, automatically monitoring anything new we detect in VirusTotal matching this criteria and providing you with alerts on anything suspicious we find in the VirusTotal ecosystem. 

Watchlists and notifications

We have designed three types of watchlists: Domains, IPs and Brand. Here is how it works.

For networking infrastructure there are two types of watchlists. The first one is for Domains and basically expects a list of, well, domains, and the second is for IP Ranges belonging to your infrastructure that you want to monitor. Once this is defined, alerts are created when any of the following criteria is met:

  • Any URL under any domain or subdomain in the Domains watchlist is detected as malicious in VirusTotal.
  • Any malicious file contacts a URL under any domain or subdomain in the Domains watchlist.
  • A malicious file is downloaded from a URL under any domain or subdomain in the Domains watchlist.
  • We detected any URLs or domains reusing any domain or subdomain in the Domains watchlist.
  • We find any Domain or IP in the content of any malicious file.
  • Domain typosquatting notifications.

Brand protection works a bit differently, as here you can provide any terms related to your business, brands and intellectual property. We will search for these terms in any suspicious network infrastructure and communication to detect abuses. 

For example, we will search for any link including any of the keywords from your list of terms to monitor, which is a very usual tactic used by attackers in phishing campaigns. We will also check if any of the terms are found in any website we analyse, flagging potential abuse by third-parties. These terms could be any strings that are characteristic to your brand or that are candidates to be reused by attackers, for example, in phishing websites cloning yours. Additionally to these strings, we will need your domains (allow-list) so we make sure every time we find something suspicious it does not belong to you. 

Alright, now everything is set and ready in your Brand watchlist. VT Alerts will let you know when:

  • Any domain outside your infrastructure reuses any of your favicons.
  • Any URL outside of your infrastructure uses any of the strings you defiled.

And that would be it. Next: let’s learn more about the Alerts.


Alerts

After everything is set, we will start receiving alerts in our dashboard according to the criteria defined above. 


Each notification we get provides the following information:

  • Creation timestamp - when the alert was created
  • Severity (Info, Low, Medium, High) - depending on the detection ratio
  • Event type (URL sighting, File download, File contact, Domain sighting, IP sighting)
  • VirusTotal verdicts’ snapshot at notification time
  • Type of entity involved in the alert (Domain, URL, IP, File)
  • Detection Category
  • Use-Case Category

All this data is presented in a simplified form in the dashboard, but you can click on any of the items of the alerts to pivot to its VirusTotal report (for example, in the case of a domain or file) or you can click on the magnifying glass to get full information of the event.

All this information is also ready to be automated through our API.

We also added different options to make filtering noise out easier, or to quickly pinpoint worrisome alerts. For example, we can filter by time of the alerts, severity or even categorize potential false positives straight away.

Or use any of combination of the filters provided:


What now?

Notifications are the starting point for your investigation. Alerts contain all VirusTotal’s links to any involved entities, so you can continue your investigation as you’d normally do.

FAQ

Q: Where is all this data coming from?

A: We use the VirusTotal corpus to watch for events to generate notifications on. Alerts won’t crawl nor visit any of the domains nor IPs that you create watchlists for.

Q: Is any of my information made public?

A: All of VT Alerts’ data (watchlists and feeds) is private to your VirusTotal group and is not published, directly or indirectly, anywhere else. Keep in mind that the suspicious artefacts found in the alerts, like URLs and files, are still accessible to other VirusTotal’s users.

Q: Can I get notifications pushed to my email/vendor/etc.?

A: We’re gathering feedback on the most sought-after delivery channels and adding those in the coming weeks.

Q: Are VT Alerts running retroactively?

A: No. At the moment, you will start receiving notifications once your alerts are set and ready.

Future Features and Parting Words

VT Alerts will be evolving quickly in the following months, with more features in the pipeline like notifications for WHOIS changes, improved typosquatting alerts, more brand protection notifications, watchlist management for large-scale users, and many more.

We’re excited about VT Alerts future, and we hope it will become your tool of choice for asset monitoring. Please let us know if you are interested or if you have any ideas how to make it better.


Happy hunting!


Friday, September 03, 2021

Applied Yara training - Q&A

by Vicente Diaz (@trompi) from Virustotal, Costin Raiu (@craiu) from Kaspersky and with the kind support of Victor M. Alvarez (@plusvic) from Virustotal

Introduction

On August 31, 2021 we ran a joint webinar between Virustotal and Kaspersky, with a focus on Yara rules best practices and real world examples. If you didn’t have the chance to watch the webinar live, you can see it as a recording on Brighttalk: Applied Yara training.

During the webinar we received an overwhelming response and we would like to thank all the participants for sharing their thoughts, questions and ideas; most of all, we are happy to see so much interest and enthusiasm for Yara!

During the 90 minutes of the webinar we only had the chance to answer a fraction of the questions we received. We would still like to answer the remaining ones, since we thought a lot of them are quite relevant to real world situations, practices and could be useful to other security practitioners. Even better, for the more tricky questions we decided to ask for help from the creator of YARA itself, Victor Manuel Alvarez (aka Hector Manuel Velasquez) who will help answer them. If you have further questions, please feel free to send them to us in the comments section. We will be happy to answer them too!

Stay safe, stay secure and Happy hunting!

Costin, Vicente and Victor

RULE WRITING


Q: How difficult is it writing a YARA rule for obfuscated payloads? What file features normally you experts often look into when it comes to obfuscated files? How can YARA help?  What would be your tip / best practices for writing rules to catch obfuscated binaries? 

Vicente here. Obfuscated files are tricky, but YARA can still be useful. We can use all the metadata and file geometry for the detection. Also, depending where the obfuscation is, maybe we can also use some portions of the code for the detection - for instance if it only obfuscates strings with some custom method, maybe the code used for obfuscation can be useful. 

Costin here. In general, it is a lot more difficult to write Yara rules for obfuscated payloads. Depending on the obfuscation method, one can still find some ways to detect them. For example, assuming a specific cryptor or packer was used, you can still write a rule for the packer (eg. UPX) or, rely on an unpacking engine to give you the plain code. Some platforms, such as VirusTotal, would automatically unpack known tools for you, which allows one to write simple Yara rules for the unpacked code. When the obfuscation is polymorphic, for instance, the code is expanded with dummy instructions and operands are split within several operations, once can try to use other file properties, such as metadata, entropy, import hashes or other data which stays constants across different generation. In short, there is no rule on how to write rules for obfuscated code, but in some cases, it is possible. 

Q: Do hashes used for API hashing for example would be helpful for a YARA rule? Since this can change in a future campaign, i'm in doubt 

Costin here. Absolutely - API hashing can be super useful for detection of malware, especially when there are very few unique strings that can be used in the rule, or, when the malware is otherwise obfuscated. Yara actually provides a nice easy to use solution - the pe module implements the standard Mandiant import hash function as pe.imphash(). This can be used in a Yara rule condition such as:


Q: Are you trying rules against a set of malware before releasing it ?

Costin here. Yes, extensive QA of a Yara rule is critical for us before releasing it publicly or to customers. For this purpose, we use several internal databases, of both malicious files and known clean code. To make sure the Yara rule detects more than just one sample, we try to run it against our entire malware collection, which is over 5 petabytes at the moment. Sometimes, when time is essential, we run it on a subset of the malware collection, such as specific PE files received during the last 12 months, or, say, script files. In many cases, testing a rule on clean files is even more important than testing it on malware! Based on our experience, we’ve seen countless instances of Yara rules that were published by security companies or even governmental agencies which produced false positives when used on a real system. To simplify testing of Yara rules, you can either use VirusTotal’s Retrohunt feature, or set up a test of your own collections using our open source KLARA project

Q: Won't typos also restrict the Yara rule to a particular sample or samples distributed under one campaign? 

Vicente here. That could happen, but maybe the typos are difficult to replace. For example, sometimes typos are in the commands that the malware receives from the server, which would need a redeployment of server and client side malware from the attacker. In some other cases attackers get fully unaware of the typo, maybe because they are not familiar with the language and just use a weird expression that no native speaker would use, and can stay there forever. We can find these typos everywhere, including metadata, comments, etc.

Q: Where are good sources for large amounts of known good clean data? 

Costin here. There are several public, free good sources of known good clean data. Some have suggested the NIST reference set, however, you can also build your own from things like a Windows installation, Linux install, Android/iOS dumps and the likes. It’s important to also have third party software, such as Chrome, Firefox or Adobe Reader. A lot of false positives produced by publicly available Yara rules occur on software such as the ones above!

Q: How useful are Yara Rule Generators like Florian Roth's yarGen? 

Vicente here. Yara rule generators are VERY useful, but we do not recommend using raw generated rules without an extra round of manual polishing. We believe it is more useful to use it for a first round to extract potentially relevant strings from a collection of samples we are analyzing, and from here use these results to help us build more refined rules.  

Q: Is it better to use wide, wide ascii, both, or none?

Costin here. Using wide, ascii or both (or none) depends on a case by case basis. Depending on how a malware piece is compiled, the strings you see inside could be ascii (single byte) or wide (double byte, Unicode). Yara allows you to easily search for UTF-16 strings through the wide modifier. By default, ascii is used every time you assign a string to a variable, however, if you want to search both, you need to use “ascii wide”.  Adding “wide” to the ascii strings you are searching for might find you additional stuff.

Q: If a rule generates false positives on a very specific sample, would you suppress a specific hash in the rule directly or rather improve the whole logic?

Vicente here. The answer is - “it depends”. If there is a very particular binary that produces a false positive in an otherwise solid rule, we can keep the rule as long as we know what we are doing - for example we use it for hunting privately, and just exclude that specific FP file by hash. In case this rule would be used externally or automatically, for instance in a production environment for detection, then better to avoid this false positive in the logic.

Q: some programming languages have automatic formatters (like 'black' in Python, or gofmt in Golang) -- do you recommend something similar for Yara, to maintain good formatting across a team?

Víctor here. As far as I know such a tool doesn’t exist. There are some alternative parsers for YARA, which are able to read a YARA source file, build an abstract syntax tree (AST) from it, and regenerate the source code from the AST. But they have limitations that make them unsuitable for building a tool like gofmt (for example the comments are completely lost).

YARA INTERNALS


Q: Does yara "digest" and optimize  a ruleset for example user writes "i001" and "i001"

Víctor here. YARA doesn’t perform any optimization on the condition, they are evaluated exactly as they are written. No attempt is made to detect useless branches in the condition, if you write “X or Y or Z or true” YARA evaluates X, Y and Z, it is not smart enough to realize that the condition is always true. Also, if you use the same pattern/string in different rules they are treated as if they were different, YARA doesn’t realize that you are searching for the same string. 

Q: Can YARA run on only files with Mac OS xattributes for example com.apple.metadata:kMDItemDownloadedDate, com.apple.quarantine?

Vicente here. At the moment, I’m not aware of any module able to interact with them. 

Q: I have seen that hash is case sensitive .. i have to use lower case . Will this change in future? 

Víctor here. The next version of YARA will include the icompare operator for case-insensitive string comparison. The == operator maintains its case-sensitive semantic when used for comparing strings, but with icompare you don’t need to worry about whether the values returned by the hash module are upper or lower case.    

Q: Does YARA see filename in the file set?

Vicente here. At the moment, you cannot specify any condition about the filename in a rule, as YARA is designed just to check the content and structure of the file and the file name can easily change. Nevertheless, this option exists in case you use the VT YARA module, which is available to take advantage of file metadata for YARA rules running on VirusTotal. You can also use the -D option to define a variable in the rule that contains the filename.

Victor here. I would like to provide more context about this. The reason for not including a “filename” keyword in a similar way to “filesize” is because “filesize” makes sense in almost any context (except when scanning a process address space) as the data being scanned always have a size. However, “filename” makes sense when you are using the yara command-line tool for scanning your hard drive, but it doesn’t make sense where YARA is scanning the data without knowing where the data comes from. However, due to the high number of times I’ve seen this question asked, I’m considering some intermediate solution, like allowing the command line tool to define a variable that automatically takes the name of the current file.

Q: Is it possible to use YARA to monitor strings in PDF documents? Or is it medium-dependent?

Vicente here. It is possible. However keep in mind that strings are not represented inside the file the same way they are displayed to a reader. It is important to check the content of the file itself with utilities such as “strings” or with any hex editor, and select which strings can be used for a YARA rule. But it will not necessarily work if you select strings based on how they appear in the PDF when displayed. 

Q: Is YARA compatible with the ELK stack?

Vicente here. There are different available options, like plugins to incorporate events matching YARA rules into Elastic, for example. 

EFFICIENCY

Q: Is uint16(0) faster than $magic at 0? (Where $magic is the hex value)

Vicente and Costin here. The $magic check should be considered obsolete and is hopefully not used anymore in any public rules. Please kindly use uint16(0) or the Magic module instead. This is because defining a string such as $mz=”MZ” will cause Yara to search for this short string and save all the matches in a file, no matter the offset, which greatly increases the resource overload. This slows down scanning and eats up more memory. In general, short strings should be avoided in Yara rules for this reason, or always followed by the “fullword” modifier.

Q:  In your experience, is CPU or I/O the bigger constraints? Can YARA run at low-priority? only on user idle restricted to time schedule? I am thinking about tunability like worldCommunity grid. Does YARA use sophisticated I/O scheduling? 

Víctor here. That depends a lot on your rules. If you have a few high-performance rules the bottleneck will certainly be I/O, but it can change drastically if you have many rules or they are not very fast. YARA doesn’t do anything special with regards to I/O, it just uses memory-mapped files and relies on the operating system for I/O. Both in Windows and Linux, YARA tells the operating system that files are likely to be read sequentially, so that the OS can take that into account and read-ahead aggressively. 

Q: Hello. Is the “nocase” option efficient? We can never know what case we can find in a malware file, and maybe the “nocase” option increases the search time and the used resources. How to determine if we need this option? 

Víctor here. It depends on the string, but generally speaking “nocase” introduces a performance penalty and it should be avoided if possible. Many people use “nocase” and “wide” just in case. They are not really sure about it, but they use them because it doesn’t do any harm (or so they think). My advice is doing exactly the opposite, if you don’t have a clear reason for using those modifiers, don’t do it. If you know for sure, or have good reasons for believing that the string can appear in some arbitrary casing, use “nocase”, but avoid it if otherwise.

Q: My understanding is that the 'filesize' condition is evaluated AFTER all strings have been processed. If this is the case, then filesize cannot be used to make rules more efficient by eliminating samples based on filesize BEFORE strings are evaluated. Is my understanding correct? If so, do you know the reason why filesize is evaluated after strings? 

Víctor here. That’s 100% correct. The reason for that behavior is that YARA is optimized for the case in which you have many rules. From the performance standpoint the best thing to do when you have a lot of rules is scanning the file in a single pass first, looking for all the strings from all the rules at once, and then evaluate all the conditions. Evaluating the conditions first, searching for each string individually as they are used in the condition does not scale well when you have thousands of rules. With a very high number of rules the odds are that you are going to need reading and scanning the file anyways, as at least one of your conditions won’t filter out the file because of its size.

USING YARA


Q: if you have a mature team, and a large set of rules that were developed over time --- do you have any thoughts on how to go back and re-evaluate if rules are still good, or need to be updated? (and how often, etc.) -- or other ways to track metrics on each rule. ('hey, this one fired once, and it was a great find -- this 2nd one fires with every SCR, and needs some work') 

Vicente here. This is an excellent question. What would be important is to create a policy based on your needs and how you use these rules. Some of the rules can be valid for years while others can change very rapidly, and that depends both on the rule itself and on the threat they monitor. It is always good to have a baseline detection per rule (what it should be detecting) and find alternative methods to double-check these detected samples indeed belong to the family/actor you are monitoring. From this point, you need to work on keeping updating these rules, detection methods and detected samples triad regularly, polishing rules as needed and replacing them once they are not relevant anymore.

Q: Is there a way to centrally run YARA rules to all network workstations without depending upon other third party tools like Nessus or supported antivirus platforms? 

Vicente here. There are different EDRs and utilities to do so. However this kind of practice tends to be overkill in most cases. We recommend carefully selecting what rules to run, what folders and adding further conditions to the rules in order to avoid scanning unnecessary files (you can play with conditions such as file format, size, timestamps, etc).

Q: We have tried YARA ourselves and there's no doubt about the capabilities. What we want to understand is what is the correct way to leverage Yara? Shall we scan some workstations regularly with industry/region specific threat intelligence? Or a better approach could be to run periodic scans based on targeted hypothesis based threat hunting? Shall we only run it to investigate incidents?

Vicente here. This totally depends on your goals, as all the above are very usual use cases. In most of the cases, you want to regularly scan a small percentage of sensitive or suspicious files on a regular basis. The same with your rules, you want to maybe use the ones corresponding with active and relevant threats. In parallel, you will always have your collection active for hunting and ready for any IR/forensic if needed.

Q: Can we also search within memory with YARA?

Vicente here. It is possible. There are several EDRs and tools that allow you to do so. Also, it is always possible to use YARA against a memdump, for instance using Volatility. Last but not least, you can directly scan a running process, if you know it’s PID, by running “yara rule.yara PID”.

Q: Is it possible to utilize hash detection with YARA?

Vicente and Costin here. Yes, it is. You can use the “hash” external module and calculate different hashes for files or sections.  Here’s an example of such a rule from Github Xprotect.yara:



Q: What do you recommend for managing a large collection of YARA rules? (Deduping, updating versions, etc) 

Vicente here. We do not recommend any utility in particular for this, many researchers simply use GitHub. Our recommendation in terms of procedures would be always checking that rules detect what they are supposed to detect and checking against false positives, and to do this regularly. One utility you can use in this direction is YARA-CI. We also recommend having different collections for different purposes, for instance Incident Response, Forensic, Hunting, Mem-scanning, etc. 

Q: How can we start using YARA for malicious email attachments  

Vicente here. There are different mail security appliances that allow you to do so. You can always check attachments separately.

Q: For memory dump, I was not able to scan directly by attaching it, rather I had to mount its file system (E.g. MemProcFS) before Yara can work. Am I correct? 

Vicente here. Right. Another option is using the dedicated utility in Volatility to run YARA against memdumps.

VIRUSTOTAL


Q: Will VT allow us to scan older parts of the corpus with retrohunt? 

Vicente here. At the moment, VirusTotal Retrohunt works on a 90 day back index that can be expanded to 1 year depending on your subscription. 

Q: Yara rules can be shared with the VT community? or these  only can be shared from user to user

Vicente here. We incorporate a number of repositories from the community as crowdsourced YARA rules, you can find a list of contributors here. If you would like to contribute, please contact us!

Q: Can we find strings inside pdf files on virusTotal with YARA rules? we can see it working on other files types but not on pdf, we didn't really see it working  and could not find something about it on the VT YARA documentation. 

Vicente here. As described in a previous question, nothing changes when using VirusTotal to find a string within a PDF than when using YARA in any other environment.

Q: What are ways of hunting with YARA beyond virustotal? 

Vicente here. Hunting is a technique/art/discipline that can be done on any platform. Basically you just need a collection of data you can explore to find what you are looking for. Usually that would mean a collection of malware with as much data on top of it as possible. In VirusTotal we work hard to make it as convenient as possible.

MISC


Q: Also consider that you can tip off the threat actor that you have found their malware…

Vicente here. This is true. Many actors are interested in understanding how they are detected, which can become quite obvious when checking YARA rules publicly available. It is a cat and mouse game, but happily the fact that actors understand how they are being detected doesn’t mean they can avoid it in an easy or quick way.

Q: Can anyone explain why YARA was created with a unique schema (yet similar) compared to SQL? Do developers see YARA rules as a catch-all rule writing standard that will eventually become a standard for querying any data? I find YARA rules far easier to write/read than many formats, and it seems more modern, but the evolution is unclear.

Víctor here. The syntax was created with legibility in mind because YARA rules are intended to be created/consumed by humans. The idea was creating a language that looked more like a programming language than like SQL, in fact you can find reminiscences of C in YARA’s syntax. However, YARA doesn’t pretend to be a general purpose query language for usages outside the scope it was designed for. Any future enhancements in the syntax will be oriented towards improving expressiveness and legibility, but always within the boundaries of its current purpose.

Q: How do you communicate the importance and utility of incorporating threat hunting techniques (like writing Yara rules) to muggles? 

Vicente here. Threat Hunting is one of the best techniques we have. We use it to defend ourselves from current attacks, by expanding our visibility and establishing monitoring on threats targeting us. It is also one of the most powerful weapons we have to detect unknown threat activity.

Q: Maybe I'm jumping ahead, please forgive me if I do; but how do commercial Anti-Virus companies use YARA to determine whether a file is being malicious? How do you deal with false-positives? 

Vicente here. YARA is one of the methods or engines that AV companies might use to determine whether something is malicious. In essence it is not different from other methods, and depending how it is being used can lead to False Positives. As it usually happens, this has nothing to do with the tools used but with how solid the rules are, if they are being double-checked with any second method, if there is a reputation system in place, how good the heuristics are, etc. 



Tuesday, August 10, 2021

A Sneak Peek into VT Alerts

 360-degrees insights into your assets

One of the most sought-after use cases in VirusTotal is to find information about how our assets might be being abused. Is there any attacker using our brand for some spear phishing? Are there any malware samples using our infrastructure for malware distribution? Traditionally, many of our users monitored the use of their assets (such as domains, IPs and intellectual property) using YARA Livehunts © or VTI searches. Good news everyone! To help our users make this use case more straightforward, automated, and automatable (and also to have the results in a colourful and nice interface), we are introducing VT Alerts.

VT Alerts is designed to provide a feed-like experience providing different severity notifications on anything we detected in VirusTotal that matches your infrastructure and intellectual property. 

At launch, VTAlerts will be ready to help our customer monitor Corporate Infrastructure, Phishing attacks, and Brand Impersonation.

And that’s everything we have to say, for now. VTAlerts will be officially launched September 2021, and of course you will have all details here. In the meantime, if you are interested in joining our Beta program, please let us know. 

Happy hunting!


Monday, August 02, 2021

Introducing ‘Known Distributors’

 Providing more context about file provenance and distribution

These days many security operations center (SOC) teams are overwhelmed by huge volumes of alerts. Triaging these alerts takes too long, and many are never investigated at all. "Alert fatigue" leads analysts to take alerts less seriously than they should, resulting in missed threats and consummated breaches. 


One of VirusTotal’s main use cases is automatic security telemetry enrichment with the aim of performing alert triage. Indeed, VirusTotal is not only one of the largest and richest malware datasets in the world, over the years we have aggregated all sorts of security-relevant data points for files, URLs, domains and IPs, including goodware indicators and provenance details. As a result, many SOCs are using VirusTotal to perform automated false positive discarding, eradicating alert fatigue and making sure that their teams stay focused on relevant alerts enriched with superior context.


To make this use case more straightforward, today we are introducing Known Distributors, a new attribute for file objects that determines which companies/products a given file belongs to. Before having Known Distributors, we had multiple attributes in the file object to determine its origin, including:

  • nslr_info

  • trusted_verdicts

  • monitor_info

  • “software-collection” tag


Each one of these attributes was ingested from a different data source and had a different data format. For VirusTotal users it was difficult to spot the difference between the three since all had the same purpose. Now everything is unified under the same attribute, making its analysis and ingestion easier. 


Please note that the old attributes listed above will be deprecated as of January 1st, 2022. 


When to make use of this?


As said, many SOCs use VirusTotal for automatic false positive discarding. Here is when the Known Distributors property comes in handy. 


Let’s say you find the following hash 6d17958c6527346036f35c6d9db2f5c8d820cbfbd043588304c7beddf7ea8641 among a list of IoCs collected from a hypothetical incident. 


Querying VirusTotal’s API provides the following information about the sample:


$ curl --request GET \
  --url "https://www.virustotal.com/api/v3/files/6d17958c6527346036f35c6d9db2f5c8d820cbfbd043588304c7beddf7ea8641?attributes=known_distributors" \
  --header 'x-apikey:  <your API key>'
{
    "data": {
        "attributes": {
            "known_distributors": {
                "filenames": [
                    "MoUsoCoreWorker.exe"
                ],
                "products": [
                    "windows-cloud-windows-server-2004-dc-core-v20210608"
                ],
                "distributors": [
                    "Microsoft"
                ],
                "data_sources": [
                    "HashDB"
                ]
            }
        },
        "type": "file",
        "id": "6d17958c6527346036f35c6d9db2f5c8d820cbfbd043588304c7beddf7ea8641",
        "links": {
            "self": "https://www.virustotal.com/api/v3/files/6d17958c6527346036f35c6d9db2f5c8d820cbfbd043588304c7beddf7ea8641"
        }
    }
}     


Note: To get more information about these fields check out our API documentation


The information provided in the resulting JSON makes it easy to understand that Microsoft claims the file as part of Windows Server 2004. At VirusTotal’s web interface, the same information is presented to the user as follows:



In case of a file being distributed by multiple companies, it is shown as in the image below:


Where does all this data come from?


Known distributors data is ingested from the following data sources:


  • HashDB - An internal service that extracts files from base OS images and OS updates. 

  • National Software Reference Library (NSRL) - A project supported by the U.S. Department of Homeland Security, federal, state, and local law enforcement, and the National Institute of Standards and Technology (NIST) which is designed to collect software from various sources and incorporate file profiles computed from this software into a Reference Data Set (RDS) of information.

  • VT Monitor - A VirusTotal service that periodically scans files coming from software publishers to mitigate false positives.

  • Trusted Source project - A set of partnerships with key software vendors to appropriately tag files that they distribute, including those that are not signed.


We are also actively working on some partnerships with renowned software download portals so as to consolidate author and distribution metadata for software that they list. 


Why is this more relevant than ever?


ING’s global CISO recently pointed out that: “AI/ML can potentially solve issues with scalability in human analysis. Though it is working now [...], we do see challenges with AI/ML models that trigger on wrong assumptions. This results in many false positives in the security detection process, which need to be investigated by humans.


Indeed, at VirusTotal we have also seen how the inclusion of AI/ML in detection engines has led to more false positives, and, most importantly to increasing lack of context. Many detections these days do not include any malware family/toolkit label and since they are ML-powered, the analyst is provided with no additional information beyond a red flag, which in some cases might be misleading. By incorporating the Known Distributors details along with VirusTotal’s wealth of contextual information, security teams can overcome the shortcomings of noisy detection mechanisms that have yet to mature. 


We hope that these changes will ease the alert triage use case. Please let us know if you have any questions or suggestions to keep making VirusTotal better for everyone.


Happy hunting!