Monday, August 25, 2025

Applying AI Analysis to PDF Threats

In our previous post we extended VirusTotal Code Insights to browser extensions and supply-chain artifacts. A key finding from that analysis was how our AI could apply contextual knowledge to its evaluation. It wasn’t just analyzing code in isolation; it was correlating a package's stated purpose (its name and description) with its actual behavior, flagging malicious logic that contradicted its public description. We’re now applying the same idea to one of the most common file formats in the world: the PDF.



PDFs are multi-layered. There’s the object tree (catalog, pages, objects, streams, actions, embedded files) and there’s the visible layer (the text and images the user reads). Code Insights analyzes both, then correlates them: do the document's content, claims, and branding make sense given its internal behaviors? That lets us surface not only classic PDF exploitation (e.g., auto-actions, JS, external launches) but also pure social engineering (phishing, vishing, QR lures), even when the file has no executable logic.
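As a rough illustration of the structural side only, the following Python sketch counts a few PDF name tokens commonly associated with active behavior. It is not how Code Insights works: the function name `scan_pdf_tokens` and the token list are assumptions made for this example, and a raw byte scan like this misses names hidden inside compressed object streams or written with hex escapes.

```python
import re

# PDF name tokens often associated with active or risky behavior.
# Illustrative assumption, not the rule set used by Code Insights.
RISKY_TOKENS = [b"/OpenAction", b"/AA", b"/JavaScript", b"/JS", b"/Launch", b"/EmbeddedFile"]


def scan_pdf_tokens(data: bytes) -> dict:
    """Count occurrences of each risky name token in raw PDF bytes."""
    counts = {}
    for token in RISKY_TOKENS:
        # Require a non-alphanumeric character (or end of data) after the
        # token, so that "/JS" does not match a longer name such as "/JScript".
        pattern = re.escape(token) + rb"(?![A-Za-z0-9])"
        counts[token.decode("ascii")] = len(re.findall(pattern, data))
    return counts


if __name__ == "__main__":
    # Hand-written fragment for demonstration, not taken from any sample.
    sample = (
        b"1 0 obj << /Type /Catalog /OpenAction 2 0 R >> endobj "
        b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj"
    )
    for name, count in scan_pdf_tokens(sample).items():
        print(name, count)
```

A file with zero hits here can still be malicious, as the first four cases below show; the structural signal only becomes meaningful when it is read together with the visible content.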

Let's look at real-world samples surfaced by Code Insights during its initial testing phase. We'll start with cases where the PDF contains no malicious code, which traditional engines often miss because there's no executable payload to detect. This is where Code Insights proves useful, identifying clear signs of fraud and social engineering that aim to manipulate the user, not the machine.


Case 1 - Fake debt collection targeting financial fraud

This PDF is a real-world sample sent to VirusTotal and captured by Code Insights during early testing. It was flagged as malicious based entirely on its visible content, without relying on any embedded code or execution logic. The file was marked as clean by all other engines, likely because it contains no scripts, exploits, or embedded payloads.

d92a1a7460c580f8bf6af3cbd39c7840cfe6a146ee15ede8e23c50c2a85becb9

The document pretends to be a debt collection notice from a German agency acting on behalf of Amazon. It includes a formal layout, legal threats, payment instructions, and multiple references to German addresses and regulations. Visually, it looks legitimate.


However, the AI flagged it as fraudulent based on several critical inconsistencies, the most important being the destination bank account. The notice requests payment to an IBAN starting with BG, which indicates a Bulgarian account. This contradicts the sender's claimed German identity and would be highly unusual for a legitimate German debt agency. This mismatch alone was enough for Code Insights to classify the file as fraudulent. Additional content cues (urgent tone, fee breakdown, legal pressure) support the assessment.
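The country mismatch can be expressed as a simple deterministic check. The Python sketch below is only an illustration of that reasoning, not the logic Code Insights runs: the function names are hypothetical, and the IBANs are widely published documentation examples, not values from the sample. It validates the IBAN with the standard mod-97 check and then compares its two-letter country prefix with the country the sender claims.

```python
def normalize_iban(iban: str) -> str:
    """Remove whitespace and upper-case the IBAN."""
    return "".join(iban.split()).upper()


def is_valid_iban(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map
    letters to numbers (A=10 ... Z=35), and expect a remainder of 1."""
    s = normalize_iban(iban)
    if not (15 <= len(s) <= 34 and s.isascii() and s.isalnum()):
        return False
    if not (s[:2].isalpha() and s[2:4].isdigit()):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def iban_country_mismatch(claimed_country: str, iban: str) -> bool:
    """True when a well-formed IBAN belongs to a different country than claimed."""
    s = normalize_iban(iban)
    return is_valid_iban(s) and s[:2] != claimed_country.upper()


if __name__ == "__main__":
    # Published example IBANs, used here purely for demonstration.
    print(iban_country_mismatch("DE", "DE89 3704 0044 0532 0130 00"))
    print(iban_country_mismatch("DE", "BG80 BNBG 9661 1020 3456 78"))
```

A rule like this only covers one narrow inconsistency; the value of the AI analysis is that it notices the contradiction without anyone having written the rule in advance.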

As described in the Code Insights analysis:

“The visual and textual content confirms the document is a sophisticated phishing attack. It masquerades as an urgent payment demand from a German debt collection agency, supposedly on behalf of Amazon. The document employs high-pressure tactics, including threats of legal action, additional fees, and credit score damage, to compel the recipient to act quickly. The primary and most conclusive indicator of fraud is the demand for payment to a Bulgarian bank account, which is a stark and highly irregular contradiction to the agency's purported German location and registration.”

This is a case where AI adds value by reasoning over the content semantics, not the file structure.


Case 2 - QR-based phishing (quishing) campaign

This is another real-world PDF captured during early testing of Code Insights. At the time of analysis, no antivirus or malware detection engines flagged the file as malicious. The PDF has no embedded scripts, exploits, or execution logic. From a technical perspective, it looks benign.

259e202847d04866acd76427f53bfd9a15372ed6ed56a9e54ba1c62442c945ee

The visible content, however, impersonates an HR notification about a salary increase. It includes multiple social engineering red flags: awkward grammar, lack of personalization, and an irrelevant privacy disclaimer. The only call to action is a QR code, encouraging the recipient to scan it for more details.


Code Insights decoded the QR code and extracted the hidden URL. The domain is non-corporate and clearly unrelated to HR or payroll systems. The combination of deceptive HR messaging with a QR code that conceals a phishing URL confirms that the document is a credential-harvesting lure delivered via PDF.
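Once a QR code has been decoded, checking whether the URL belongs to the organization the document claims to come from is straightforward. The Python sketch below starts from an already decoded URL string (decoding the image itself needs an imaging library and is left out). The function name `is_expected_host` and the `example.com` domains are placeholders assumed for illustration.

```python
from urllib.parse import urlsplit


def is_expected_host(url: str, allowed_domains: set) -> bool:
    """True if the URL's host is one of the allowed domains or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Match on a full label boundary so "example.com.evil.test" is rejected.
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


if __name__ == "__main__":
    corporate = {"example.com"}  # hypothetical employer domain
    for decoded in (
        "https://hr.example.com/salary-review",
        "https://example.com.payroll-update.test/login",
    ):
        print(decoded, is_expected_host(decoded, corporate))
```

The comparison is done on the parsed hostname, not on the raw string, because lookalike URLs often embed the legitimate name as a prefix or path segment.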


Case 3 - Vishing via fake PayPal alert

This is another real-world PDF flagged by Code Insights during early evaluation. No antivirus or malware detection engines classified the file as malicious. Structurally, it’s simple and inert: there are no scripts, automatic actions, or embedded links. Minor stream decoding errors are present but considered low-risk anomalies.

d0bedc70085efff5218b901cdaba95d565df867495181544041ba4b8a6019cea


The threat lies entirely in the content. The document impersonates PayPal and trusted brands like Visa to deliver a fake security alert about a high-value unauthorized purchase. The language is urgent and designed to induce panic.

According to Code Insights:

“[...]the visual content of the document is a clear social engineering lure designed for a voice phishing (vishing) attack. [...] The document's sole purpose is to persuade the user to call a specific phone number under the pretense of canceling the fraudulent order. The malicious nature is confirmed by several red flags, including an awkwardly phrased greeting and a phone number with a geographic area code (808) that is deceptively labeled as "Toll-Free." This tactic aims to route the victim to a scammer for social engineering and potential fraud.”
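The "Toll-Free" inconsistency is easy to check mechanically, because North American toll-free numbers use a small set of dedicated area codes (800, 833, 844, 855, 866, 877, 888), while 808 is a geographic code. The Python sketch below is an assumption-laden illustration and not part of Code Insights: the function names are hypothetical and the phone numbers are fictional 555-01xx placeholders.

```python
import re
from typing import Optional

# Area codes reserved for toll-free service in the North American Numbering Plan.
TOLL_FREE_AREA_CODES = {"800", "833", "844", "855", "866", "877", "888"}


def nanp_area_code(number: str) -> Optional[str]:
    """Return the 3-digit area code of a NANP number, or None if it does not parse."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits[:3] if len(digits) == 10 else None


def false_toll_free_claim(label: str, number: str) -> bool:
    """True when text labels a number as toll-free but its area code is not."""
    area = nanp_area_code(number)
    text = label.lower()
    claims_toll_free = "toll-free" in text or "toll free" in text
    return claims_toll_free and area is not None and area not in TOLL_FREE_AREA_CODES


if __name__ == "__main__":
    print(false_toll_free_claim("Toll-Free", "+1 (808) 555-0100"))
    print(false_toll_free_claim("Toll-Free", "1-888-555-0100"))
```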


Case 4 - Fake Tax Refund from the Australian Taxation Office

As with previous cases, this PDF wasn’t flagged by any antivirus engine in VirusTotal, but Code Insights identified it as a phishing lure that impersonates the Australian Taxation Office.

b9b763e4b091bc59e9b9f355617622dbabdc1ff2de6707a94ccb26aa7682300e


As described by Code Insights:

“This document is a phishing lure designed to impersonate the Australian Taxation Office (ATO). The visual layer uses an authentic-looking government logo and the promise of a tax refund to entice the recipient into clicking an "Access Document" button. The purpose is to have the user provide an electronic signature for a supposed refund authorization, creating a sense of urgency and financial incentive. The document exhibits multiple red flags common to phishing attacks. These include a generic greeting, a suspicious reference to a .doc file (a common malware vector), instructions that discourage direct replies, and a complete lack of legitimate contact information or alternative methods for verification. The entire premise relies on tricking the user into clicking the button, which likely leads to a malicious website for credential theft or malware download.”


Case 5 - Auto-executing PDF Posing as a Movie Download

Unlike previous examples, this PDF was flagged by 13 antivirus engines in VirusTotal. In this case, the attack is embedded both in the internal structure of the file and its visual appearance. Code Insights correlates these two layers, the technical and the social, to expose the malicious intent.

44e653fe79d1ab160c784c06f4d99def6419e379ef3f802af9f48d595976d2c7


As described by Code Insights:

“The document presents a social engineering lure, masquerading as a download page for pirated movies […] to entice users into clicking links. This theme, centered on illegal content distribution, is a common tactic for malware delivery. Technical analysis of the PDF's internal structure corroborates the malicious intent. The file is configured with an /OpenAction command, a high-risk feature designed to automatically execute an action upon the document being opened […] The combination of a deceptive, high-risk theme with an automatic execution function indicates that the document’s purpose is to compromise the user's system.”

We are actively improving Code Insights based on what we learn from these early cases. PDF is the 6th most common file type submitted to VirusTotal, with around 100,000 new samples uploaded every day. That volume requires us to be strategic: for now, only a selected percentage of PDF files submitted via the public web interface are processed by Code Insights, as we test, tune, and scale the system.

These first results are helping us refine both effectiveness and performance. We’ll continue expanding coverage as we improve detection of threats.
