PDFrate is designed to complement existing malicious document detection mechanisms, such as signature matching and dynamic analysis systems. PDFrate is capable of detecting malicious documents, including previously unseen variants, without relying on extensive parsing or execution of the documents.
In addition to identifying malicious documents, PDFrate seeks to accurately separate opportunistic attacks, which represent the overwhelming majority of malicious activity, from targeted attacks, whose apparent goal is espionage against a small number of specific victims.
The principles behind PDFrate's operation will be expounded at ACSAC 2012 in the paper entitled Malicious PDF Detection using Metadata and Structural Features. At a high level, PDFrate uses the metadata and structural features of documents to classify them using machine learning.
This site provides ratings from classifiers trained on multiple data sets. The quality of each rating depends on how well the training data aligns with the PDF to be classified. To learn more about the data sets used, see the Data Sources page.
Possibly the best way to describe how PDFrate works is with examples. See the examples below and please upload your own documents.
This example demonstrates the differences between the various classification outcomes. These examples are taken from the contagio data set.
The first is a typical, if somewhat old, broad-based malware PDF. It simply exploits CVE-2010-0188 and downloads additional malicious content.
Viewing the PDFrate report shows that this is most likely malicious.
The contagio classifier rates this nearly perfectly: 99.9% malicious and 0% targeted, indicating that this is a malicious PDF, from an opportunistic threat.
Some of the other classifiers don't give as precise a rating. This is due to differences in the training data used by the various classifiers.
Note that while this document is from the same contagio data set used for training, this particular document wasn't included in the training set.
Also note the very small size and the relatively small amount of metadata extracted from the document. Documents with less metadata and structure are generally more difficult to classify correctly because there is less data to go on, but the classifiers still do fine on this sample.
The second is a typical targeted attack PDF. This PDF exploits CVE-2009-4324. Upon successful exploitation, it decodes malware from within itself. Viewing the PDFrate report shows that this is most likely malicious and, furthermore, that it is likely a targeted PDF. Note that this PDF is much larger and that it has more metadata and structure for the classifier to operate on. This is the type of PDF that PDFrate was designed to detect.
If the exploit in the targeted PDF is successful, in addition to malware, it "drops" a benign document for the user to view. The third document is this dropped, benign PDF. The report shows this PDF to be benign. An important note is that since this is benign, the secondary classifier (targeted) is not applicable. The secondary classifier only makes sense if the PDF is malicious.
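The relationship between the two ratings can be sketched in a few lines. This is only an illustration of the rule described above (the targeted rating is consulted only when the document is rated malicious); the function name and the 0.5 cut-off are assumptions made for the example, not values used by PDFrate.

```python
def interpret_rating(malicious, targeted, threshold=0.5):
    """Combine two scores in [0, 1] into a single label.

    `threshold` is a hypothetical cut-off chosen for illustration only.
    """
    if malicious < threshold:
        # The targeted score is not applicable to a benign document.
        return "benign"
    if targeted >= threshold:
        return "malicious (targeted)"
    return "malicious (opportunistic)"


# Scores similar to the first example: 99.9% malicious, 0% targeted.
print(interpret_rating(0.999, 0.0))
# A benign document: the targeted score is ignored, whatever its value.
print(interpret_rating(0.02, 0.8))
```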
We see in these examples some variance in results from the various classifiers. This helps demonstrate some of the benefits (and tradeoffs) of PDFrate when compared to other detection tools, such as antivirus. PDFrate doesn't give a straight yes-or-no answer like AV. It gives a softer answer in the form of a rating. It is more accurate on samples that are similar to what it was trained on, but it can still do a good job, with lower confidence, on samples that differ from the training data. Signature-based systems, by contrast, totally fail on novel samples.
As a result of the way PDFrate works, it's really up to the user to determine what threshold they wish to use and which data sets best fit the data they are analyzing. The colors are set at arbitrary levels which are generally good for low false positive rate detection, but certainly aren't ideal for everyone. The descriptions of the data sources are given to help, but the data set with the best fit to your data is probably best determined by testing samples from your environment on the various classifiers.
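One way to approach the threshold choice is to rate a set of known-benign and known-malicious samples from your own environment and compare outcomes at several candidate thresholds. The sketch below assumes ratings are expressed as numbers between 0 and 1; the function name and all scores are made up for illustration and are not PDFrate output.

```python
def rates_at_threshold(benign_scores, malicious_scores, threshold):
    """Return (false_positive_rate, detection_rate) at a given threshold."""
    false_positives = sum(1 for s in benign_scores if s >= threshold)
    detections = sum(1 for s in malicious_scores if s >= threshold)
    return (false_positives / len(benign_scores),
            detections / len(malicious_scores))


# Made-up scores standing in for ratings of locally collected samples.
benign = [0.01, 0.05, 0.10, 0.30, 0.60]
malicious = [0.40, 0.70, 0.90, 0.95, 0.99]

for t in (0.25, 0.50, 0.75):
    fpr, dr = rates_at_threshold(benign, malicious, t)
    print(f"threshold={t:.2f} false positives={fpr:.0%} detections={dr:.0%}")
```

Raising the threshold lowers the false positive rate at the cost of missed detections; repeating the comparison per classifier gives a rough sense of which data set fits a given environment best.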
This example demonstrates PDFrate's effectiveness on PDF documents that are very similar but not identical (presumably to evade signature detection). These were detected by PDFrate in an operational environment. They are typical of the documents sent through email which evade both SPAM detection and email gateway antivirus.
These two documents are so similar that they are considered variants of the same malicious document. They have been named after their respective document titles. One can see in the PDFrate reports that the document metadata and structure are very similar, yet distinctly different in some ways. For example, most of the structural elements as well as some metadata, including the PDFID and CreateDate values, are basically identical. On the other hand, the rest of the metadata items, such as the Title, Subject, Author, and Creator, are different and appear pseudo-random. Also, scrutinizing the "Additional Information" section of the Virustotal report, one sees that the ssdeep hashes have a large amount of overlap. Presumably, these differences are introduced to evade signature-based detection.
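The comparison described above can be sketched as a field-by-field check of two metadata records. The field names follow the text, but every value below is invented, and the helper is a hypothetical illustration rather than part of PDFrate.

```python
def compare_fields(doc_a, doc_b):
    """Split the shared field names into identical and differing lists."""
    shared = doc_a.keys() & doc_b.keys()
    same = sorted(k for k in shared if doc_a[k] == doc_b[k])
    different = sorted(k for k in shared if doc_a[k] != doc_b[k])
    return same, different


# Invented values; only the pattern (stable vs. randomized fields) follows the text.
variant_1 = {"PDFID": "id-0001", "CreateDate": "2012-01-01",
             "Title": "qWzr81", "Author": "Lk2pA"}
variant_2 = {"PDFID": "id-0001", "CreateDate": "2012-01-01",
             "Title": "Tn04xv", "Author": "z9QdE"}

same, different = compare_fields(variant_1, variant_2)
print("identical:", same)
print("different:", different)
```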
PDFrate views these documents as nearly identical and rates them correctly as opportunistic malicious. The contagio classifier does a relatively poor job of determining whether this is an opportunistic or targeted threat because its training set is fairly old.
On the other hand, these documents have very different detection with AV. Not only were they essentially undetected by AV at the time they were sent, but they are also detected by different AV signatures.
This example demonstrates that the mechanisms used in PDFrate provide a strong complement to traditional signature matching. The specimens have small differences in content that allow them to evade AV, while the consistency of their structure and metadata allows PDFrate to detect them handily. The two approaches also reinforce each other. For example, the types of features used in PDFrate make clustering or linking these two documents practical. Once clustered, it may be possible to create signatures that provide consistent coverage across these variants.
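As a rough illustration of linking variants, documents can be grouped by the values of fields that stay constant across them. The sketch below assumes PDFID and CreateDate are the stable fields, as observed for these two specimens; the function name and all sample values are hypothetical, and a practical clustering approach would likely use many more features.

```python
from collections import defaultdict


def group_by_stable_fields(documents, stable_fields=("PDFID", "CreateDate")):
    """Group document names by the values of fields assumed to stay constant."""
    groups = defaultdict(list)
    for name, fields in documents.items():
        key = tuple(fields.get(f) for f in stable_fields)
        groups[key].append(name)
    return dict(groups)


# Invented records: two variants of one document and one unrelated document.
documents = {
    "variant_1": {"PDFID": "id-0001", "CreateDate": "2012-01-01", "Title": "qWzr81"},
    "variant_2": {"PDFID": "id-0001", "CreateDate": "2012-01-01", "Title": "Tn04xv"},
    "unrelated": {"PDFID": "id-0777", "CreateDate": "2011-06-15", "Title": "Invoice"},
}

for key, names in group_by_stable_fields(documents).items():
    print(key, names)
```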
On the flip side, one could imagine a scenario where direct evasion of PDFrate is employed. This could become a reality if mechanisms similar to PDFrate are widely used. However, evading PDFrate directly is likely either to increase the attacker's effort or to make signature generation easier.
Another observation is that PDFrate's features are selected to characterize arbitrary values such that the classifier isn't reliant on repetition of static strings. For example, for metadata items such as the Title, features such as the number of characters and the types of characters used (upper case, lower case, numerals, etc.) are used. These features provide strong classification, even if the exact values change. This distinguishes PDFrate from many other machine learning based document and malware detection mechanisms which rely on repetition of exact values, such as Bayesian SPAM detection, document text clustering, and n-gram based malware detection.
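A minimal sketch of this kind of feature, assuming a Title is summarized by its length and character-class counts, is shown below. The function and feature names are illustrative and are not taken from PDFrate itself.

```python
def title_features(title):
    """Summarize a string by length and character classes, not by its exact value."""
    return {
        "length": len(title),
        "upper": sum(c.isupper() for c in title),
        "lower": sum(c.islower() for c in title),
        "digits": sum(c.isdigit() for c in title),
        "other": sum(not c.isalnum() for c in title),
    }


# Two different pseudo-random titles (invented for this example).
print(title_features("qWzr81"))
print(title_features("tNxv04"))
```

Although the two strings share no common substring that a signature could match, they produce the same feature values, which is why features of this kind can survive randomization of the exact text.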
This example demonstrates some qualities of the metadata and structure extraction capabilities that make PDFrate possible. This example uses three files from the contagio set.
The quality of the classification PDFrate provides is heavily dependent on the quality of the features extracted from the documents. An important aspect of PDFrate is the mechanisms whereby features can be extracted reliably and accurately. The value of the metadata and structural elements extracted certainly isn’t limited to use in automated classification; much of the information extracted for the machine learner could also be used for other applications, including manual analysis.
The metadata and structure extraction capabilities used by PDFrate provide the following benefits:
The first specimen stumps many metadata extraction tools that I've used. However, the PDFrate report contains a large amount of extracted information.
Many tools can at least successfully extract the Producer from the second specimen. However, the PDFrate report shows other vital metadata, including filenames such as “sploit.swf” and “heapsray.swf”, which are worth noting. While the machine learner doesn’t immediately recognize these as evil the way an analyst does, the generalized features based on this data are important for classification.
Most tools ignore some of the artifacts found in this specimen. They successfully extract the professed values for Title, Producer, Creator, document identifiers, dates, etc. PDFrate, on the other hand, also extracts the other values for these fields. This data provides strong forensic information that could be used by a manual analyst to help explain by whom, when, and how the document was created. In turn, the machine learner in PDFrate is able to use features based on this data to help classify the document.
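The text does not describe how PDFrate locates these additional values. One plausible reading is that a file can carry more than one value for the same field, and that scanning the raw bytes finds all of them rather than only the one a reader would display. The sketch below illustrates that idea under loud simplifying assumptions: it handles only simple literal strings for a single key, ignores escaped parentheses, hex strings, and compressed streams, and uses a toy byte string rather than a real PDF. The pattern and function names are hypothetical.

```python
import re

# Matches simple literal strings such as "/Title (Report)"; an assumed simplification.
TITLE_PATTERN = re.compile(rb"/Title\s*\(([^()]*)\)")


def all_title_values(raw_pdf_bytes):
    """Return every /Title literal found in the raw bytes, in file order."""
    return [m.decode("latin-1") for m in TITLE_PATTERN.findall(raw_pdf_bytes)]


# A toy byte string imitating a file whose metadata was rewritten once.
sample = (
    b"1 0 obj << /Title (Original draft) /Producer (Tool A) >> endobj\n"
    b"5 0 obj << /Title (Quarterly report) /Producer (Tool B) >> endobj\n"
)

values = all_title_values(sample)
print("all values:", values)
print("last value:", values[-1])
```

A tool that reports only one value per field would hide the earlier one, which is the kind of information an analyst might use to reason about how a document was produced.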