PDF Analysis

Many PDF analyses have already been published. This page gathers a number of them, together with additional tips and tools.

PDF Format 📄


https://www.adobe.com/devnet/pdf/pdf_reference.html
https://blog.didierstevens.com/2008/04/09/quickpost-about-the-physical-and-logical-structure-of-pdf-files/
https://web.archive.org/web/20141010035745/http://gnupdf.org/Introduction_to_PDF

Tools list 🔧

Tool URL
AnalyzePDF.py https://github.com/hiddenillusion/AnalyzePDF
ByteForce https://github.com/weaknetlabs/ByteForce
Caradoc https://github.com/ANSSI-FR/caradoc
Didier Stevens suite https://github.com/DidierStevens/DidierStevensSuite
dumppdf https://packages.debian.org/jessie/python-pdfminer
forensics-all https://packages.debian.org/jessie-backports/forensics-all
Origami https://code.google.com/archive/p/origami-pdf/
ParanoiDF https://github.com/patrickdw123/ParanoiDF
peepdf https://github.com/jesparza/peepdf
PDF Xray https://github.com/9b/pdfxray_public
pdf-parser http://didierstevens.com/files/software/pdf-parser_V0_6_4.zip
pdf2john.py https://github.com/magnumripper/JohnTheRipper/blob/unstable-jumbo/run/pdf2john.py
pdfcrack https://packages.debian.org/jessie/pdfcrack
pdfextract https://github.com/CrossRef/pdfextract
pdfobjflow.py https://bitbucket.org/sebastiendamaye/pdfobjflow
pdfresurrect https://packages.debian.org/jessie/pdfresurrect
PdfStreamDumper.exe http://sandsprite.com/CodeStuff/PDFStreamDumper_Setup.exe
pdftk https://packages.debian.org/en/jessie/pdftk
pdfxray_lite.py https://github.com/9b/pdfxray_lite
poppler-utils https://packages.debian.org/en/jessie/poppler-utils (pdftotext, pdfimages, pdftohtml, pdftops, pdfinfo, pdffonts, pdfdetach, pdfseparate, pdfsig, pdftocairo, pdftoppm, pdfunite)
pyew https://packages.debian.org/en/jessie/pyew
qpdf https://packages.debian.org/jessie/qpdf
swf_mastah.py https://github.com/9b/pdfxray_public/blob/master/builder/swf_mastah.py

Existing lists

http://blog.didierstevens.com/programs/pdf-tools/
https://github.com/sans-dfir/sift-files/tree/master/pdf-tools

Quick Analysis 🚀

Basic information

$ file file.pdf
$ pdfinfo -box -meta -js -rawdates file.pdf
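
As a rough illustration of what the first check relies on, the sketch below reads the `%PDF-x.y` header from raw bytes. The function name `pdf_header_version` is hypothetical, and the 1024-byte search window is an assumption based on the tolerance documented for Acrobat; other readers may be stricter or looser.

```python
import re


def pdf_header_version(data: bytes):
    """Return the version from a %PDF-x.y header, or None if none is found.

    Assumption: the header may sit anywhere in the first 1024 bytes,
    which is the tolerance documented for Acrobat.
    """
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    # Tiny in-memory sample instead of a real file.
    sample = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n"
    print(pdf_header_version(sample))
```

A header that is missing, or preceded by unrelated bytes, is a cheap first hint that the file deserves a closer look.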

Displaying objects and actions structure

$ python pdfid.py -aefv file.pdf

Search for /OpenAction, /AA, /Launch, /GoTo, /GoToR, /SubmitForm, /RichMedia (for Flash), /JS, /JavaScript and /URI, and look for signs of encoding, encryption, shellcode and obfuscation.

Automatically with ParanoiDF

$ python paranoiDF.py -fl file.pdf

Or with pdf-parser

$ python pdf-parser.py -v file.pdf

With a hexadecimal editor

$ bless file.pdf

Extract files / scripts / Objects

Use pdf-parser to extract an object, for example one containing JavaScript

$ pdf-parser --object 32 --raw file.pdf > extractedObject.js

pdfextract from Origami

$ pdfextract file.pdf

Online analysis

Be careful not to leak sensitive, professional or personal data, or to expose your research, when you upload a file to an online service.
https://www.hybrid-analysis.com/

Complete Analysis 🔎

Basic information

$ file file.pdf
$ pdfinfo file.pdf
$ pdfinfo -box -meta -js -rawdates file.pdf

Powerful Python tool to analyze PDF files and exploits

$ pyew file.pdf

Another Python tool to explore PDF files

$ peepdf -fl file.pdf
$ peepdf --interactive file.pdf

Analysis under Windows

PDF Stream Dumper
https://github.com/dzzie/pdfstreamdumper

Metadata

Get metadata

$ exiftool -a -u -g2 file.pdf

Get metadata recursively from the current directory

$ exiftool -r -ext pdf .

Change an element

$ exiftool -Title="New title" file.pdf

Remove metadata

$ exiftool -all= file.pdf && exiftool -all:all= file.pdf && qpdf --linearize file.pdf filewithoutmeta.pdf
$ mat file.pdf # latest version of mat doesn't support pdf format anymore...

Remove metadata recursively from the current directory. This is quick and dirty, but it works. The variables are quoted so that file names containing spaces are handled.

$ find . -name "*.pdf" -print0 | while IFS= read -r -d '' file; do echo "$file" && mv "$file" "$file.pdf" && exiftool -all= "$file.pdf" && exiftool -all:all= "$file.pdf" && qpdf --linearize "$file.pdf" "$file" && rm "$file.pdf" "$file.pdf_original"; done

Search for older versions

Search for older "hidden" versions

$ pdfresurrect file.pdf -i
$ exiftool -pdf-update:all= file.pdf
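
To give an idea of why older versions can survive in a file: an incremental update appends new data after the existing content and ends with its own `%%EOF` marker. The sketch below cuts the file at each marker. It is a simplified illustration, not how pdfresurrect is implemented, and the name `split_revisions` is hypothetical. Note that a linearized PDF normally contains two `%%EOF` markers without having been updated.

```python
def split_revisions(data: bytes):
    """Return one candidate revision per %%EOF marker.

    Each candidate is the file content from the start up to and
    including that marker, i.e. the file as it may have looked
    before later incremental updates were appended.
    """
    marker = b"%%EOF"
    revisions = []
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        revisions.append(data[:end])
        pos = data.find(marker, end)
    return revisions


if __name__ == "__main__":
    sample = b"%PDF-1.4\noriginal body\n%%EOF\nappended update\n%%EOF\n"
    for index, revision in enumerate(split_revisions(sample), start=1):
        print(index, len(revision))
```

Each candidate can be written to its own file and opened separately to compare the versions.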

Online Analysis

Name URL
Malwr https://malwr.com/submission/
Hybrid analysis https://www.hybrid-analysis.com/
Malware Tracker https://www.malwaretracker.com/pdf.php
VirusTotal http://www.virustotal.com/
PDF examiner http://www.pdfexaminer.com/
Document Analyzer http://www.document-analyzer.net/
Jotti https://virusscan.jotti.org/
PDF X-ray http://www.pdfxray.com/
PDF Online https://www.pdf-online.com/
Extract PDF http://www.extractpdf.com
Char conversion https://kt.pe/tools.html#conv/

Statistics

Compute byte statistics (minimum and maximum entropy, ASCII count, ...) for a PDF

$ python byte-stats.py file.pdf
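
For readers who want to see what such statistics look like, here is a minimal sketch of Shannon entropy and a printable-ASCII count over raw bytes. It is not the byte-stats.py implementation; the function names and the 256-byte window size are assumptions chosen for the example.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())


def window_entropies(data: bytes, size: int = 256):
    """Entropy of each full, non-overlapping window of `size` bytes."""
    return [byte_entropy(data[i:i + size])
            for i in range(0, len(data) - size + 1, size)]


def printable_ascii_count(data: bytes) -> int:
    """Number of bytes in the printable ASCII range 0x20..0x7e."""
    return sum(1 for b in data if 0x20 <= b <= 0x7E)


if __name__ == "__main__":
    sample = b"%PDF-1.4\n" + bytes(range(256)) * 2
    windows = window_entropies(sample)
    print(byte_entropy(sample), min(windows), max(windows),
          printable_ascii_count(sample))
```

Windows with entropy close to 8 bits per byte usually hold compressed or encrypted data, while plain PDF syntax scores much lower.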

Visual analysis

Visual analysis of a PDF or a binary file
http://binvis.io

Go deeper in the analysis

Displaying objects and actions structure

$ python pdfid.py --all --extra --force --verbose file.pdf

Map of the objects flows

$ pdf-parser file.pdf | ./pdfobjflow.py
$ eog pdfobjflow.png

Actions

Search for:
/OpenAction /AA specify the script or action to run automatically.
/Names /AcroForm /Action can also specify and launch scripts or actions.
/JavaScript specifies JavaScript to run.
/GoTo changes the view to a specified destination within the PDF or in another PDF file.
/Launch launches a program or opens a document.
/URI accesses a resource by its URL.
/SubmitForm /GoToR can send data to a URL.
/RichMedia can be used to embed Flash in a PDF.
/ObjStm can hide objects inside an object stream.
Beware of name obfuscation with hex codes: /JavaScript can be written /J#61vaScript.
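
The keywords above, including their hex-obfuscated forms, can be counted with a few lines of Python. This is only a sketch of the idea: it scans raw bytes, so it sees nothing inside compressed streams or object streams and may also count names that appear inside strings. `decode_name` and `count_keywords` are hypothetical helper names.

```python
import re

SUSPICIOUS = [b"/OpenAction", b"/AA", b"/Launch", b"/JS", b"/JavaScript",
              b"/URI", b"/SubmitForm", b"/GoToR", b"/RichMedia", b"/ObjStm"]

# A name starts with "/" and runs until whitespace or a PDF delimiter.
NAME_RE = re.compile(rb"/[^\s/<>\[\]()%{}]+")


def decode_name(raw: bytes) -> bytes:
    """Replace #xx hex escapes, so /J#61vaScript becomes /JavaScript."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), raw)


def count_keywords(data: bytes) -> dict:
    """Count exact matches of the suspicious names after decoding."""
    counts = {keyword: 0 for keyword in SUSPICIOUS}
    for name in NAME_RE.findall(data):
        decoded = decode_name(name)
        if decoded in counts:
            counts[decoded] += 1
    return counts


if __name__ == "__main__":
    sample = (b"<< /Type /Catalog /OpenAction 5 0 R >>\n"
              b"<< /S /J#61vaScript /JS (app.alert(1)) >>")
    for keyword, hits in count_keywords(sample).items():
        if hits:
            print(keyword.decode(), hits)
```

Decompressing the file first gives the scan a chance to see names hidden in streams.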

With ParanoiDF

$ python paranoiDF.py -fl file.pdf

With pdf-parser

$ python pdf-parser.py -v file.pdf

With a hexadecimal editor

$ bless file.pdf

With dumppdf

$ dumppdf -a file.pdf

Compression

Search for compression

$ strings file.pdf | grep --color "/Filter"

Two ways to decompress a PDF

$ pdftk compressed.pdf output uncompressed.pdf uncompress
$ qpdf --stream-data=uncompress compressed.pdf uncompressed.pdf 
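
When neither tool is at hand, streams compressed with the common /FlateDecode filter can be inflated with Python's zlib module. The sketch below is deliberately naive: it locates streams with a regular expression instead of parsing objects, ignores /Length and filter chains, and silently skips anything zlib cannot read. `inflate_streams` is a hypothetical name.

```python
import re
import zlib

# Naive stream locator; a real parser would honour /Length instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def inflate_streams(data: bytes):
    """Return the inflated content of every stream zlib can decode."""
    results = []
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates a slightly truncated stream tail.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed, or damaged
        if inflated:
            results.append(inflated)
    return results


if __name__ == "__main__":
    payload = b"BT /F1 12 Tf (Hello) Tj ET"
    sample = (b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
              + zlib.compress(payload) + b"\nendstream\nendobj\n")
    print(inflate_streams(sample))
```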

Embedded files

Four ways to search for embedded files or scripts inside a PDF

$ binwalk file.pdf
$ foremost -a -v file.pdf
$ hachoir-subfile file.pdf
$ scalpel file.pdf
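
The core idea behind these carving tools is to look for known file signatures (magic bytes) at any offset. The sketch below shows that idea with a handful of signatures. The table and the name `find_signatures` are assumptions made for the example, and the real tools know far more formats and validate what they find.

```python
# Small, hypothetical signature table: magic bytes -> description.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP archive",
    b"CWS": "SWF (zlib-compressed)",
}


def find_signatures(data: bytes):
    """Return sorted (offset, description) pairs for every signature hit."""
    hits = []
    for magic, label in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, label))
            pos = data.find(magic, pos + 1)
    return sorted(hits)


if __name__ == "__main__":
    sample = b"%PDF-1.4\n" + b"\xff\xd8\xff\xe0" + b"xx" + b"PK\x03\x04"
    for offset, label in find_signatures(sample):
        print(hex(offset), label)
```

Short signatures such as `CWS` produce false positives, and embedded data that sits inside a compressed stream only shows up after decompression.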

Extract files / scripts / objects

Extract the stream of a given object ID, for example a JPEG image

$ dumppdf.py -i 32 -r file.pdf > image.jpg

Extract JavaScript from an object, for example

$ pdf-parser --object 32 --raw file.pdf > extractedObject.js

pdfextract from Origami

$ pdfextract file.pdf

Conversion

PDF to PostScript

$ pdftops file.pdf

PDF to TXT

$ pdftotext file.pdf

PDF to JPG

$ convert file.pdf image.jpg

This list of possible conversions is not exhaustive.

LZWDecode filter

Convert a PDF to PostScript without the LZWDecode filter

$ qpdf --stream-data=uncompress original.pdf decoded.pdf # Decompress it
$ pdftops decoded.pdf decoded.ps # Convert it

Encryption

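PDF supports RC4 encryption (40- to 128-bit keys) and AES encryption (128-bit keys, or 256-bit keys with Adobe Extension Level 3).
Beware of empty passwords: a file can be encrypted with an empty user password, in which case it opens without prompting.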
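
Which scheme a given file uses is recorded in its encryption dictionary, referenced by /Encrypt in the trailer. The sketch below pulls out the /V and /R numbers and any /CFM crypt filter methods (/V2 for RC4, /AESV2 for AES-128, /AESV3 for AES-256). It is a simplified illustration that assumes the dictionary is an uncompressed indirect object; `encryption_info` is a hypothetical name.

```python
import re


def encryption_info(data: bytes):
    """Return V, R and CFM values of the encryption dictionary.

    Returns None when no indirect /Encrypt reference is found and an
    empty dict when the referenced object cannot be located.
    """
    ref = re.search(rb"/Encrypt\s+(\d+)\s+(\d+)\s+R", data)
    if ref is None:
        return None
    obj_re = (rb"(?<!\d)" + ref.group(1) + rb"\s+" + ref.group(2)
              + rb"\s+obj(.*?)endobj")
    obj = re.search(obj_re, data, re.DOTALL)
    if obj is None:
        return {}
    body = obj.group(1)

    def number(key: bytes):
        found = re.search(rb"/" + key + rb"\s+(\d+)", body)
        return int(found.group(1)) if found else None

    return {"V": number(b"V"), "R": number(b"R"),
            "CFM": re.findall(rb"/CFM\s*/(\w+)", body)}


if __name__ == "__main__":
    sample = (b"12 0 obj\n<< /Filter /Standard /V 4 /R 4 /Length 128 "
              b"/CF << /StdCF << /CFM /AESV2 /Length 16 >> >> >>\nendobj\n"
              b"trailer\n<< /Root 1 0 R /Encrypt 12 0 R >>\n")
    print(encryption_info(sample))
```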

Password recovery

Run a dictionary attack on a PDF with pdfcrack

$ pdfcrack -w yourDictionary.txt file.pdf

With john

$ pdf2john.py file.pdf > x.hash
$ john --wordlist=yourDictionary.txt x.hash

JavaScript

Two ways to search for JavaScript

$ pdf-parser --search=JavaScript file.pdf 
$ pdfinfo -js file.pdf

Extract JavaScript with jsunpack

$ jsunpack-extractjs file.pdf

With pdf-parser

$ pdf-parser --object 32 --raw file.pdf > file.js

With pdfextract from Origami

$ pdfextract --js file.pdf

De-obfuscate

https://github.com/urule99/jsunpack-n

Online :
http://jsunpack.jeek.org/java/

Malzilla and SpiderMonkey can also help deobfuscate JavaScript.
Malzilla :
http://www.malzilla.org/downloads.html
SpiderMonkey :
http://www.didierstevens.com/files/software/js-1.7.0-mod.tar.gz
More details coming soon.

Add JavaScript to a PDF

https://didierstevens.com/files/software/make-pdf_V0_1_6.zip
https://neonprimetime.blogspot.fr/2015/03/how-to-add-javascript-to-pdf.html

Disarming a PDF

$ python pdfid.py --disarm file.pdf

Flash

Search for Flash

$ python pdf-parser.py --search flash file.pdf

Extract Flash with swf_mastah

$ python swf_mastah.py -f file.pdf -o ./
$ file *.swf

With pdf-parser

$ pdf-parser.py --object 32 --filter --raw file.pdf > flashFile.swf
$ file flashFile.swf
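
An extracted SWF file is often zlib-compressed: its signature is then `CWS` instead of `FWS`, and everything after the 8-byte header is deflated. The sketch below normalizes such a file to its uncompressed form so that it can be searched with plain string tools. `unpack_swf` is a hypothetical name, and LZMA-compressed files (`ZWS`) are not handled.

```python
import struct
import zlib


def unpack_swf(data: bytes) -> bytes:
    """Return the uncompressed (FWS) form of an SWF file.

    Header layout: 3-byte signature, 1-byte version, 4-byte
    little-endian length of the uncompressed file.
    """
    signature = data[:3]
    if signature == b"FWS":
        return data
    if signature == b"CWS":
        return b"FWS" + data[3:8] + zlib.decompress(data[8:])
    raise ValueError("unsupported or missing SWF signature: %r" % signature)


if __name__ == "__main__":
    body = b"\x00" * 20
    plain = b"FWS" + bytes([9]) + struct.pack("<I", 8 + len(body)) + body
    packed = b"CWS" + plain[3:8] + zlib.compress(body)
    unpacked = unpack_swf(packed)
    declared = struct.unpack("<I", unpacked[4:8])[0]
    print(unpacked[:3], declared == len(unpacked))
```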

Analyze the Flash program

$ swfdump -Ddu flashFile.swf > flashFile.txt

More details coming soon.

Sources ℹ️

https://blog.didierstevens.com/category/pdf/
http://www.decalage.info/file_formats_security/pdf
https://zeltser.com/analyzing-malicious-documents/
https://code.google.com/archive/p/corkami/wikis/PDFTricks.wiki
https://www.sans.org/reading-room/whitepapers/malicious/owned-malicious-pdf-analysis-33443
https://digital-forensics.sans.org/blog/2009/12/14/pdf-malware-analysis/
http://fileformats.archiveteam.org/wiki/PDF