PDF Analysis

Several PDF analysis has already been done, I reassembled a lot of them with additional tips & tools here

PDF Format 📄

https://www.adobe.com/devnet/pdf/pdf_reference.html
https://blog.didierstevens.com/2008/04/09/quickpost-about-the-physical-and-logical-structure-of-pdf-files/
https://web.archive.org/web/20141010035745/http://gnupdf.org/Introduction_to_PDF

Tools list 🔧

Tool	URL
AnalyzePDF.py	https://github.com/hiddenillusion/AnalyzePDF
ByteForce	https://github.com/weaknetlabs/ByteForce
Caradoc	https://github.com/ANSSI-FR/caradoc
Didier Stevens suite	https://github.com/DidierStevens/DidierStevensSuite
dumppdf	https://packages.debian.org/jessie/python-pdfminer
forensics-all	https://packages.debian.org/jessie-backports/forensics-all
Origami	https://code.google.com/archive/p/origami-pdf/
ParanoiDF	https://github.com/patrickdw123/ParanoiDF
peepdf	https://github.com/jesparza/peepdf
PDF Xray	https://github.com/9b/pdfxray_public
pdf-parser	http://didierstevens.com/files/software/pdf-parser_V0_6_4.zip
pdf2jhon.py	https://github.com/magnumripper/JohnTheRipper/blob/unstable-jumbo/run/pdf2john.py
pdfcrack	https://packages.debian.org/jessie/pdfcrack
pdfextract	https://github.com/CrossRef/pdfextract
pdfobjflow.py	https://bitbucket.org/sebastiendamaye/pdfobjflow
pdfresurrect	https://packages.debian.org/jessie/pdfresurrect
PdfStreamDumper.exe	http://sandsprite.com/CodeStuff/PDFStreamDumper_Setup.exe
pdftk	https://packages.debian.org/en/jessie/pdftk
pdfxray_lite.py	https://github.com/9b/pdfxray_lite
poppler-utils	https://packages.debian.org/en/jessie/poppler-utils (pdftotext, pdfimages, pdftohtml, pdftops, pdfinfo, pdffonts, pdfdetach, pdfseparate, pdfsig, pdftocairo, pdftoppm, pdfunite)
pyew	https://packages.debian.org/en/jessie/pyew
qpdf	https://packages.debian.org/jessie/qpdf
swf_mastah.py	https://github.com/9b/pdfxray_public/blob/master/builder/swf_mastah.py

Existing list

http://blog.didierstevens.com/programs/pdf-tools/
https://github.com/sans-dfir/sift-files/tree/master/pdf-tools

Quick Analysis 🚀

Basic informations

$ file file.pdf
$ pdfinfo -box -meta -js -rawdates file.pdf

Displaying objects and actions structure

$ python pdfdid.py -aefv file.pdf

Search for /OpenAction /AA /Launch /GoTo /GoToR /SubmitForm /Richmedia (for Flash) /JS /JavaScript /URI - Encode - Cipher - Shell code - Obfuscation...

Automatically with ParanoiDF

$ python paranoiDF.py -fl file.pdf

Or with pdf-parser

$ python pdf-parser.py -v file.pdf

With an hexadecimal analyser

$ bless file.pdf

Extract files / scripts / Objects

pdf-parser to extract a js object for example

$ pdf-parser --object 32 --raw > extractedObject.js

pdfextract from Origami

$ pdfextract file.pdf

Online analysis

Beware to don't leak any important/professional/personnal data or to expose your research
https://www.hybrid-analysis.com/

Complete Analysis 🔎

Basic informations

$ file file.pdf
$ pdfinfo file.pdf
$ pdfinfo -box -meta -js -rawdates file.pdf

Powerfull Python tool to analyze PDF and exploit

$ pyew file.pdf

Other Python tool to explore PDF

$ peepdf -fl file.pdf
$ peepdf --interactive file.pdf

Analysis under Windows

PDF Stream Dumper
https://github.com/dzzie/pdfstreamdumper

Metadata

Get metadata

$ exiftool -a -u -g2 file.pdf

Get metadata recursivly from current directory

$ exiftool -r -ext pdf .

Change an element

$ exiftool -Title="New title" file.pdf

Remove metadata

$ exiftool -all= file.pdf && exiftool -all:all= file.pdf && qpdf --linearize file.pdf filewithoutmeta.pdf
$ mat file.pdf # latest version of mat doesn't support pdf format anymore...

Remove metadata recursively from the current directory : Very dirty but work well The filename must not have space at the moment, the commande will be optimized

$ find . -name "*.pdf" -print0 | while read -d $'\0' file; do echo ${file:2} && mv ${file:2} ${file:2}.pdf && exiftool -all= ${file:2}.pdf && exiftool -all:all= ${file:2}.pdf && qpdf --linearize ${file:2}.pdf ${file:2} && rm ${file:2}.pdf && rm ${file:2}.pdf_original; done

Search for older versions

Search for older "hidden" versions

$ pdfresurrect file.pdf -i
$ exiftool -pdf-update:all= file.pdf

Online Analysis

Name	URL
Malwr	https://malwr.com/submission/
Hybrid analysis	https://www.hybrid-analysis.com/
Malware Tracker	https://www.malwaretracker.com/pdf.php
VirusTotal	http://www.virustotal.com/
PDF examiner	http://www.pdfexaminer.com/
Document Analyzer	http://www.document-analyzer.net/
Jotti	https://virusscan.jotti.org/
PDF X-ray	http://www.pdfxray.com/
PDF Online	https://www.pdf-online.com/
Extract PDF	http://www.extractpdf.com
Char conversion	https://kt.pe/tools.html#conv/

Statistics

Calcul byte statistics, entropy min and max, ASCII count, ... from a PDF

$ python byte-stats.py file.pdf

Visual analysis

Visual analysis of a PDF or a binary file
http://binvis.io

Go deeper in the analysis

Displaying objects and actions structure

$ python pdfid.py --all --extra --force --verbose file.pdf

Map of the objects flows

$ pdf-parser file.pdf | ./pdfobjflow
$ eog pdfobjflow.png

Actions

Search for :
/OpenAction /AA specifies the script or action to run automatically.
/Names /AcroForm /Action can also specify and launch scripts or actions.
/JavaScript specifies JavaScript to run.
/GoTo changes the view to a specified destination within the PDF or in another PDF file.
/Launch a program or opens a document.
/URI accesses a resource by its URL.
/SubmitForm /GoToR can send data to URL.
/RichMedia can be used to embed Flash in PDF.
/ObjStm can hide objects inside an Object Stream.
/JavaScript > /J#61vaScript Beware on obfuscation technique with hex codes

With ParanoiDF

$ python paranoiDF.py -fl file.pdf

With pdf-parser

$ python pdf-parser.py -v file.pdf

With an hexadecimal analyser

$ bless file.pdf

With dumppdf

$ dumppdf -a file.pdf

Compression

Search for compression

$ strings file.pdf | grep --color "/Filter"

2 ways to decompress a PDF

$ pdftk compressed.pdf output uncompressed.pdf uncompress
$ qpdf --stream-data=uncompress compressed.pdf uncompressed.pdf

Embeded files

4 ways to search for embeded files/scripts inside a PDF

$ binwalk file.pdf
$ foremost -a -v file.pdf
$ hachoir-subfile file.pdf
$ scalpel file.pdf

Extract files / scripts / objects

Extract file corresponding to object ID, jpg for example

$ dumppdf.py -i 32 -r file.pdf > image.jpg

Extract js from an object for example

$ pdf-parser --object 32 --raw > extractedObject.js

pdfextract from Origami

$ pdfextract file.pdf

Conversion

PDF to Postscript

$ pdftops file.pdf

PDF to TXT

$ pdftotext file.pdf

PDF to JPG

$ convert file.pdf image.jpg

Non-exhaustive list of possible conversion

LZWDecode filter

Convert a PDF to Postscript without the LZWDecode filter

$ qpdf --stream-data=uncompress original.pdf decoded.pdf # Decompress it
$ pdftops decoded.pdf decoded.ps # Convert it

Encryption

PDF supports RC4 encryption (40 to 128 bits keys) and AES (128 to 256 with the Extension Level 3).
Beware with empty password.

Password recovering

Brute force a PDF with pdfcrack

$ pdfcrack -w yourDictionnary.txt file.pdf

With john

$ pdf2john.py file.pdf > x.hash
$ john --wordlist=yourDictionnary.txt x.hash

Javascript

2 ways to search for Javascript

$ pdf-parser --search=JavaScript file.pdf 
$ pdfinfo -js file.pdf

Extract an object With jsunpack

$ jsunpack-extractjs file.pdf

With pdf-parser

$ pdf-parser --object 32 --raw file.pdf > file.js

With pdfextract from Origami

$ pdfextract --js file.pdf

De-obfuscate

https://github.com/urule99/jsunpack-n

Online :
http://jsunpack.jeek.org/java/

Malzilla and SpiderMonkey can also help deobfuscate JavaScript.
Malzilla :
http://www.malzilla.org/downloads.html
SpiderMonkey :
http://www.didierstevens.com/files/software/js-1.7.0-mod.tar.gz
More details coming soon.

Add Javascript to PDF

https://didierstevens.com/files/software/make-pdf_V0_1_6.zip
https://neonprimetime.blogspot.fr/2015/03/how-to-add-javascript-to-pdf.html

Disarming a PDF

$ python pdfid.py --disarm file.pdf

Flash

Search for flash

$ python pdf-parser.py --search flash file.pdf

Extract flash with swf_mastah

$ python swf_mastah.py -f file.pdf -o ./
$ file *.swf

With pdf-parser

$ pdf-parser.py --object 32 --filter --raw file.pdf > flashFile.swf
$ file flashFile.swf

Analysing flash program

$ swfdump -Ddu flashFile.swf > flashFile.txt

More details coming soon.

Sources ℹ️

https://blog.didierstevens.com/category/pdf/
http://www.decalage.info/file_formats_security/pdf
https://zeltser.com/analyzing-malicious-documents/
https://code.google.com/archive/p/corkami/wikis/PDFTricks.wiki
https://www.sans.org/reading-room/whitepapers/malicious/owned-malicious-pdf-analysis-33443
https://digital-forensics.sans.org/blog/2009/12/14/pdf-malware-analysis/
http://fileformats.archiveteam.org/wiki/PDF

zbetcheckin/PDF_analysis

PDF Analysis

PDF Format 📄

Tools list 🔧

Existing list

Quick Analysis 🚀

Basic informations

Displaying objects and actions structure

Search for /OpenAction /AA /Launch /GoTo /GoToR /SubmitForm /Richmedia (for Flash) /JS /JavaScript /URI - Encode - Cipher - Shell code - Obfuscation...

Extract files / scripts / Objects

Online analysis

Complete Analysis 🔎

Basic informations

Powerfull Python tool to analyze PDF and exploit

Other Python tool to explore PDF

Analysis under Windows

Metadata

Search for older versions

Online Analysis

Statistics

Visual analysis

Go deeper in the analysis

Displaying objects and actions structure

Map of the objects flows

Actions

Compression

Embeded files

Extract files / scripts / objects

Conversion

LZWDecode filter

Encryption

Password recovering

Javascript

De-obfuscate

Add Javascript to PDF

Disarming a PDF

Flash

Sources ℹ️