PDF Parser is a command-line tool and Go library that decrypts PDF files and extracts commands, files, JavaScript, text, and URLs. It also logs the formatting errors and abnormalities that are used to obfuscate malicious PDF files.
First, install Go.
Then install or update PDF Parser:
go get -u github.com/KarmaPenny/pdfparser
To uninstall PDF Parser, remove the installed binary and the downloaded source:
go clean -i github.com/KarmaPenny/pdfparser && rm -rf $(go env GOPATH)/src/github.com/KarmaPenny/pdfparser
To run the tests:
go test github.com/KarmaPenny/pdfparser/pdf
The following command extracts the contents of input.pdf to the output directory using "password" for decryption:
$(go env GOPATH)/bin/pdfparser -p password input.pdf output/
The following program extracts the contents of input.pdf to the output directory using "password" for decryption:
package main
import "github.com/KarmaPenny/pdfparser/pdf"
func main() {
pdf.Parse("input.pdf", "password", "output")
}
PDF Parser creates the following files in the output directory:
Commands run by launch actions are logged to the commands.txt file. Example:
cmd.exe /c hello.exe
calc.exe
The text content of the PDF is written to the contents.txt file.
Format errors and other abnormalities that are sometimes used to obfuscate malicious PDF files are logged to the errors.txt file. Below is an example errors.txt file containing the complete list of possible log messages:
invalid dictionary key type
invalid hex string character
invalid name escape character
invalid octal in string
missing dictionary value
unclosed array
unclosed dictionary
unclosed hex string
unclosed stream
unclosed string
unclosed escape in string
unclosed octal in string
unnecessary espace sequence in name
unnecessary espace sequence in string
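When triaging a sample, it can help to see which messages dominate errors.txt. The following is a minimal sketch, not part of PDF Parser itself. It assumes the file holds one message per line, as in the example above, and that a message may repeat when the same problem occurs more than once. The function name countMessages is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// countMessages tallies how often each non-empty line occurs in text.
// Assumption: errors.txt contains one log message per line.
func countMessages(text string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(text, "\n") {
		msg := strings.TrimSpace(line) // also drops a trailing \r
		if msg != "" {
			counts[msg]++
		}
	}
	return counts
}

func main() {
	// Sample input; in practice this would be the contents of errors.txt.
	sample := "unclosed string\ninvalid octal in string\nunclosed string\n"
	counts := countMessages(sample)

	msgs := make([]string, 0, len(counts))
	for m := range counts {
		msgs = append(msgs, m)
	}
	sort.Strings(msgs) // map iteration order is random, so sort for stable output
	for _, m := range msgs {
		fmt.Printf("%d\t%s\n", counts[m], m)
	}
}
```

To run this against real output, read output/errors.txt with os.ReadFile and pass string(data) to countMessages.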
The MD5 hash and file path of referenced embedded and external files are logged to the files.txt file. Embedded files are extracted to the output directory using the MD5 hash as the file name. The MD5 hash for external files is all zeros. Example:
6adb6f85e541f14d7ecec12a6af8ef65:hello.exe
00000000000000000000000000000000:C:\Windows\System32\calc.exe
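The hash:path layout above can be read back with a few lines of Go. This is a sketch under stated assumptions rather than an API that PDF Parser provides: the fileEntry type and the parseFileEntry function are hypothetical names, and the line format is inferred from the example. The line is split at the first colon only, because Windows paths contain colons of their own.

```go
package main

import (
	"fmt"
	"strings"
)

// externalHash is the all-zero MD5 that files.txt uses for external files.
var externalHash = strings.Repeat("0", 32)

// fileEntry is a hypothetical type describing one line of files.txt.
type fileEntry struct {
	Hash     string // 32-character MD5 hex digest
	Path     string // file path as referenced by the PDF
	External bool   // true when the hash is all zeros
}

// parseFileEntry splits a line at its first colon. It rejects lines whose
// first colon is not at index 32, the length of an MD5 hex digest.
func parseFileEntry(line string) (fileEntry, bool) {
	i := strings.IndexByte(line, ':')
	if i != 32 {
		return fileEntry{}, false
	}
	hash, path := line[:i], line[i+1:]
	return fileEntry{Hash: hash, Path: path, External: hash == externalHash}, true
}

func main() {
	lines := []string{
		"6adb6f85e541f14d7ecec12a6af8ef65:hello.exe",
		externalHash + `:C:\Windows\System32\calc.exe`,
	}
	for _, line := range lines {
		e, ok := parseFileEntry(line)
		if !ok {
			continue // skip lines that do not match the expected layout
		}
		fmt.Printf("%s external=%t\n", e.Path, e.External)
	}
}
```

For an embedded entry, the extracted file would then be found in the output directory under the name stored in Hash.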
The JavaScript of all actions is extracted to the javascript.js file.
A decrypted and decoded version of the PDF is written to the raw.pdf file.
All URLs referenced by actions are extracted to the urls.txt file. Example:
http://www.google.com
https://github.com/KarmaPenny
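A common follow-up step is to reduce urls.txt to the set of hosts a document refers to. The sketch below is one possible way to do that with the standard net/url package. It assumes one URL per line, as in the example above, and the function name uniqueHosts is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// uniqueHosts returns the sorted, de-duplicated host names found in text.
// Assumption: urls.txt contains one URL per line.
func uniqueHosts(text string) []string {
	seen := make(map[string]bool)
	for _, line := range strings.Split(text, "\n") {
		u, err := url.Parse(strings.TrimSpace(line))
		if err != nil || u.Hostname() == "" {
			continue // skip blank lines and anything without a host
		}
		seen[u.Hostname()] = true
	}
	hosts := make([]string, 0, len(seen))
	for h := range seen {
		hosts = append(hosts, h)
	}
	sort.Strings(hosts)
	return hosts
}

func main() {
	// Sample input; in practice this would be the contents of urls.txt.
	sample := "http://www.google.com\nhttps://github.com/KarmaPenny\n"
	for _, h := range uniqueHosts(sample) {
		fmt.Println(h)
	}
}
```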