jesparza/peepdf

Allow export of all JavaScript code into a compound log

forkrul opened this issue · 7 comments

It would be useful to not only be able to use the interactive mode for manual checking but to batch dump all JavaScript code from the cli.

Use case scenario, various documents need to be inspected (possibly hundreds), so interactive inspection will take too long.

Perhaps I am missing a rather simple mechanism to do this?

A basic and simple way to get what you want ;)

 from PDFCore import PDFParser
 pdfparser = PDFParser()
 ret, pdf = pdfparser.parse(<pdf_file>, forceMode=True, looseMode=True)
 js = pdf.getJavascriptCode(None)

By the way, I'd be okay to make a PR to implement an optionnal arg to have exhaustive output with javascript code and uri in the json/xml output.
@jesparza thoughts ?

Hi! I silently introduced that last week but I did not have time to announce it, besides it was not extensively checked.

6853b7c

Update and take a look, please! :) The command is extract, so you could do:

PPDF> extract js > all_my_js_code.txt

I checked this and it is nice, but I was talking about being able to do like
peepdf.py malicious_file.pdf --json --extract > all_my_js_code.txt
My goal is to integrate peepdf in an automated malicious file analysis system. As I mentionned above, I am able to do it while using peepdf 'as a lib'. But while doing it I would prefer to collaborate to your project =)

Yep, that was also in the todo, hehe. Check this pull request:

nikita-sah@0e16f3f

I still have to take a look, hopefully somewhere this week, but it looks good ;) You are more than welcome to collaborate, of course! :)

Talking about the JSON/XML output, I think it would be more elegant separating the command implementation from the PDFConsole to something like PDFCommands and then use the functions there in the PDFConsole, but I think need some time to check the best way to add this JSON output ;) If you have idea, share them, please :)

Of course, we could pass a parameter to PDFConsole saying that we want the JSON output, but probably not the most elegant way, but the fastest and maybe an option to consider...

To actually analyse further away the pdf, I would be glad to get the same json than provided by python peepdf.py --json but with also the full js code. The result is for sure not to output by default to the --json arg, it would be just a flood of the shell.
Once we get this "full" output, it would be possible to know from which object come from each peace of code, and also to have object referenced in this object (and it would be convenient to also get the content of those objects). Then it would be easier to "rebuild" the malicious behaviour while following those references from the /OpenAction (or /AA) to whatever referenced stuff...
I precise that I am not so used to PDF analysis, so I think it's relevant to be able to follow references (at least from suspicious objects) but feel free to bash me =D

Anyway, for now I just want to focus on getting a "full" output. In this purpose I checked code and I noticed that to create info about a file (like the output provided by python peepdf.py <pdf_file> : stats + vuln check) you have pretty much 4 different code which does this. GetPeepJson, GetPeepXML, PDFConsole.do_info and default code exectued by peepdf.py if no arg provided.
So I would like to create a function like getInfo() which will basically provide the same output and, optionnaly, all js code, all uri. and precise from which objects does those stuff come from.
And I think it would be nice to refactor your code : use this new getInfo() in GetPeepJson, GetPeepXml, in the console and in peepdf.py. In this way GetPeepJson could look approximately like :

def GetPeepJson():
    return json.dumps(getInfo(exhaustiveOuput=False))

Of the course, the thre other code portion will need more tweak than GetPeepJson(), it is just to give you an idea.

Thoughts ?

Well, there is already a function used by the functions you mention (GetPeepJson, GetPeepXML, PDFConsole.do_info) and it is called getStats(). This function returns a dictionary with the information about the file and these functions just create the specific output. Probably, it would be better if we would structure the code in a different way, especially thinking in executing commands from the shell, but I think we will come to that point after some more important tweaks ;)

Thanks for taking the time to check this ;) I will close this issue because the main problem is solved with the new "execute" command and I will work on the Issue 39.