pdf2excel: A JavaScript repository from samsmith

license: BSD


For programmers:
	- json of the raw parse is dumped into all the excel comment fields where there's stuff in the text - you can just ignore the actual content

	- if you're careful, you can, possibly, try getting with a full size bounding box for page 0 and it might work (for some PDFs), which means you can automate this in 2 get requests. We'll work on this simple. The code you get back is the MD5 hex checksum of the pdf you ask it for.

	 -It requires the pdftohtml to be in /usr/local/bin (the one with the -xml option - pdf<i>2</i>html wont do), and the following perl modules:
		- use CGI qw/param/;
        	- use LWP::Simple;
        	- use File::Temp;
        	- use Digest::MD5 qw/md5_hex/;
        	- use XML::Simple;
        	- use JSON qw/to_json/;
        	- use Spreadsheet::WriteExcel;

	- Please don't hit this script with automated scrapers yet. At some point soon, I'll stick it on github so you can run it yourself once I'm happy it's actually substantially ok and passes the minimum stuff you need it to pass (at the moment, it has a couple of major missing bits).


contact: pdf2excel.pl@msmith.net
samsmith/pdf2excel