PDF table scraper ----------------- This script attempts to extract the data of a table from a pdf file. It considers every single page of a pdf as a table, and attempts to make sense of it. The output should be much easier to parse and 'somehow clean', but a manual checking is required over the results. It currently exports the data as a .html (for visualization) as well as in .csv or in Python pickle form, for reuse in another script. ~/pdf_table_scraper$ ./pdf_table_scraper.py -h usage: pdf_table_scraper.py [-h] [--vskip VSKIP] [--page PAGE] [--html HTML] [--csv CSV] [--pickle PICKLE] [--tmp_xml TMP_XML] [-v] filename Extracts a table from a .pdf file positional arguments: filename the .pdf file optional arguments: -h, --help show this help message and exit --vskip VSKIP max vertical space between consecutive lines in the same paragraph (usually ~8) --page PAGE run the script on a specific page --html HTML A filename for html output --csv CSV A filename for csv output --pickle PICKLE A filename for Python .pickle output --tmp_xml TMP_XML A temporary XML file (output of pdftohtml) -v Increase the verbosity level