/verbaendeliste-bundestag

Python script that extracts information from the pdftohtml XML output of the Bundestag lobby list PDF.


Verbaendeliste-Bundestag Extractor

Use pdftohtml to generate an XML file from the PDF.

pdftohtml -xml input.pdf lobbylist.xml

Then run the extractor with the first and last relevant page numbers as arguments. It reads the XML from standard input and writes the parsed JSON to standard output:

python extract_lobby.py 4 690 < lobbylist.xml > lobbylist.json
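To illustrate what the two page arguments do, here is a minimal sketch of filtering pdftohtml XML by page range. It is not the code in extract_lobby.py: the function name `texts_in_page_range` and the sample data are hypothetical, and it only assumes the usual pdftohtml layout of `<page number="...">` elements containing `<text>` elements.

```python
import xml.etree.ElementTree as ET


def texts_in_page_range(xml_source, first_page, last_page):
    """Return (page_number, text) pairs for pages first_page..last_page inclusive.

    Hypothetical helper for illustration; the real extractor may work differently.
    """
    root = ET.fromstring(xml_source)
    results = []
    for page in root.iter("page"):
        number = int(page.get("number"))
        if not first_page <= number <= last_page:
            continue  # skip cover pages, index, appendix, etc.
        for text in page.iter("text"):
            # itertext() also collects text nested in <b> or <i> children
            content = "".join(text.itertext()).strip()
            if content:
                results.append((number, content))
    return results


# Made-up sample in the shape of pdftohtml -xml output (not real lobby list data)
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
<text top="100" left="80" width="200" height="16" font="0">Title page</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1262" width="892">
<text top="100" left="80" width="200" height="16" font="0"><b>Verband A</b></text>
</page>
<page number="3" position="absolute" top="0" left="0" height="1262" width="892">
<text top="100" left="80" width="200" height="16" font="0">Verband B</text>
</page>
<page number="4" position="absolute" top="0" left="0" height="1262" width="892">
<text top="100" left="80" width="200" height="16" font="0">Index</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for number, content in texts_in_page_range(SAMPLE, 2, 3):
        print(number, content)
```

With real input, the same function could be given the contents of standard input (for example `sys.stdin.read()`) together with the page numbers taken from the command line.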

Here is the extracted JSON (as of 15 June 2012).

License: MIT License