A Python wrapper for pdf-extract, a Java library for HTML extraction from PDF documents.
Dependencies:
- jpype
- chardet
The pdf-extract JAR files are fetched and bundled automatically when the package is built.
Check out the code:
git clone https://github.com/bitextor/python-pdfextract.git
cd python-pdfextract
virtualenv
virtualenv env
source env/bin/activate
pip install -r requirements.txt
python setup.py install
Fedora
sudo dnf install -y python2-jpype
sudo python setup.py install
You can also install directly with pip, without checking out the code or running setup.py explicitly:
pip
pip install python-pdfextract # Stable releases
pip install git+https://github.com/bitextor/python-pdfextract.git # master code
pip install git+https://github.com/bitextor/python-pdfextract.git@branchname # development "branchname" code
Make sure JAVA_HOME is set correctly, since jpype depends on this setting.
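A quick way to confirm the setting before using the wrapper is to inspect the environment from Python. The snippet below is only a minimal sketch: the helper name check_java_home is hypothetical and not part of python-pdfextract, and it only verifies that JAVA_HOME names an existing directory, not that the directory holds a usable JVM.

```python
import os


def check_java_home(environ=os.environ):
    """Return JAVA_HOME if it is set and names an existing directory, else None.

    Hypothetical helper for illustration; it is not part of python-pdfextract.
    """
    java_home = environ.get("JAVA_HOME")
    if java_home and os.path.isdir(java_home):
        return java_home
    return None


if __name__ == "__main__":
    path = check_java_home()
    if path is None:
        print("JAVA_HOME is unset or does not point to a directory")
    else:
        print("JAVA_HOME =", path)
```

If the check fails, export JAVA_HOME in your shell before importing the wrapper.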
from pdfextract.extract import Extractor
extractor = Extractor(pdf=your_pdf_data)
For more control, the Extractor also accepts keyword arguments that correspond to the PDFExtract command line options:
extractor = Extractor(pdf=your_pdf_data, keepBrTags=0, getPermission=0, logFilePath="", verbose=0, configFile="", timeout=0, sentenceJoinPath="", kenlmPath="")
Then, to extract relevant content:
extracted_html = extractor.extract()
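Putting the steps together, the sketch below reads a PDF from disk and returns the extracted HTML. It is an illustration under stated assumptions rather than part of the package: the function name extract_html_from_file and its extractor_cls parameter are hypothetical, and it assumes that the pdf argument accepts the raw bytes of the file.

```python
from pathlib import Path


def extract_html_from_file(pdf_path, extractor_cls=None):
    """Read a PDF file and return whatever Extractor.extract() produces.

    Hypothetical helper: extractor_cls exists only so that a stand-in class
    can be supplied; by default the real Extractor is used.
    """
    if extractor_cls is None:
        # Imported lazily so that jpype and the JVM are only needed here.
        from pdfextract.extract import Extractor as extractor_cls

    # Assumption: the pdf argument takes the raw bytes of the document.
    pdf_data = Path(pdf_path).read_bytes()
    extractor = extractor_cls(pdf=pdf_data)
    return extractor.extract()
```

With the package installed and JAVA_HOME set, a call such as extract_html_from_file("document.pdf") would run the same two steps shown above; the file name here is a placeholder.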