Experimental, use with care.
pd3f is a PDF text extraction pipeline that is self-hosted, local-first and Docker-based.
It reconstructs the original continuous text with the help of machine learning.
pd3f can OCR scanned PDFs with OCRmyPDF (Tesseract) and extracts tables with Camelot and Tabula.
It's built upon the output of Parsr.
Parsr detects hierarchies of text and splits the text into words, lines and paragraphs.
Even though Parsr brings some structure to the PDF, the text is still scrambled, e.g., due to hyphens. The underlying Python package pd3f-core tries to reconstruct the original continuous text by removing hyphens, new lines and/or spaces. It uses language models to guess what the original text looked like.
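To illustrate the idea, here is a minimal sketch of resolving end-of-line hyphens. It is not the pd3f-core implementation: where pd3f-core uses language models, this sketch substitutes a simple vocabulary lookup, and the names `join_lines` and `is_word` are hypothetical.

```python
from typing import Callable, List


def join_lines(lines: List[str], is_word: Callable[[str], bool]) -> str:
    """Merge extracted lines into continuous text.

    `is_word` is a stand-in for a language model: it decides whether
    the two fragments around a line-ending hyphen form one word.
    """
    text = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not text:
            text = line
        elif text.endswith("-"):
            # Build the word we would get if the hyphen were only a line-break artifact.
            left = text[:-1].rsplit(" ", 1)[-1]
            right = line.split(" ", 1)[0]
            candidate = (left + right).strip(".,;:!?")
            if is_word(candidate):
                text = text[:-1] + line  # drop the hyphen and the line break
            else:
                text = text + line  # keep the hyphen, drop only the line break
        else:
            text = text + " " + line  # ordinary line break becomes a space
    return text


vocabulary = {"document", "describes", "pipeline"}
result = join_lines(
    ["The docu-", "ment describes a self-", "hosted pipeline."],
    lambda word: word.lower() in vocabulary,
)
print(result)
```

A vocabulary lookup fails on words it has never seen, which is common for long German compounds. Scoring both candidates with a language model, as pd3f-core does, avoids needing a fixed word list.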
pd3f is especially useful for languages with long words such as German.
It was mainly developed to parse German letters and official documents.
Besides German, pd3f supports English, Spanish, French and Italian.
More languages will be added at a later stage.
pd3f includes a Web-based GUI and a Flask-based microservice (API).
You can find a demo at demo.pd3f.com.
Check out the full documentation at: https://pd3f.com/docs/
PDFs are hard to process, and extracting information from them is hard, so the results of this tool may not satisfy you. More work will go into improving this software but, altogether, it's unlikely that it will successfully extract all the information anytime soon.
Here are some things that will be improved.
- calculate runtime based on job.started_at and job.ended_at
- Get average runtime of jobs and store data in redis list
- NER
- entity linking
- extract keywords
- use textacy
- check if flair has model
- what to do if there is no fast model?
- simple client based on request
- send whole folders
- go beyond text
- reduce size
- repair PDF
- detect if scanned
- force to OCR again
- show uncertainty of ML model
- allow different log levels
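The runtime items at the top of this list could be sketched roughly as follows. This is only an illustration under assumptions: `started_at` and `ended_at` are taken to be `datetime` values as found on the job object, the helper names are hypothetical, and a plain Python list stands in for the Redis list.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def runtime_seconds(started_at: datetime, ended_at: datetime) -> float:
    """Runtime of one job in seconds."""
    return (ended_at - started_at).total_seconds()


def average_runtime(runtimes: List[float]) -> Optional[float]:
    """Mean of the stored runtimes, or None if nothing is stored yet."""
    return sum(runtimes) / len(runtimes) if runtimes else None


# Stand-in for the Redis list that would hold past runtimes.
stored_runtimes: List[float] = []

started = datetime(2020, 1, 1, 12, 0, 0)
stored_runtimes.append(runtime_seconds(started, started + timedelta(seconds=90)))
stored_runtimes.append(runtime_seconds(started, started + timedelta(seconds=30)))

print(average_runtime(stored_runtimes))
```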
- https://github.com/axa-group/Parsr
- https://github.com/jzillmann/pdf-to-markdown
- some PDF processing tools in my blog post
Install and use poetry.
Initially run:
./dev.sh --build
Omit --build if the Docker images do not need to be built.
Right now Docker + poetry is not able to cache the installs so building the image all the time is uncool.
If you have a question, found a bug or want to propose a new feature, have a look at the issues page.
Pull requests are especially welcome when they fix bugs or improve the code quality.
Affero General Public License 3.0