robbi5/kleineanfragen

tika vs pdftotext

michamilz opened this issue · 2 comments

Could you please explain why you chose Tika for text extraction from PDF files? pdftotext is a lot faster.

Currently, simply because Tika has an HTTP API available. We're using http://givemetext.okfnlabs.org as a hosted Tika instance.
I also experimented with pdftotext/pdftohtml, especially for table recognition (see #96), but Tika also includes Tesseract for OCR-ing the scanned PDFs we get from some federal states.
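
For readers who have not used Tika over HTTP, here is a minimal sketch of what such a call can look like. It assumes a Tika server reachable at `http://localhost:9998` (the usual default for a locally started tika-server) that accepts a `PUT` to `/tika`; the hosted instance mentioned above may use a different path. The function names `build_tika_request` and `extract_with_tika` are made up for this example and are not taken from the kleineanfragen code.

```python
import urllib.request


def build_tika_request(pdf_bytes: bytes,
                       base_url: str = "http://localhost:9998") -> urllib.request.Request:
    """Build a PUT request that asks a Tika server for plain text.

    base_url is an assumption; point it at whatever Tika instance you use.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",            # ask for extracted text, not XHTML
            "Content-Type": "application/pdf",
        },
    )


def extract_with_tika(pdf_bytes: bytes,
                      base_url: str = "http://localhost:9998") -> str:
    """Send the PDF to the Tika server and return the extracted text."""
    request = build_tika_request(pdf_bytes, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Usage would be along the lines of `extract_with_tika(open("answer.pdf", "rb").read())`, which needs a running server to return anything.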

Thank you. Good to know about this service.

For a similar website I use a multistep extraction. The first attempt is always pdftotext. If pdftotext returns an empty string, I run tesseract. Files from Word and Excel are converted with unoconv before running pdftotext. This works well for almost all files.
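
A rough sketch of that fallback chain, not the actual code behind either site: try each extractor in order and keep the first non-empty result. The helper names are hypothetical. The OCR step assumes `pdftoppm` (from poppler-utils) and the `tesseract` CLI are installed, since tesseract reads images rather than PDFs. The Word/Excel conversion step is left out.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Iterable

Extractor = Callable[[str], str]


def pdftotext_extract(pdf_path: str) -> str:
    """Run `pdftotext <file> -`, which prints the extracted text to stdout."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def ocr_extract(pdf_path: str) -> str:
    """Render the pages to PNG with pdftoppm, then OCR each one with tesseract."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        subprocess.run(["pdftoppm", "-r", "300", "-png", pdf_path, prefix],
                       check=True)
        for image in sorted(Path(tmp).glob("page-*.png")):
            # "stdout" as the output base makes tesseract print the text.
            result = subprocess.run(["tesseract", str(image), "stdout"],
                                    capture_output=True, text=True, check=True)
            pages.append(result.stdout)
    return "\n".join(pages)


def extract_text(pdf_path: str,
                 extractors: Iterable[Extractor] = (pdftotext_extract, ocr_extract)) -> str:
    """Return the first non-blank result, or "" if every extractor comes up empty."""
    for extractor in extractors:
        text = extractor(pdf_path)
        if text.strip():
            return text
    return ""
```

Passing the extractors as a parameter keeps the fast path first and makes it easy to swap in another fallback, such as a Tika call.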