tika vs pdftotext

Question

michamilz opened this issue 9 years ago · 2 comments

Could you please explain why you chose tika for text extraction from PDF files? pdftotext is a lot faster.

Answer 1 · 2016-01-06T15:05:01.000Z

Currently, simply because tika has an HTTP API available. We're using http://givemetext.okfnlabs.org as a hosted tika instance.
I also experimented with pdftotext/pdftohtml, especially for table recognition (see #96), but tika also includes tesseract for OCR-ing the scanned PDFs we get from some federal states.
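
For readers who have not used tika over HTTP, here is a minimal sketch of what such a call can look like. It assumes a standard Tika Server, which accepts a `PUT` to `/tika` and returns plain text when asked for `text/plain`. The function names and the `base_url` parameter are illustrative, and the exact path on a hosted instance such as the one above may differ. Whether scanned pages get OCR-ed depends on how the server is configured.

```python
import urllib.request


def build_tika_request(base_url: str, pdf_bytes: bytes) -> urllib.request.Request:
    """Build a PUT request for the /tika endpoint of a Tika Server.

    base_url is an assumption, e.g. a locally started server.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",             # ask for extracted text only
            "Content-Type": "application/pdf",  # tell Tika what we are sending
        },
    )


def extract_text_via_tika(base_url: str, pdf_bytes: bytes, timeout: float = 60.0) -> str:
    """Send the PDF to the server and return the extracted text."""
    request = build_tika_request(base_url, pdf_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Usage would be along the lines of `extract_text_via_tika("http://localhost:9998", open("doc.pdf", "rb").read())`, with the URL pointing at whatever instance is available.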

Answer 2 · 2016-01-08T07:33:38.000Z

Thank you. Good to know about this service.

For a similar website I use a multistep extraction. The first try is always pdftotext. If pdftotext returns an empty text string, I run tesseract. Files from Word and Excel are converted using uniconv before running pdftotext. This works well for almost all files.
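
The fallback described above could be sketched roughly like this. It is not the actual code behind that website, only an illustration under a few assumptions: the `pdftotext`, `pdftoppm` and `tesseract` command-line tools are installed, pages are rendered to PNG at 300 dpi before OCR (tesseract works on images, so some rendering step is needed), and the function names are made up for this example. The Word/Excel conversion step is left out.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def pdftotext(pdf_path: str) -> str:
    """Return the text layer of a PDF; "-" makes pdftotext write to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_pdf(pdf_path: str) -> str:
    """Render each page to PNG with pdftoppm, then OCR it with tesseract."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        # 300 dpi is an assumed resolution, not something from the thread.
        subprocess.run(
            ["pdftoppm", "-r", "300", "-png", pdf_path, prefix], check=True
        )
        pages = []
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ["tesseract", str(image), "stdout"],
                capture_output=True, text=True, check=True,
            )
            pages.append(result.stdout)
    return "\n".join(pages)


def extract_text(
    pdf_path: str,
    text_layer: Callable[[str], str] = pdftotext,
    ocr: Callable[[str], str] = ocr_pdf,
) -> str:
    """Try the fast text-layer extraction first; fall back to OCR if empty."""
    text = text_layer(pdf_path)
    if text.strip():  # whitespace or form feeds alone count as empty
        return text
    return ocr(pdf_path)
```

Passing the two steps in as parameters keeps the fallback logic separate from the external tools, so either step can be swapped out (for example, for a tika call) without touching the rest.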