tika vs pdftotext
michamilz opened this issue · 2 comments
Could you please explain why you chose Tika for text extraction from PDF files? pdftotext is a lot faster.
Currently, simply because Tika has an HTTP API available. We're using http://givemetext.okfnlabs.org as a hosted Tika instance.
I also experimented with pdftotext/pdftohtml, especially for table recognition (see #96), but Tika also includes Tesseract for OCR-ing the scanned PDFs we get from some federal states.
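For anyone reading along, here is a rough sketch of what extraction over Tika's HTTP API can look like. It assumes a standard Tika server, which accepts a `PUT` to `/tika` and returns plain text when asked via the `Accept` header. The base URL `http://localhost:9998` is the default for a locally started Tika server and is only a placeholder; the exact path on the hosted instance mentioned above may differ. The function names are made up for illustration and are not taken from this project's code.

```python
import urllib.request


def build_tika_request(pdf_bytes, base_url="http://localhost:9998"):
    """Build (but do not send) a PUT request for a Tika server's /tika endpoint.

    base_url is an assumption: 9998 is the default port of a local Tika server.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            # Ask Tika for plain text instead of its default XHTML output.
            "Accept": "text/plain",
        },
    )


def extract_text_via_tika(pdf_path, base_url="http://localhost:9998", timeout=60):
    """Send a PDF file to the Tika server and return the extracted text."""
    with open(pdf_path, "rb") as fh:
        request = build_tika_request(fh.read(), base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text_via_tika("document.pdf")` needs a reachable Tika server; whether scanned PDFs get OCR-ed depends on Tesseract being installed and enabled on that server.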
Thank you. Good to know about this service.
For a similar website I use a multistep extraction. The first try is always pdftotext. If pdftotext returns an empty text string, I run tesseract. Files from Word and Excel are converted using uniconv before running pdftotext. This works well for almost all files.
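The fallback described above could be sketched roughly like this. This is not the actual code from that website, only an illustration under a few assumptions: `pdftotext` and `pdftoppm` (both from poppler-utils) and `tesseract` are on the PATH, and since tesseract reads images rather than PDFs, the pages are rendered to PNG first. The 300 dpi resolution and all function names are arbitrary choices for the example. The Word/Excel conversion step is left out.

```python
import subprocess
import tempfile
from pathlib import Path


def run_pdftotext(pdf_path):
    """Extract the embedded text layer; '-' makes pdftotext write to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def run_tesseract(pdf_path):
    """OCR a scanned PDF: render pages to PNG with pdftoppm, then OCR each page."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # 300 dpi is an assumed resolution, not a value from the thread.
        subprocess.run(
            ["pdftoppm", "-r", "300", "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        pages = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            ocr = subprocess.run(
                ["tesseract", str(image), "stdout"], capture_output=True, check=True
            )
            pages.append(ocr.stdout.decode("utf-8", errors="replace"))
        return "\n".join(pages)


def extract_text(pdf_path, primary=run_pdftotext, fallback=run_tesseract):
    """Try the fast extractor first; fall back to OCR only if it finds no text."""
    text = primary(pdf_path)
    # strip() also removes the form feeds pdftotext emits between pages,
    # so a scanned PDF with no text layer counts as empty.
    if text.strip():
        return text
    return fallback(pdf_path)
```

Passing the two extractors as parameters keeps the decision logic separate from the command-line tools, so the fallback rule can be checked without either tool installed.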