Add to README "Sorry, your PDF file is image-based; it does not have any embedded text. It might have been scanned from paper... Tabula isn't able to extract any data from image-based PDFs. Click the Help button for more information."

Question

Closed this issue 4 years ago · 2 comments

https://zatech.slack.com/archives/C019Y5J754K/p1601720577120900

zoid:progress-pride-flag: Today at 12:22 PM
Sorry, your PDF file is image-based; it does not have any embedded text. It might have been scanned from paper... Tabula isn't able to extract any data from image-based PDFs. Click the Help button for more information.
ruh-roh

JD Bothma 2 minutes ago
We need to OCR this file first
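
A rough way to check up front whether a PDF has no embedded text, sketched in Python. This is an assumption about how one might test for it, not how Tabula itself decides; it relies on poppler's pdftotext being installed, and the function names are made up for illustration.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return the embedded text of a PDF via poppler's pdftotext.

    The "-" argument asks pdftotext to write to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def needs_ocr(extracted_text: str) -> bool:
    """True when the extracted text is empty apart from whitespace.

    pdftotext separates pages with form feeds, which strip() also removes,
    so an image-only PDF ends up as an empty string here.
    """
    return extracted_text.strip() == ""
```

Calling needs_ocr(extract_text("scan.pdf")) on a scanned file should then say whether OCR is worth running first; "scan.pdf" is a placeholder name.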

JD Bothma 1 minute ago
you can use this https://gist.github.com/jbothma/1245ac0ba9c51a8df6676b475ed47fc9

zoid:progress-pride-flag: 1 minute ago
You're a hero

zoid:progress-pride-flag: 1 minute ago
Ty 🙏

JD Bothma 1 minute ago
since tesseract 4 is now stable you can drop the docker stuff

JD Bothma 1 minute ago
are the pages all the right way up?

JD Bothma 1 minute ago
that trips tesseract up
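
To check page orientation before OCR, Tesseract has an orientation and script detection mode (--psm 0). The sketch below builds that command and reads the rotation out of its report. It assumes the osd trained data is installed and that the report contains a "Rotate:" line; the helper names are hypothetical, and the sample text in the usage note is only illustrative.

```python
import re
from typing import List, Optional


def osd_command(image_path: str) -> List[str]:
    """Build a Tesseract call that only detects orientation and script.

    "stdout" as the output base asks Tesseract to print the report
    rather than write it to a file.
    """
    return ["tesseract", image_path, "stdout", "--psm", "0"]


def parse_rotation(osd_text: str) -> Optional[int]:
    """Return the suggested rotation in degrees, or None if the report has no Rotate line."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None
```

For a report such as "Orientation in degrees: 180\nRotate: 180\n", parse_rotation returns 180, meaning the page image should be turned before it goes through OCR.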

JD Bothma < 1 minute ago
the output file -OCRd.pdf is what you then want to put into Tabula
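
For anyone who would rather see the steps than run the gist, here is a minimal Python sketch of one way such a pipeline can work. It is not the gist's code. It assumes poppler's pdftoppm and pdfunite plus Tesseract are installed, that English ("eng") is the right language, and that "-OCRd.pdf" is a suffix added to the input file name; all function names are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List


def ocr_output_path(pdf: Path) -> Path:
    """report.pdf -> report-OCRd.pdf (assumed naming convention)."""
    return pdf.with_name(pdf.stem + "-OCRd.pdf")


def rasterize_command(pdf: Path, prefix: Path, dpi: int = 300) -> List[str]:
    """pdftoppm writes one PNG per page, named <prefix>-<page number>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_command(image: Path, lang: str = "eng") -> List[str]:
    """The trailing "pdf" config makes Tesseract write <output base>.pdf with a text layer."""
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang, "pdf"]


def page_number(image: Path) -> int:
    """page-07.png -> 7, so pages sort numerically whatever the zero padding."""
    return int(image.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf: Path) -> Path:
    """Rasterize, OCR each page, and merge the results next to the input file."""
    out = ocr_output_path(pdf)
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_command(pdf, prefix), check=True)
        images = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        for image in images:
            subprocess.run(tesseract_command(image), check=True)
        page_pdfs = [str(image.with_suffix(".pdf")) for image in images]
        if len(page_pdfs) == 1:
            # Nothing to merge for a single page; just copy it out.
            shutil.copyfile(page_pdfs[0], out)
        else:
            subprocess.run(["pdfunite", *page_pdfs, str(out)], check=True)
    return out
```

Running ocr_pdf(Path("report.pdf")) would leave report-OCRd.pdf beside the original, and that file is the one to load into Tabula.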

Answer 1 · 2020-10-03T10:26:31.000Z

of course that script is for very techy people

If you have good PDF OCR software you can use that.

If you don't, ask someone for help.

Answer 2 · 2020-10-03T10:29:52.000Z

Dropped the Docker bit in the gist:

https://gist.github.com/zoidbergwill/e48ddeab1552c868a4c140fd14c4aeb2

On macOS I just ran brew install tesseract to get Tesseract 4.
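
To confirm that the install really gave you Tesseract 4 or newer, a small check like the one below can help. It is a sketch that assumes `tesseract --version` prints a first line starting with "tesseract" followed by the version number; the function names are made up for illustration.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_major_version(version_text: str) -> Optional[int]:
    """Pull the major version out of text like 'tesseract 4.1.1'."""
    match = re.search(r"tesseract\s+v?(\d+)", version_text)
    return int(match.group(1)) if match else None


def installed_major_version() -> Optional[int]:
    """Return the installed Tesseract major version, or None if it is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Some builds print the version banner to stderr, so look at both streams.
    return parse_major_version(result.stdout + result.stderr)
```

If installed_major_version() returns 4 or higher, the Docker-free version of the script should be fine to use.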