keep-the-receipts/data-extraction

Add to README "Sorry, your PDF file is image-based; it does not have any embedded text. It might have been scanned from paper... Tabula isn't able to extract any data from image-based PDFs. Click the Help button for more information."

Closed this issue · 2 comments

https://zatech.slack.com/archives/C019Y5J754K/p1601720577120900

zoid:progress-pride-flag: Today at 12:22 PM
Sorry, your PDF file is image-based; it does not have any embedded text. It might have been scanned from paper... Tabula isn't able to extract any data from image-based PDFs. Click the Help button for more information.
ruh-roh

JD Bothma 2 minutes ago
We need to OCR this file first

JD Bothma 1 minute ago
you can use this https://gist.github.com/jbothma/1245ac0ba9c51a8df6676b475ed47fc9

zoid:progress-pride-flag: 1 minute ago
You're a hero

zoid:progress-pride-flag: 1 minute ago
Ty 🙏

JD Bothma 1 minute ago
since tesseract 4 is now stable you can drop the docker stuff

JD Bothma 1 minute ago
are the pages all the right way up?

JD Bothma 1 minute ago
that trips tesseract up

JD Bothma < 1 minute ago
the output file (the one ending in -OCRd.pdf) is what you then want to put into Tabula

of course that script is for very techy people

If you have good PDF OCR software you can use that.

If you don't, ask someone for help
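
On the "are the pages all the right way up?" point: a rough sketch of how you could check a page image before OCRing it, using Tesseract's orientation-and-script-detection mode (`--psm 0`). Assumptions here: the page has already been exported as an image, the `osd` traineddata is installed, and the OSD report contains a `Rotate: N` line. The function names are made up for this example and are not from the gist.

```python
import re
import subprocess


def parse_rotation(osd_text: str) -> int:
    """Pull the suggested rotation (in degrees) out of Tesseract's OSD report.

    Assumes the report has a line like "Rotate: 90"; returns 0 if it doesn't.
    """
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


def detect_rotation(image_path: str) -> int:
    """Run Tesseract in OSD-only mode on one page image (hypothetical helper)."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_rotation(result.stdout)
```

If `detect_rotation("page-1.png")` comes back non-zero, the page probably needs rotating before you OCR it.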

Dropped the Docker bit in the gist:

https://gist.github.com/zoidbergwill/e48ddeab1552c868a4c140fd14c4aeb2

On macOS I just ran brew install tesseract to get Tesseract 4
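
For anyone who wants the general shape of the OCR step without Docker, here is a minimal sketch. It is not a copy of either gist. It assumes `tesseract` plus the poppler tools `pdftoppm` and `pdfunite` are on your PATH, and it guesses that the output should sit next to the input with an `-OCRd.pdf` suffix. The function names and the 300 DPI default are choices made for this example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def ocr_output_path(pdf_path: Path) -> Path:
    """scan.pdf -> scan-OCRd.pdf (naming is an assumption based on the thread)."""
    return pdf_path.with_name(pdf_path.stem + "-OCRd.pdf")


def ocr_pdf(pdf_path: Path, dpi: int = 300) -> Path:
    """Rasterise each page, OCR it into a searchable PDF, then merge the pages."""
    out_path = ocr_output_path(pdf_path)
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        # 1. One PNG per page: page-1.png, page-2.png, ...
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(tmp_dir / "page")],
            check=True,
        )
        # 2. OCR each image; the trailing "pdf" makes Tesseract write <base>.pdf
        page_pdfs = []
        for image in sorted(tmp_dir.glob("page-*.png")):
            base = image.with_suffix("")
            subprocess.run(["tesseract", str(image), str(base), "pdf"], check=True)
            page_pdfs.append(str(base) + ".pdf")
        if not page_pdfs:
            raise RuntimeError(f"no pages were extracted from {pdf_path}")
        # 3. Stitch the per-page PDFs back together
        if len(page_pdfs) == 1:
            shutil.copyfile(page_pdfs[0], out_path)
        else:
            subprocess.run(["pdfunite", *page_pdfs, str(out_path)], check=True)
    return out_path
```

Calling `ocr_pdf(Path("scan.pdf"))` should leave a `scan-OCRd.pdf` next to the original, which is the file to load into Tabula.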