Add to README "Sorry, your PDF file is image-based; it does not have any embedded text. It might have been scanned from paper... Tabula isn't able to extract any data from image-based PDFs. Click the Help button for more information."
https://zatech.slack.com/archives/C019Y5J754K/p1601720577120900
zoid:progress-pride-flag: Today at 12:22 PM
Sorry, your PDF file is image-based; it does not have any embedded text. It might have been scanned from paper... Tabula isn't able to extract any data from image-based PDFs. Click the Help button for more information.
ruh-roh
JD Bothma 2 minutes ago
We need to OCR this file first
JD Bothma 1 minute ago
you can use this https://gist.github.com/jbothma/1245ac0ba9c51a8df6676b475ed47fc9
zoid:progress-pride-flag: 1 minute ago
You're a hero
zoid:progress-pride-flag: 1 minute ago
Ty 🙏
JD Bothma 1 minute ago
since tesseract 4 is now stable you can drop the docker stuff
JD Bothma 1 minute ago
are the pages all the right way up?
JD Bothma 1 minute ago
that trips tesseract up
JD Bothma < 1 minute ago
the output file -OCRd.pdf is what you then want to put into Tabula
of course that script is for very techy people
If you have good PDF OCR software you can use that.
If you don't, ask someone for help
Dropped the Docker bit in the gist:
https://gist.github.com/zoidbergwill/e48ddeab1552c868a4c140fd14c4aeb2
On macOS I just ran brew install tesseract to get tesseract 4.
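For anyone who wants the rough shape of the OCR step without reading the gists, here is a minimal sketch in Python. It is not the gist's script. It assumes `pdftoppm` (from Poppler) and `tesseract` 4 are on your PATH, and the function names, the 300 DPI setting and the `eng` language are illustrative choices. It rasterises each page to PNG, hands tesseract a text file listing those images, and asks for the `pdf` output so the result is a searchable `-OCRd.pdf` you can load into Tabula.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def ocr_output_path(pdf: Path) -> Path:
    """report.pdf -> report-OCRd.pdf, the naming mentioned in the chat above."""
    return pdf.with_name(pdf.stem + "-OCRd.pdf")


def rasterise_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> list:
    # pdftoppm writes one PNG per page: <prefix>-1.png, <prefix>-2.png, ...
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_cmd(image_list: Path, out_pdf: Path, lang: str = "eng") -> list:
    # tesseract takes an output *base* and appends ".pdf" itself, so strip it.
    # The input is a text file with one image path per line, which yields a
    # single multi-page PDF.
    return ["tesseract", str(image_list), str(out_pdf.with_suffix("")),
            "-l", lang, "pdf"]


def ocr_pdf(pdf: Path) -> Path:
    """Run the two external tools; requires pdftoppm and tesseract installed."""
    out_pdf = ocr_output_path(pdf)
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterise_cmd(pdf, prefix), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        images = sorted(Path(tmp).glob("page-*.png"))
        image_list = Path(tmp) / "pages.txt"
        image_list.write_text("\n".join(str(p) for p in images) + "\n")
        subprocess.run(tesseract_cmd(image_list, out_pdf), check=True)
    return out_pdf


if __name__ == "__main__" and len(sys.argv) == 2:
    print(ocr_pdf(Path(sys.argv[1])))
```

Run it as `python ocr_pdf.py scanned.pdf` (the script name is hypothetical). As noted in the chat, check that the pages are the right way up first, since the sketch does nothing about rotation.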