Scripts for OCR

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

ocr_pdfs

This repo contains the following scripts for extracting text from PDFs.

tika_pdfs.py    - for PDFs that have already been OCR'ed
ocr_pdfs.py     - for PDFs that have not yet been OCR'ed

Comments in each script mark where to change the path to the input PDFs and the path for the results. Both scripts write their results as .txt files.
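
The workflow described above (point a script at a folder of PDFs, get one .txt file per PDF) can be sketched roughly as follows. This is an illustrative sketch, not the code in the repo: the function names are hypothetical, using the `tika` package for already-OCR'ed PDFs is an assumption based on the script name, and using `pdf2image` plus `pytesseract` for the OCR step is likewise an assumption about which libraries might be used.

```python
from pathlib import Path
from typing import Callable, List


def extract_with_tika(pdf_path: Path) -> str:
    """Read the existing text layer of a PDF (assumed backend: Apache Tika)."""
    # Third-party import kept local so the rest of the module works without it.
    from tika import parser  # pip install tika (requires Java)

    parsed = parser.from_file(str(pdf_path))
    # "content" can be None when the PDF has no extractable text.
    return parsed.get("content") or ""


def extract_with_ocr(pdf_path: Path) -> str:
    """OCR each page image of a PDF (assumed backend: pdf2image + pytesseract)."""
    from pdf2image import convert_from_path  # requires poppler
    import pytesseract  # requires the tesseract binary

    pages = convert_from_path(str(pdf_path))
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def convert_folder(
    pdf_dir: Path, out_dir: Path, extract: Callable[[Path], str]
) -> List[Path]:
    """Write one .txt file per PDF in pdf_dir and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        txt_path = out_dir / (pdf_path.stem + ".txt")
        txt_path.write_text(extract(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

With placeholder folder names, a call such as `convert_folder(Path("pdfs"), Path("results"), extract_with_tika)` would cover the already-OCR'ed case, and passing `extract_with_ocr` instead would cover scanned PDFs. Passing the extractor in as a parameter keeps the folder handling identical for both cases.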