
This is a Python script that converts any PDF to text using Tesseract-OCR (for text-locked PDFs).


PyPDFtoText

  • This is a Python script that converts any PDF to text using Tesseract-OCR. I made it to process PDFs in which the text is not selectable.
  • Please do not use it on normal PDFs from which you can simply copy the text, as OCR is a heavy and slow task.
  • It works best on simple PDFs laid out in a plain book format (results also depend on your Tesseract installation). More updates may come.
  • It uses the Tesseract-OCR binaries and the pytesseract, PyMuPDF and PIL packages.
  • If you cannot install fitz, try "pip install PyMuPDF".
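
The points above can be sketched roughly as follows. This is not the script from this repository, only a minimal illustration of how PyMuPDF (imported as fitz), PIL and pytesseract are commonly combined. The function names pdf_to_text and zoom_for_dpi, the 300 DPI default and the "eng" language code are assumptions made for the example. It also assumes the Tesseract-OCR binary is installed and on your PATH, and a reasonably recent PyMuPDF (older releases spell some methods differently).

```python
def zoom_for_dpi(dpi: int) -> float:
    """PDF pages are measured in points (72 per inch), so zoom = dpi / 72."""
    return dpi / 72.0


def pdf_to_text(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Return the text of a PDF, running OCR only on pages without a text layer.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Imported here so zoom_for_dpi stays usable without the OCR stack installed.
    import fitz  # provided by the PyMuPDF package
    import pytesseract
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Cheap path: if the page already has selectable text, skip OCR.
            text = page.get_text().strip()
            if not text:
                # Render the page to an RGB bitmap at the requested resolution.
                pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                # Slow path: let Tesseract recognise the text in the image.
                text = pytesseract.image_to_string(img, lang=lang).strip()
            pages.append(text)
    return "\n\n".join(pages)
```

For example, print(pdf_to_text("scanned.pdf")) would print the recognised text of a file named scanned.pdf (a placeholder name). Checking for an existing text layer first is one way to avoid spending OCR time on PDFs whose text can already be copied.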