pdf-ocr

This project takes an input PDF and writes its text content to a text file.

Primary Language: Python

What is this project about?

This project takes an input PDF and writes its text content to a text file. It can handle larger PDF files by converting each page to its own text file in parallel. The per-page files are then merged into one text file.
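The convert-in-parallel-then-merge flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: `ocr_page`, `write_page`, and `convert` are hypothetical names, `ocr_page` is a stand-in for whatever OCR call the script really makes, and pages are plain strings here so the sketch runs without any OCR library. A thread pool is used for simplicity; a process pool may suit CPU-bound OCR better.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def ocr_page(page):
    """Hypothetical stand-in for the real OCR call; a page is just a string here."""
    return page.upper()


def write_page(job):
    """Convert one page and write the result to its own text file."""
    index, page, out_dir = job
    path = Path(out_dir) / f"page_{index:05d}.txt"
    path.write_text(ocr_page(page), encoding="utf-8")
    return path


def convert(pages, merged_path):
    """Convert pages in parallel, then merge the per-page files in page order."""
    with tempfile.TemporaryDirectory() as out_dir:
        jobs = [(i, page, out_dir) for i, page in enumerate(pages)]
        with ThreadPoolExecutor() as pool:
            # map() yields results in input order, so the merge keeps page order
            page_files = list(pool.map(write_page, jobs))
        with open(merged_path, "w", encoding="utf-8") as merged:
            for path in page_files:
                merged.write(path.read_text(encoding="utf-8"))
                merged.write("\n")
    return merged_path
```

Writing each page to a separate file means no two workers ever write to the same file, and the merge step is the only place where page order matters.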

How to use this project?

  1. Clone this repository
  2. Install the requirements (pip install -r requirements.txt)
  3. Set the path to your PDF file (the PDF_file variable in main())
  4. Run the script (python main.py)

What is the output?

The output is a single text file containing the text extracted from the PDF.

Improvement areas

The script handles non-English languages and special characters poorly.