This project takes an input PDF and writes its text content to a text file. It can handle large PDF files by converting each page to its own text file in parallel, then merging the page files into a single text file.
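
The page-per-file approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in `main.py`: the names `write_page` and `pages_to_text`, the `page_00000.txt` naming scheme, and the use of a thread pool are all assumptions, and the actual page-to-text step is passed in as a placeholder `extract` function because it depends on the OCR or PDF library in use.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Sequence


def write_page(index: int, page: object,
               extract: Callable[[object], str], out_dir: Path) -> Path:
    """Convert one page to text and write it to its own numbered file."""
    part = out_dir / f"page_{index:05d}.txt"  # hypothetical naming scheme
    part.write_text(extract(page), encoding="utf-8")
    return part


def pages_to_text(pages: Sequence[object],
                  extract: Callable[[object], str],
                  out_dir: Path,
                  merged_name: str = "output.txt",
                  workers: int = 4) -> Path:
    """Convert pages in parallel, then merge the page files in page order."""
    out_dir.mkdir(parents=True, exist_ok=True)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so the merge keeps page order
        # even if pages finish converting out of order.
        parts = list(pool.map(
            lambda pair: write_page(pair[0], pair[1], extract, out_dir),
            enumerate(pages),
        ))

    merged = out_dir / merged_name
    with merged.open("w", encoding="utf-8") as out:
        for part in parts:
            out.write(part.read_text(encoding="utf-8"))
            out.write("\n")  # separate pages with a newline
    return merged
```

Here `pages` could be any sequence of page objects (for example, page images rendered from the PDF) and `extract` any function that turns one page into a string. Writing each page to its own file first keeps memory use bounded for large documents, at the cost of some extra disk I/O during the merge.
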
- Clone this repository
- Install the requirements (`pip install -r requirements.txt`)
- Change the path to the PDF file (`PDF_file` in `main()`)
- Run the script (`python main.py`)
The output is a single text file containing the extracted text.
The script handles non-English languages and special characters poorly.

# pdf-ocr