pdf2text

This repo contains code to extract text from PDFs, pictures, and scanned documents.

Primary language: Python

OCR

OCR (Optical Character Recognition) identifies words in a picture or scanned document and converts them into machine-readable text that a computer can process further. Although the technology is mature and uses advanced techniques, it still quite often produces erroneous output.

This repo contains the code for the BYOB Challenge: OCR De-noising.

Steps to run the program:

  1. Clone the repository using:

    git clone https://github.com/ViswanathaReddyGajjala/pdf2text.git

  2. Go to pdftoimage.com and convert the PDF file to JPG images.

  3. Place the PDF and the corresponding images in the /data/demo folder.

  4. Now, run the demo.py file.

  5. The result is printed on the command line (Windows) or in the terminal (Ubuntu).

    • Note: Before running, remove any files left over from previous runs in the /data/demo and /data/result folders.
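
The cleanup described in the note can be scripted. The following is a minimal sketch and not part of the repository: the helper name `clear_generated_files` is hypothetical, and the folder names are taken from the note and assumed to be relative to the repository root.

```python
from pathlib import Path


def clear_generated_files(folders):
    """Delete regular files directly inside each folder and return their paths.

    Subdirectories are left untouched and missing folders are skipped.
    """
    removed = []
    for folder in folders:
        folder = Path(folder)
        if not folder.is_dir():
            continue  # nothing to clean in a folder that does not exist
        for entry in sorted(folder.iterdir()):
            if entry.is_file():
                entry.unlink()
                removed.append(entry)
    return removed
```

Calling `clear_generated_files(["data/demo", "data/result"])` from the repository root before copying in a new PDF and its images would empty both folders. Adjust the paths if your checkout uses different folder names.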

Results

  • Localized text proposals on a PDF.

  • Bounding box coordinates can be found in data/results/*.txt.
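
To post-process the detections, the coordinate files can be read back into Python. This is a sketch under a stated assumption: this README does not document the file format, so the code assumes each non-empty line holds the comma-separated integer coordinates of one box. The function name `read_boxes` is hypothetical. Inspect one of the generated files first and adapt the parsing if the layout differs.

```python
from pathlib import Path


def read_boxes(results_dir):
    """Map each .txt file name in results_dir to a list of coordinate tuples.

    ASSUMPTION: one box per line, integers separated by commas.
    """
    boxes = {}
    for txt_file in sorted(Path(results_dir).glob("*.txt")):
        rows = []
        for line in txt_file.read_text().splitlines():
            line = line.strip()
            if not line:
                continue  # skip blank lines
            rows.append(tuple(int(value) for value in line.split(",")))
        boxes[txt_file.name] = rows
    return boxes
```

For example, `read_boxes("data/results")` would return a dictionary keyed by file name, with one tuple of integers per detected box.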