This repository provides tools to extract underlined text from any PDF document. The process automates the extraction and recognition of underlined text using a series of steps that convert PDFs into structured data for easy processing.
The process involves several key steps:
- XML Structuring: The PDF is converted into structured XML using PyQuery. This step allows for easier manipulation and querying of the document's content.
- Component Extraction: Specific components that denote underlining in the XML are identified and extracted. This step focuses on retrieving only the underlined parts of the document.
- Image Slice: The underlined sections of the PDF are sliced out and saved into memory as PNG images. This prepares the content for optical character recognition.
- Optical Character Recognition (OCR): pytesseract is used to perform OCR on the sliced images, converting the visual data into text.
- Results Compilation: The extracted text is compiled into an array, providing a structured output of all underlined text elements from the original PDF.
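The component extraction, slicing, and compilation steps can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation: the XML element names, the `bbox` attribute format, the thickness and width thresholds, and the helper names `find_underlines`, `slice_box`, and `compile_results` are all assumptions made for the example. The `ocr` argument is any callable that turns an image into a string, for instance `pytesseract.image_to_string`.

```python
import json
import math
import xml.etree.ElementTree as ET

# Hypothetical XML: the element names and the "bbox" attribute format are
# assumptions for illustration, not the repository's actual schema.
SAMPLE_XML = """
<pdfxml>
  <LTPage bbox="[0, 0, 612, 792]">
    <LTRect bbox="[72.0, 700.0, 200.0, 700.8]" />
    <LTRect bbox="[72.0, 500.0, 300.0, 650.0]" />
    <LTLine bbox="[72.0, 400.0, 180.0, 400.0]" />
  </LTPage>
</pdfxml>
"""


def find_underlines(xml_text, max_thickness=2.0, min_width=5.0):
    """Return bboxes (x0, y0, x1, y1) of thin, wide shapes that look like underlines."""
    root = ET.fromstring(xml_text)
    underlines = []
    for element in root.iter():
        raw = element.get("bbox")
        if raw is None:
            continue
        x0, y0, x1, y1 = (float(v) for v in json.loads(raw))
        # Keep only shapes that are nearly flat and reasonably long.
        if (y1 - y0) <= max_thickness and (x1 - x0) >= min_width:
            underlines.append((x0, y0, x1, y1))
    return underlines


def slice_box(underline, page_height, line_height=14.0, scale=1.0):
    """Map an underline bbox to a (left, top, right, bottom) pixel box
    covering the text that sits directly above it."""
    x0, y0, x1, y1 = underline
    # PDF coordinates start at the bottom-left corner; image coordinates
    # start at the top-left, so the y axis is flipped here.
    left = math.floor(x0 * scale)
    right = math.ceil(x1 * scale)
    top = math.floor((page_height - (y1 + line_height)) * scale)
    bottom = math.ceil((page_height - y0) * scale)
    return (left, top, right, bottom)


def compile_results(slices, ocr):
    """Run `ocr` on each slice and collect the non-empty, stripped strings."""
    texts = []
    for image in slices:
        text = ocr(image).strip()
        if text:
            texts.append(text)
    return texts
```

The box returned by `slice_box` uses the (left, top, right, bottom) order that image libraries such as Pillow expect for cropping. Passing the OCR function in as a parameter keeps the compilation step easy to try out without Tesseract installed.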
To use this repository, you will need Python and several dependencies, including pytesseract for OCR capabilities.
Ensure you have Python installed, then set up a virtual environment for the project (optional but recommended):
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
Install required Python libraries:
pip install -r requirements.txt
For Ubuntu: Run the following commands in your terminal to update your package list and install Tesseract OCR and its development libraries:
sudo apt update
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
For Mac: Use Homebrew to install Tesseract by running the following command:
brew install tesseract
For Windows:
- Download the installer from Tesseract at UB Mannheim.
- It is recommended to install Tesseract into the default directory (C:\Program Files\Tesseract-OCR) to ensure compatibility.
After installing Tesseract, you may need to specify the path to the tesseract executable in your Python script if it's not automatically recognized:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
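If the same script has to run on several platforms, one option is to set the path only when the executable cannot be found on `PATH`. The sketch below uses only the standard library; the helper name `resolve_tesseract_cmd` and the `WINDOWS_DEFAULT` constant are hypothetical, and the fallback path simply assumes the default install directory mentioned above.

```python
import os
import shutil

# Assumed default install location on Windows (see the note above).
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(fallback=WINDOWS_DEFAULT):
    """Return a path to the tesseract executable, or None if none is found."""
    # Prefer whatever is already reachable through PATH.
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise fall back to the given location, if it exists.
    if os.path.isfile(fallback):
        return fallback
    return None


# Usage sketch (requires pytesseract to be installed):
# import pytesseract
# cmd = resolve_tesseract_cmd()
# if cmd is not None:
#     pytesseract.pytesseract.tesseract_cmd = cmd
```

Returning `None` instead of raising leaves it to the caller to decide whether a missing Tesseract install is an error.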