PDF Underlined Text Extractor

This repository provides tools for extracting underlined text from PDF documents. The pipeline converts each PDF into structured data, locates the underlines, and runs OCR on the underlined regions to recover their text.

How It Works

The process involves several key steps:

XML Structuring: The PDF is converted into structured XML using PyQuery, which makes the document's content easier to query and manipulate.
Component Extraction: The XML components that denote underlining are identified and extracted, so that only the underlined parts of the document are carried forward.
Image Slicing: The underlined regions of the PDF are sliced out and held in memory as PNG images, ready for optical character recognition.
Optical Character Recognition (OCR): pytesseract runs OCR on each sliced image to convert it into text.
Results Compilation: The recognized text is collected into an array, giving a structured list of every underlined text element in the original PDF.
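
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code. It assumes a hypothetical XML layout in which each page carries a bbox="x0,y0,x1,y1" attribute in PDF points and underlines appear as thin rect or line elements. It uses the standard library's xml.etree.ElementTree in place of PyQuery so that it runs without extra dependencies. The names find_underlines, crop_box, and ocr_underlined, the thresholds, and the text_height value are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one 612 x 792 pt page with a thin rect (an underline)
# and a tall rect (a box that should be ignored).
SAMPLE_XML = """
<pages>
  <page id="1" bbox="0,0,612,792">
    <rect bbox="72,700,200,701"/>
    <rect bbox="72,500,300,650"/>
  </page>
</pages>
"""


def parse_bbox(value):
    """Turn 'x0,y0,x1,y1' into a tuple of four floats."""
    x0, y0, x1, y1 = (float(v) for v in value.split(","))
    return x0, y0, x1, y1


def find_underlines(xml_text, max_height=2.0, min_width=5.0):
    """Return (page_top, bbox) pairs for thin, wide rect/line elements."""
    root = ET.fromstring(xml_text)
    hits = []
    for page in root.iter("page"):
        page_top = parse_bbox(page.get("bbox"))[3]
        for el in page.iter():
            if el.tag not in ("rect", "line"):
                continue
            x0, y0, x1, y1 = parse_bbox(el.get("bbox"))
            # Assumed heuristic: an underline is much wider than it is tall.
            if abs(y1 - y0) <= max_height and (x1 - x0) >= min_width:
                hits.append((page_top, (x0, y0, x1, y1)))
    return hits


def crop_box(page_top, bbox, dpi=200, text_height=14.0):
    """Map an underline bbox (PDF points, origin bottom-left) to a pixel
    crop box (left, upper, right, lower) with origin top-left, covering
    the text assumed to sit just above the underline."""
    x0, y0, x1, y1 = bbox
    to_px = lambda pts: round(pts * dpi / 72.0)  # 1 pt = 1/72 inch
    upper = to_px(page_top - (y1 + text_height))
    lower = to_px(page_top - y0)
    return to_px(x0), max(upper, 0), to_px(x1), lower


def ocr_underlined(page_image, boxes):
    """OCR each crop of a PIL page image rendered at the same dpi.
    Requires pytesseract and the Tesseract engine to be installed."""
    import pytesseract  # imported lazily so the rest runs without it
    return [pytesseract.image_to_string(page_image.crop(b)).strip()
            for b in boxes]


underlines = find_underlines(SAMPLE_XML)
boxes = [crop_box(top, bbox, dpi=72) for top, bbox in underlines]
print(boxes)  # [(72, 77, 200, 92)]
```

The y-axis flip in crop_box matters because PDF coordinates start at the bottom-left corner of the page while image coordinates start at the top-left. Rendering the page to an image is not shown here, since the source does not say which renderer the repository uses.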

Installation and Setup

To use this repository, you need Python, the packages listed in requirements.txt, and the Tesseract OCR engine, which pytesseract wraps to provide OCR.

Installing Python Dependencies

Ensure you have Python installed, then set up a virtual environment for the project (optional but recommended):

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install required Python libraries:

pip install -r requirements.txt

Installing Tesseract OCR

For Ubuntu: Run the following commands in your terminal to update your package list and install Tesseract OCR and its development libraries:

sudo apt update
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

For Mac: Use Homebrew to install Tesseract by running the following command:

brew install tesseract

For Windows:

Download the installer from Tesseract at UB Mannheim.
Install Tesseract into the default directory (C:\Program Files\Tesseract-OCR) to ensure compatibility. If pytesseract does not find the tesseract executable automatically after installation, set its path in your Python script:

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
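
As a hedged alternative to hard-coding the path, the sketch below checks PATH first and falls back to the default Windows location only when needed. The helper names find_tesseract and configure_pytesseract are hypothetical and not part of this repository or of pytesseract. Only the tesseract_cmd attribute shown above is pytesseract's own.

```python
import os
import shutil

# Default install location from the Windows instructions above.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallbacks=(WINDOWS_DEFAULT,), which=shutil.which):
    """Return a path to the tesseract executable, or None if not found."""
    on_path = which("tesseract")
    if on_path:
        return on_path
    # Not on PATH: try known install locations in order.
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


def configure_pytesseract():
    """Point pytesseract at whatever find_tesseract() locates."""
    import pytesseract  # imported lazily; requires pytesseract installed
    path = find_tesseract()
    if path is None:
        raise RuntimeError("Tesseract not found; install it or add it to PATH")
    pytesseract.pytesseract.tesseract_cmd = path
    return path


print(find_tesseract())  # path on this machine, or None
```

Calling configure_pytesseract() once at startup keeps the same script usable on Ubuntu, Mac, and Windows without editing the path by hand.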

sasha-korovkina/pdfUnderlinedExtractor
