OCR Text Extraction Project

Overview

This project extracts text from a set of 75 images using three different Optical Character Recognition (OCR) tools: EasyOCR, PyTesseract, and TrOCR. The results are then analyzed to identify discrepancies, and a consensus text is determined and corrected for common OCR errors. The final output is saved as a JSON file.

Project Structure
Dependencies
Installation
Usage
Error Handling
Performance Optimization
Output Quality
Logging
Future Improvements

Project Structure

.
├── selected_images/            # Directory containing the 75 images
├── ocr_results.json            # Output JSON file with extracted and corrected text
├── ocr_processing.log          # Log file for tracking processing details and errors
├── main.py                     # Main script for executing OCR and text processing
└── README.md                   # This readme file

Dependencies

The project requires the following libraries:

Pillow (PIL)
easyocr
pytesseract
transformers (for TrOCR)
torch
concurrent.futures (part of the Python standard library; no separate installation needed)
tqdm
Levenshtein (for choosing the best text among 3 models)

Ensure you have the required tools installed:

Tesseract OCR: Tesseract Installation Guide
Python 3.8 or higher

Installation

Clone the repository:

git clone https://github.com/yujansaya/raha_beach_ocr/
cd raha_beach_ocr

Install the Python dependencies:
```
pip install -r requirements.txt
```
Ensure Tesseract OCR is installed:
- On Linux: sudo apt-get install tesseract-ocr
- On macOS: brew install tesseract
- On Windows: Download the Tesseract installer

Usage

Prepare the input images:
- Place the 75 images in the selected_images/ directory.
Run the main script:
```
python main.py
```
View the results:
- The extracted and corrected text will be saved in ocr_results.json.
- Logs can be found in ocr_processing.log.

Example Command

python main.py

Error Handling

The script handles errors in two ways:

Logging Errors: Errors encountered during OCR processing are logged in ocr_processing.log.
Handling Missing Files: Unsupported file formats or missing files are logged and skipped.
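The log-and-skip behavior described above could look roughly like the sketch below. It is not the code in main.py: the helper names collect_valid_images and safe_ocr, the logger name, and the list of supported extensions are all assumptions made for illustration.

```python
import logging
from pathlib import Path

# Hypothetical logger name; the real script may use a different one.
logger = logging.getLogger("ocr_processing")

# Assumed set of supported extensions; the real script may accept others.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def collect_valid_images(paths):
    """Return the paths that exist and have a supported extension.

    Anything else is logged at ERROR level and skipped.
    """
    valid = []
    for raw in paths:
        path = Path(raw)
        if not path.is_file():
            logger.error("Missing file, skipping: %s", path)
            continue
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            logger.error("Unsupported format, skipping: %s", path)
            continue
        valid.append(path)
    return valid


def safe_ocr(path, ocr_func):
    """Run one OCR callable on one image; log and return "" on failure."""
    try:
        return ocr_func(path)
    except Exception:
        # logger.exception records the traceback along with the message.
        logger.exception("OCR failed for %s", path)
        return ""
```

Wrapping each OCR call this way means a single unreadable image produces an empty string and a log entry instead of stopping the whole run.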

Performance Optimization

To improve performance:

Parallel Processing: Uses ThreadPoolExecutor for concurrent processing of images.
Batch Processing: Processes images in batches for TrOCR to use the GPU more efficiently.
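As a minimal sketch of the two techniques above, the helpers below run a per-image function through ThreadPoolExecutor and split a list of images into fixed-size batches. The names run_parallel and batched, the worker count, and the batch size are illustrative assumptions; the actual values used in main.py are not stated here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(image_paths, ocr_func, max_workers=4):
    """Apply ocr_func to each path concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as image_paths.
        return list(pool.map(ocr_func, image_paths))


def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With 75 images and an assumed batch size of 8, batched produces ten batches, the last one holding the remaining three images. Each batch would then be passed to the TrOCR model in a single forward pass.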

Output Quality

The preprocess_image function preprocesses images that contain very thick, large letters, which Tesseract typically fails to detect.

Selecting the Best Text Among the 3 Models

The majority_vote, select_best_text, and similarity functions compare the outputs of the three models and choose the best answer using Levenshtein distance.
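One way such a selection could work is sketched below. The function names match those mentioned above, but the bodies are guesses rather than the code in main.py. The sketch also uses a pure-Python levenshtein helper as a stand-in for the Levenshtein package so that it runs without extra dependencies.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def majority_vote(texts):
    """Return the text produced by at least two engines, or None."""
    if not texts:
        return None
    text, count = Counter(texts).most_common(1)[0]
    return text if count >= 2 else None


def select_best_text(texts):
    """Pick the majority text, else the one most similar to the others."""
    if not texts:
        raise ValueError("texts must not be empty")
    voted = majority_vote(texts)
    if voted is not None:
        return voted

    def total_similarity(index):
        return sum(similarity(texts[index], other)
                   for k, other in enumerate(texts) if k != index)

    return texts[max(range(len(texts)), key=total_similarity)]
```

When no two engines agree exactly, the candidate with the highest total similarity to the other two is the one closest to the "middle" of the three outputs, which makes it a reasonable consensus choice.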

Logging

The script logs detailed information about the processing steps and errors:

Log File: ocr_processing.log
Log Levels: Records are written at the INFO and ERROR levels.
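A logging setup matching this description might look like the sketch below. The function name configure_logging, the logger name, and the message format are assumptions; only the log file name and the two levels come from the description above.

```python
import logging


def configure_logging(log_path="ocr_processing.log"):
    """Send INFO-and-above records to log_path with timestamps."""
    # Hypothetical logger name; the real script may use the root logger.
    logger = logging.getLogger("ocr_processing")
    logger.setLevel(logging.INFO)
    # Avoid attaching a second handler if this is called more than once.
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

After calling configure_logging(), a call such as logger.info("Processed %s", filename) appends a timestamped line to the log file, while DEBUG messages are filtered out.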

Future Improvements

Model Fine-Tuning: Fine-tune OCR models on a dataset similar to the target images for better accuracy.
Additional OCR Tools: Explore other tools, such as Google Cloud Vision API, Amazon Textract, or OpenCV, to enhance text extraction quality.
Advanced Error Correction: Implement machine learning models for context-based error correction.
