This project extracts text from a set of 75 images using three different Optical Character Recognition (OCR) tools: EasyOCR, PyTesseract, and TrOCR. The results are then analyzed to identify discrepancies, and a consensus text is determined and corrected for common OCR errors. The final output is saved as a JSON file.
- Project Structure
- Dependencies
- Installation
- Usage
- Error Handling
- Performance Optimization
- Output Quality
- Logging
- Future Improvements
.
├── selected_images/ # Directory containing the 75 images
├── ocr_results.json # Output JSON file with extracted and corrected text
├── ocr_processing.log # Log file for tracking processing details and errors
├── main.py # Main script for executing OCR and text processing
└── README.md # This readme file
The project requires the following libraries:
- Pillow (PIL)
- easyocr
- pytesseract
- transformers (for TrOCR)
- torch
- concurrent.futures
- tqdm
- Levenshtein (for choosing the best text among 3 models)
Ensure you have the required tools installed:
- Tesseract OCR: Tesseract Installation Guide
- Python 3.8 or higher
- Clone the repository:
    git clone https://github.com/yujansaya/raha_beach_ocr/
    cd <your repository-directory>
- Install the Python dependencies:
    pip install -r requirements.txt
- Ensure Tesseract OCR is installed:
  - On Linux:
      sudo apt-get install tesseract-ocr
  - On macOS:
      brew install tesseract
  - On Windows: Download the Tesseract installer
- Prepare the input images:
  - Place the 75 images in the selected_images/ directory.
- Run the main script:
    python main.py
- View the results:
  - The extracted and corrected text will be saved in ocr_results.json.
  - Logs can be found in ocr_processing.log.
The script includes comprehensive error handling:
- Logging Errors: Errors encountered during OCR processing are logged in ocr_processing.log.
- Handling Missing Files: Unsupported file formats or missing files are logged and skipped.
To improve performance:
- Parallel Processing: Uses ThreadPoolExecutor for concurrent processing of images.
- Batch Processing: Processes images in batches for TrOCR to utilize the GPU more efficiently.
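The two techniques above can be sketched as follows. The run_ocr function is a placeholder standing in for a real OCR call, and process_images, batched, and the default worker count are hypothetical names and values chosen for this example, not taken from main.py.

```python
from concurrent.futures import ThreadPoolExecutor


def run_ocr(image_path):
    """Placeholder for a real OCR call (hypothetical); returns dummy text."""
    return f"text from {image_path}"


def process_images(image_paths, max_workers=4):
    """Run OCR on each path concurrently and return {path: text}."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs.
        texts = list(executor.map(run_ocr, image_paths))
    return dict(zip(image_paths, texts))


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each batch produced by batched could then be passed to the TrOCR model in a single forward pass, which is generally how batching keeps a GPU busier than feeding it one image at a time.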
The preprocess_image function preprocesses images that contain very thick, big letters, which Tesseract typically fails to detect.
The majority_vote, select_best_text, and similarity functions compare the outputs of the three models and choose the best answer using Levenshtein distance.
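One way such a selection could work is sketched below. The function names similarity and select_best_text match those mentioned above, but the bodies are assumptions rather than the project's actual implementation, and a pure-Python edit distance stands in for the Levenshtein library so the example has no dependencies.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def select_best_text(candidates):
    """Return the candidate that is most similar, on average, to the others."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    if len(candidates) == 1:
        return candidates[0]

    def score(index):
        others = [c for k, c in enumerate(candidates) if k != index]
        return sum(similarity(candidates[index], o) for o in others) / len(others)

    best_index = max(range(len(candidates)), key=score)
    return candidates[best_index]
```

For example, given the three readings "HELLO WORLD", "HELL0 WORLD", and "HELLO W0RLD", the first is one edit away from each of the others while the other two are two edits apart, so the first is selected.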
The script logs detailed information about the processing steps and errors:
- Log File: ocr_processing.log
- Log Levels: Includes information and error levels.
- Model Fine-Tuning: Fine-tune OCR models on a dataset similar to the target images for better accuracy.
- Additional OCR Tools: Explore other OCR tools, such as Google Cloud Vision API, Amazon Textract, or OpenCV, to enhance text extraction quality.
- Advanced Error Correction: Implement machine learning models for context-based error correction.