This project focuses on extracting numerical values and their corresponding units from a set of images, each associated with specific entities such as weight, height, or voltage. The goal is to develop a machine learning pipeline that accurately predicts these values and units from the images, overcoming challenges like variable image quality, different text fonts, and OCR (Optical Character Recognition) errors.
Our approach involves several key steps:
- Image Text Extraction using OCR:
  - We use EasyOCR with GPU acceleration to extract text from images efficiently.
- Text Pre-processing:
  - Clean and normalize the extracted text to reduce noise and correct common OCR misrecognitions.
- Value and Unit Extraction:
  - Using regular expressions tailored to each entity, we extract numerical values and units from the pre-processed text.
- Model Evaluation:
  - Compare the extracted values and units with the ground truth to evaluate accuracy.
- Prediction on Test Data:
  - Apply the pipeline to the test dataset and generate predictions.
- EasyOCR: An open-source OCR library that supports GPU acceleration and multiple languages. Fast, reliable text extraction from images is crucial for our task.
- EasyOCR was used to extract text from the images.
- Observed that raw OCR results often contained noise and misrecognized characters.
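A minimal sketch of the OCR step, assuming EasyOCR's `Reader` class and its `readtext` method with English as the only language. The helper names `make_reader`, `extract_text`, and `join_ocr_fragments` are illustrative and do not come from the project itself.

```python
from typing import Any, Iterable


def make_reader(use_gpu: bool = True) -> Any:
    """Create an EasyOCR reader (hypothetical helper; English only is an assumption)."""
    import easyocr  # imported lazily so the rest of the module works without it

    return easyocr.Reader(["en"], gpu=use_gpu)


def join_ocr_fragments(fragments: Iterable[str]) -> str:
    """Join the text fragments returned by OCR into one string, skipping blanks."""
    return " ".join(f.strip() for f in fragments if f and f.strip())


def extract_text(image_path: str, reader: Any) -> str:
    """Run OCR on one image and return the raw, unprocessed text."""
    # detail=0 asks EasyOCR for plain strings instead of (box, text, confidence) tuples.
    fragments = reader.readtext(image_path, detail=0)
    return join_ocr_fragments(fragments)
```

Creating the reader once and reusing it for every image avoids reloading the OCR models on each call.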
- Lowercasing: Converted all text to lowercase for uniformity.
- Character Replacement: Fixed common misrecognized characters (e.g., "O" → "0", "l" → "1").
- Removing Unwanted Characters: Kept only alphanumeric characters, percentages, periods, and spaces.
- Whitespace Normalization: Streamlined text by removing extra spaces.
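The four pre-processing steps above could be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the function signature is a guess, character replacement is restricted to letters that touch a digit (so ordinary words such as "volt" are left alone), and unwanted characters are replaced with a space rather than deleted.

```python
import re


def clean_up_text(text: str) -> str:
    """Normalize raw OCR text (hypothetical signature)."""
    # 1. Lowercasing for uniformity.
    text = text.lower()
    # 2. Character replacement, limited to digit contexts (an assumption):
    #    an 'o' touching a digit is treated as a zero, e.g. "5o0" -> "500".
    text = re.sub(r"(?<=\d)o|o(?=\d)", "0", text)
    #    an 'l' directly before a digit is treated as a one, e.g. "l2" -> "12".
    #    A trailing 'l' as in "5l" is kept, since it may mean litres.
    text = re.sub(r"l(?=\d)", "1", text)
    # 3. Keep only alphanumerics, percent signs, periods, and spaces.
    text = re.sub(r"[^a-z0-9%. ]", " ", text)
    # 4. Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_up_text("Net  Wt: 5O0 G!")` yields `"net wt 500 g"`.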
- Used an entity-specific unit map to associate entities with their allowed units.
- Created regex patterns to match numerical values followed by units.
- Implemented fallback mechanisms to handle cases where initial extraction fails by assigning default units based on the entity.
- Introduced a `clean_up_text` function to handle OCR misrecognitions.
- Adjusted regex patterns to account for values and units without spaces (e.g., "74m").
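One way the unit map, regex matching, and fallback could fit together is sketched below. The map contents, the default units, and the name `extract_value_unit` are all hypothetical; the project's real unit lists are not shown in this README. The input is assumed to be already lowercased and cleaned.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical entity-specific unit map: abbreviation -> canonical unit name.
ENTITY_UNIT_MAP: Dict[str, Dict[str, str]] = {
    "weight": {"kg": "kilogram", "mg": "milligram", "g": "gram"},
    "height": {"mm": "millimetre", "cm": "centimetre", "m": "metre"},
    "voltage": {"kv": "kilovolt", "mv": "millivolt", "v": "volt"},
}

# Hypothetical default unit per entity, used when no unit is found.
DEFAULT_UNIT: Dict[str, str] = {"weight": "gram", "height": "centimetre", "voltage": "volt"}

NUMBER = r"\d+(?:\.\d+)?"


def extract_value_unit(text: str, entity: str) -> Optional[Tuple[float, str]]:
    """Return (value, unit) for the entity, or None if the text has no number."""
    units = ENTITY_UNIT_MAP.get(entity, {})
    if units:
        # Longest abbreviations first, so "mm" is tried before "m".
        alternation = "|".join(re.escape(u) for u in sorted(units, key=len, reverse=True))
        # \s* allows "74 m" and "74m"; the lookahead rejects "5 g" inside "5 green".
        pattern = re.compile(rf"({NUMBER})\s*({alternation})(?![a-z])")
        match = pattern.search(text)
        if match:
            return float(match.group(1)), units[match.group(2)]
    # Fallback: take the first bare number and assign the entity's default unit.
    number = re.search(NUMBER, text)
    if number and entity in DEFAULT_UNIT:
        return float(number.group()), DEFAULT_UNIT[entity]
    return None
```

With these assumed maps, `extract_value_unit("height 74m", "height")` returns `(74.0, "metre")`, and `extract_value_unit("output 230", "voltage")` falls back to `(230.0, "volt")`.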
- Calculated training accuracy by comparing extracted values and units with ground truth.
- Observed accuracy improvements with each iteration of text pre-processing and extraction enhancements.
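The accuracy calculation could look roughly like this, assuming a prediction counts as correct only when both the value and the unit match the ground truth. The exact metric used in the project is not stated, so the matching rule, the tolerance, and the function names are assumptions.

```python
import math
from typing import Optional, Sequence, Tuple

Prediction = Optional[Tuple[float, str]]


def is_correct(predicted: Prediction, truth: Prediction, rel_tol: float = 1e-6) -> bool:
    """A prediction is correct when the value is numerically close and the unit is identical."""
    if predicted is None or truth is None:
        return False
    return math.isclose(predicted[0], truth[0], rel_tol=rel_tol) and predicted[1] == truth[1]


def accuracy(predictions: Sequence[Prediction], ground_truth: Sequence[Prediction]) -> float:
    """Fraction of samples whose extracted (value, unit) matches the ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not predictions:
        return 0.0
    hits = sum(is_correct(p, t) for p, t in zip(predictions, ground_truth))
    return hits / len(predictions)
```

Re-running this on the training set after each change to the pre-processing or regex patterns gives a simple way to track whether an iteration helped.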
Through systematic refinement of our OCR extraction and text pre-processing methods, we've significantly improved the accuracy of numerical value and unit extraction. The final pipeline handles common OCR errors and variations in text formats, providing a robust solution for entity recognition tasks.