Sample OCR

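Set up your environment:
- Install Python 3 (if not already installed).
- Install the required libraries (see Step 1): Pillow for image handling, pytesseract for OCR, and opencv-python for image processing.
- Optionally, create and activate a virtual environment first. The commands below are for macOS/Linux; on Windows, run `.venv\Scripts\activate` instead of the `source` line.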
```
python3 -m venv .venv
source .venv/bin/activate
```
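Install Tesseract:
- Download and install the Tesseract OCR engine for your operating system. pytesseract only wraps the tesseract executable, so the engine itself must be installed separately.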
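Once Tesseract is installed, it can help to confirm that Python can find the executable before running any OCR. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` and the `FALLBACK_PATHS` list are illustrative assumptions, not part of pytesseract. The Windows path is the one used in the script in Step 2.

```python
import os
import shutil

# Hypothetical fallback locations; adjust for your installation.
FALLBACK_PATHS = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
]

def find_tesseract(command="tesseract", fallbacks=FALLBACK_PATHS):
    """Return the path to the Tesseract executable, or None if not found."""
    # shutil.which searches the directories listed in PATH
    found = shutil.which(command)
    if found:
        return found
    # Otherwise try the known install locations
    for path in fallbacks:
        if os.path.isfile(path):
            return path
    return None

if __name__ == "__main__":
    path = find_tesseract()
    if path:
        print("Tesseract found at:", path)
    else:
        print("Tesseract not found; install it or add it to PATH.")
```

If the executable is found somewhere that is not on PATH, assign that path to `pytesseract.pytesseract.tesseract_cmd`, as the script in Step 2 does.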
Write the Python script:

Here’s a detailed script to get you started:

Step 1: Install the necessary libraries

```
pip install pillow pytesseract opencv-python
```

Step 2: Write the Python code

```python
import cv2
import pytesseract
from PIL import Image  # not used below; available for additional image handling
import numpy as np  # not used directly; OpenCV images are NumPy arrays

# Path to the tesseract executable. Only needed when tesseract is not on your
# PATH (common on Windows); update or remove this line to match your installation.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def preprocess_image(image_path):
    # Read the image using OpenCV (imread returns None if the file cannot be read)
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Optionally, reduce noise before thresholding
    # gray = cv2.GaussianBlur(gray, (5, 5), 0)

    # Apply Otsu thresholding. With THRESH_OTSU the threshold is computed
    # automatically, so the value passed here (0) is ignored.
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return thresh

def extract_text_from_image(image_path):
    # Preprocess the image
    processed_image = preprocess_image(image_path)

    # Use pytesseract to extract text
    text = pytesseract.image_to_string(processed_image)

    return text

if __name__ == "__main__":
    # Path to the image
    image_path = 'path_to_your_image.jpg'  # Update with your image path

    # Extract text
    text = extract_text_from_image(image_path)
    print("Extracted Text:\n" + text)
```

Step 3: Explanation of the Code

Importing Libraries:
- cv2 is OpenCV for image processing.
- pytesseract is the Python wrapper for Tesseract.
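- PIL (provided by Pillow, the maintained fork of the Python Imaging Library) for additional image handling; the script imports it but does not call it directly.
- numpy for numerical operations; OpenCV represents images as NumPy arrays, although the script does not call numpy directly.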
Preprocessing the Image:
- Load the image using OpenCV.
- Convert the image to grayscale, which is often more suitable for OCR.
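- Apply Otsu thresholding to binarize the image, improving the contrast between the text and the background. Because the script passes cv2.THRESH_OTSU, OpenCV picks the threshold automatically and ignores the threshold value given in the call.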
Extracting Text:
- pytesseract.image_to_string() is used to perform OCR on the preprocessed image.
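`pytesseract.image_to_string()` also accepts optional `lang` and `config` arguments, where `config` carries Tesseract command-line options such as the page segmentation mode (`--psm`). The sketch below shows one way to pass them; the helper names `build_config` and `extract_text` are hypothetical, and the default values are assumptions to tune for your images.

```python
def build_config(psm=3, oem=3):
    """Build a Tesseract command-line option string.

    psm: page segmentation mode (3 = fully automatic, 6 = single uniform
         block of text, 7 = single text line)
    oem: OCR engine mode (3 = default, based on what is available)
    """
    return f"--oem {oem} --psm {psm}"

def extract_text(image, lang="eng", psm=3):
    """Run OCR on a NumPy array (from OpenCV) or a PIL image."""
    # Imported here so build_config can be used without pytesseract installed
    import pytesseract

    return pytesseract.image_to_string(image, lang=lang, config=build_config(psm))
```

For example, `extract_text(processed_image, psm=6)` may work better than the default when the image contains a single block of text.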
Running the Script:
- Update the path to your image and the Tesseract executable.
- Run the script to see the extracted text from the image.
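Instead of editing the paths inside the script each time, the image path and the Tesseract location could be taken from the command line. This is a minimal sketch with `argparse`; the function names and the `--tesseract-cmd` option are assumptions for illustration, not part of the original script.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Extract text from an image with Tesseract."
    )
    parser.add_argument("image", help="path to the input image")
    parser.add_argument(
        "--tesseract-cmd",
        default=None,
        help="path to the tesseract executable (only needed if it is not on PATH)",
    )
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # In the full script, assign args.tesseract_cmd (when given) to
    # pytesseract.pytesseract.tesseract_cmd and pass args.image to
    # extract_text_from_image(); this sketch only echoes the parsed values.
    print("Image:", args.image)
    print("Tesseract:", args.tesseract_cmd or "(from PATH)")
    return args

# Example call with an explicit argument list; pass nothing to read sys.argv
args = main(["path_to_your_image.jpg"])
```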

Tips for Better Accuracy

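- Experiment with different preprocessing techniques such as denoising, blurring, and edge detection to improve OCR accuracy.
- Ensure the image is clear and the text is legible.
- Adjust the thresholding if necessary. Otsu's method picks a single global threshold, so for unevenly lit images try a fixed threshold (drop cv2.THRESH_OTSU) or cv2.adaptiveThreshold.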

thanhtunguet/sample-ocr
