Python: pytesseract does not recognize Romanian characters when converting PDF files (that contain photocopied images)
me-suzy opened this issue · 1 comments
me-suzy commented
My Python code converts PDF files (that contain photocopied images) into TXT files.
The first problem is that pytesseract does not recognize Romanian characters.
The second problem is that pytesseract does not read the images very well. For example, ABBYY FineReader is a much more powerful reader of PDFs that contain images, such as photocopied books.
import os
import pytesseract
from PIL import Image
from pdf2image import convert_from_path
from PyPDF2 import PdfFileReader
# Path to the folder containing PDF files
input_folder = "d:/doc/doc"
# Path to the folder where text files will be saved
output_folder = "d:/doc/doc"
# Path to the Tesseract OCR executable (change if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Get a list of all PDF files in the input folder
files = [f for f in os.listdir(input_folder) if f.endswith(".pdf")]
# Loop through each PDF file and convert it to text using OCR
for file in files:
    pdf_path = os.path.join(input_folder, file)
    txt_path = os.path.join(output_folder, os.path.splitext(file)[0] + ".txt")
    # Convert PDF pages to images
    images = convert_from_path(pdf_path)
    # Perform OCR on images and extract text
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)
    # Save the extracted text to a text file
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
print("Conversion complete!")
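The script above calls `image_to_string` without a `lang` argument, so Tesseract uses its default language model. That would likely explain the missing Romanian diacritics. A minimal sketch of one possible adjustment follows, assuming the Romanian traineddata (`ron`) is installed alongside Tesseract. The helper names `ocr_images` and `pdf_to_text` are hypothetical, and the 300 dpi rendering value is an assumption that may or may not help with photocopied pages.

```python
def ocr_images(images, lang="ron", ocr=None):
    """OCR a sequence of page images and join the page texts with newlines.

    `ocr` is an optional callable with the same shape as
    pytesseract.image_to_string(image, lang=...); it defaults to that function.
    """
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string
    return "\n".join(ocr(image, lang=lang) for image in images)


def pdf_to_text(pdf_path, lang="ron", dpi=300):
    """Render each PDF page to an image, then OCR it with the given language."""
    from pdf2image import convert_from_path
    # Assumption: a higher rendering resolution gives Tesseract more detail to work with.
    images = convert_from_path(pdf_path, dpi=dpi)
    return ocr_images(images, lang=lang)
```

In the loop above, `text = pdf_to_text(pdf_path)` would then stand in for the inner `for image in images` block.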
zdenop commented
- We do not support 3rd party projects (pytesseract) - replicate the problem with the tesseract executable only.
- Learn how to use tesseract and read the documentation (the issue tracker is not for providing support - use the tesseract forum).
- Provide all files necessary for replicating the problem.
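One way to follow the first point while staying in Python is to call the tesseract executable directly through `subprocess`, bypassing pytesseract. The sketch below is illustrative only: `build_tesseract_cmd` and `run_tesseract` are hypothetical helper names, and it assumes `tesseract` is on the PATH (or that `exe` points to it) and that the `ron` language data is installed.

```python
import subprocess


def build_tesseract_cmd(image_path, lang="ron", exe="tesseract"):
    """Build the command line: tesseract <image> stdout -l <lang>."""
    # "stdout" as the output base makes tesseract print the text instead of writing a file.
    return [exe, str(image_path), "stdout", "-l", lang]


def run_tesseract(image_path, lang="ron", exe="tesseract"):
    """Run the tesseract executable on one image and return the recognized text."""
    cmd = build_tesseract_cmd(image_path, lang=lang, exe=exe)
    result = subprocess.run(cmd, capture_output=True, encoding="utf-8", check=True)
    return result.stdout
```

For example, `print(" ".join(build_tesseract_cmd("page.png")))` shows the exact command to paste into a report, where `page.png` is a placeholder for one page exported from the PDF. Running `tesseract --list-langs` in a terminal lists the installed language data.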