Old-Persian-Cuneiform-OCR

OCR on Old Persian cuneiform (Achaemenid language)


The aim of this repository is to create an OCR model (converting images to text) for Old Persian cuneiform.

This repository is inspired by the eBL project and is part of the Electronic Old Persian Library organization.

eBL has developed models for Babylonian cuneiform; here I am developing my own models for Old Persian cuneiform.

Three OCR models are developed in this repository:

  • yolo_cnn_old_persian
  • tessearct_old_persian
  • easyocr_old_persian

Current status of these 3 OCR models:

  • yolo_cnn_old_persian: not completed yet.
  • tessearct_old_persian: completed.
  • easyocr_old_persian: completed, but needs more optimization and real data.

Melanee: Because my focus was on building an OCR model for the Old Persian language, I have not yet implemented image pre-processing for my models, so they currently work only on black-and-white images. You can run my OCR models on your own images.
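Since the models expect black-and-white input and the repository does not yet include pre-processing, a custom image would need to be binarized first. The following is only a minimal sketch of fixed-threshold binarization, not code from this repository: the function name `binarize`, the threshold of 128, and the representation of an image as rows of 0-255 grayscale values are all assumptions made for illustration.

```python
from typing import List


def binarize(gray_rows: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map every 0-255 grayscale value to pure black (0) or pure white (255).

    `gray_rows` is a hypothetical in-memory image: one list of pixel values per row.
    Pixels at or above `threshold` become white, all others become black.
    """
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in gray_rows
    ]


# A tiny 2x3 "image" used only to show the behaviour.
sample = [
    [12, 200, 130],
    [127, 128, 255],
]
print(binarize(sample))  # [[0, 255, 255], [0, 255, 255]]
```

In practice the pixel rows would come from an imaging library, and a fixed threshold may not suit unevenly lit photographs of inscriptions, so treat the value as a starting point.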

easyocr_old_persian

This model is still under development and is based on the EasyOCR repository's support for custom models. If you run into an error, please check the issues.

Trainer notebook:

https://github.com/Melanee-Melanee/Old-Persian-Cuneiform-OCR/blob/master/easyocr_old_persian/trainer_easyocr_old_persian.ipynb

Using saved model:

https://github.com/Melanee-Melanee/Old-Persian-Cuneiform-OCR/blob/master/easyocr_old_persian/model_easyocr.ipynb

To use the saved model, create the directory structure shown below on your machine and place the custum_example.pth, custom_example.py and custom_example.yaml files in it. For more detail, please watch this tutorial on YouTube.

/root/

 /EasyOCR/
       /model/
           custum_example.pth
       /user_network/
           custom_example.py
           custom_example.yaml
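The layout above can also be created with a short script. This is a sketch under stated assumptions rather than part of the repository: `install_model_files` is a hypothetical helper, and the rule it applies (the `.pth` weights go to `model/`, everything else goes to `user_network/`) is inferred from the tree shown above.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def install_model_files(files: Iterable[Path], easyocr_root: Path) -> List[Path]:
    """Copy model files into <easyocr_root>/model and <easyocr_root>/user_network.

    Hypothetical helper: weights (*.pth) are placed in model/, while the
    network definition (*.py) and its config (*.yaml) go to user_network/.
    """
    model_dir = easyocr_root / "model"
    network_dir = easyocr_root / "user_network"
    model_dir.mkdir(parents=True, exist_ok=True)
    network_dir.mkdir(parents=True, exist_ok=True)

    installed = []
    for src in files:
        src = Path(src)
        target_dir = model_dir if src.suffix == ".pth" else network_dir
        installed.append(Path(shutil.copy2(src, target_dir / src.name)))
    return installed


if __name__ == "__main__":
    # Demo with empty placeholder files in a temporary directory,
    # so nothing on the real machine is touched.
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        sources = []
        for name in ("example.pth", "example.py", "example.yaml"):
            placeholder = tmp_path / name
            placeholder.touch()
            sources.append(placeholder)
        root = tmp_path / "EasyOCR"
        for path in install_model_files(sources, root):
            print(path.relative_to(root))
```

EasyOCR's `Reader` accepts `model_storage_directory`, `user_network_directory` and `recog_network` arguments for pointing at a layout like this one; the notebook linked above remains the reference for how this model is actually loaded.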

tessearct_old_persian:

https://github.com/Melanee-Melanee/Old-Persian-Cuneiform-OCR/blob/main/Tesseract_Old_Persian_OCR.ipynb

Please place the myLang.traineddata file in this directory: /usr/share/tesseract-ocr/4.00/tessdata

This pre-trained Tesseract OCR model converts Old Persian cuneiform into an English transcription and was developed by S. Muhammad Hossein Mousavi. Tesseract is a widely used open-source OCR engine.
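Once the traineddata file is in the tessdata directory, the model can be selected by its language name on the Tesseract command line. The sketch below is an illustration, not the notebook's code: `build_tesseract_command` and `run_ocr` are hypothetical helpers, `input.png` is a placeholder file name, and the language name `myLang` is assumed from the file name above. It relies only on the standard `tesseract IMAGE stdout -l LANG` invocation and the optional `--tessdata-dir` flag.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_command(
    image_path: str,
    lang: str = "myLang",
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a Tesseract run that prints text to stdout."""
    command = ["tesseract"]
    if tessdata_dir is not None:
        # Only needed when the traineddata file is not in the default tessdata directory.
        command += ["--tessdata-dir", str(tessdata_dir)]
    command += [str(image_path), "stdout", "-l", lang]
    return command


def run_ocr(image_path: str, lang: str = "myLang") -> str:
    """Run Tesseract and return the recognized text (requires the tesseract binary)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("The tesseract executable was not found on PATH.")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Show the command that would be executed for a placeholder image.
print(" ".join(build_tesseract_command("input.png")))
```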

An example:

The last 12 lines of Darius the Great's DPd inscription at Persepolis:

Input:

[image: darius2]

Output:

Zatiy ; daryvuS ; xSayZiy;

mna;aurmzda;upstam; blauv;

hda ; ViZibiS ; bgibiS ; uta;

imam;dhyaum;aulmzda;

paTuv;hca;hinaya; hca;

QuSiyala ; hca;druga;abiy;

imam ;dhyaum;ma; ajMiya; ait;

aim ;yanm;jDiyaMiy;

aitmiy ; ddaTuv
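Before passing this output on, it may help to split each line into individual words. The snippet below is a small sketch that assumes the `;` in the output stands for the Old Persian word divider and that surrounding spaces are not significant; `split_words` is a hypothetical helper, not part of the repository.

```python
from typing import List


def split_words(line: str) -> List[str]:
    """Split one transcription line on ';' and drop empty or whitespace-only pieces."""
    return [word.strip() for word in line.split(";") if word.strip()]


# The first two lines of the output above.
print(split_words("Zatiy ; daryvuS ; xSayZiy;"))   # ['Zatiy', 'daryvuS', 'xSayZiy']
print(split_words("mna;aurmzda;upstam; blauv;"))   # ['mna', 'aurmzda', 'upstam', 'blauv']
```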

As a next step, you can translate the Old Persian transcription into modern Persian using ChatGPT:

این منم داریوش شاهنشاه؛ به لطف اهورامزدا، من این را بنا کردم؛ من این امپراتوری را بنیان نهادم و آن را نیرومند ساختم. باشد که اهورامزدا من و پادشاهی مرا محافظت کند؛ باشد که برای همیشه پایدار بماند؛ و باشد که از دروغ در امان باشد؛ این است آنچه من انجام دادم؛

این است آنچه من می‌گویم.

Notice

This repository is still under development. To contribute, contact me by email: melaneepython@gmail.com

When creating pull requests for this repository, please use only these branches: issue, refactor and feature.