Old-Persian-Dataset

Raw dataset for Old Persian cuneiform

This repository is under development.

Dear contributors, please be aware that cuneiform was used to write several different languages, and their scripts are not interchangeable. The best known are Elamite, Babylonian, and Old Persian; this repository covers Old Persian only. The figure below shows the differences:

types of cuneiform

(Photo taken at the National Museum of Iran: the gold plate of King Darius.)

Data structure:

/imagedata/

 /source/
        /king/
           source_king_001.jpg
        
  #example:
  
  /behistun/
       /darius_1/
           behistun_darius_1_001.jpg

/textdata/

  /eng_transcription_to_english/
       /metadata/
       eng_transcription_to_english_001.json
       
  /eng_transliteration_to_english/
       /metadata/
       eng_transliteration_to_english_001.json
       
  /single/
      /metadata/
      /eng_transliteration/
            eng_transliteration_001.json

              
   # "single" refers to text data that has no accompanying translation
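
The image naming convention above can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the helper names `image_path` and `parse_image_path` are hypothetical, and the three-digit zero-padded index and the `.jpg` extension are assumptions drawn only from the example file names shown in the tree.

```python
import re
from pathlib import PurePosixPath

# Assumption: file names end in "_<3-digit index>.jpg", as in the examples above.
_INDEX_RE = re.compile(r"_(\d{3})\.jpg$")


def image_path(source: str, king: str, index: int) -> PurePosixPath:
    """Build imagedata/<source>/<king>/<source>_<king>_<index>.jpg."""
    filename = f"{source}_{king}_{index:03d}.jpg"
    return PurePosixPath("imagedata") / source / king / filename


def parse_image_path(path: str) -> tuple[str, str, int]:
    """Recover (source, king, index) from a path that follows the layout.

    The source and king are read from the directory names rather than the
    file name, because both may themselves contain underscores.
    """
    p = PurePosixPath(path)
    match = _INDEX_RE.search(p.name)
    if match is None:
        raise ValueError(f"unexpected image file name: {p.name}")
    return p.parent.parent.name, p.parent.name, int(match.group(1))


print(image_path("behistun", "darius_1", 1))
# imagedata/behistun/darius_1/behistun_darius_1_001.jpg
```

Reading the source and king from the directories keeps the parser unambiguous for names such as `darius_1`.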

Old Persian cuneiform can be rendered in Latin script in more than one way, for example by transliteration or by transcription. The example below shows the difference between them:

transliteration_transcription

Metadata

Each directory contains a "source.metadata.csv" file that describes the data in that directory.
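
A metadata file can be read with the Python standard library, as in the sketch below. The function name `load_metadata` is hypothetical, and the sketch makes no assumption about the column names: it returns whatever header the CSV provides. UTF-8 encoding is assumed, and the file name is a parameter in case the "source" part varies per directory.

```python
import csv
from pathlib import Path


def load_metadata(directory: str, filename: str = "source.metadata.csv") -> list[dict[str, str]]:
    """Return the rows of a directory's metadata CSV as a list of dicts.

    Each dict maps a column name from the CSV header to that row's value.
    """
    csv_path = Path(directory) / filename
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `load_metadata("imagedata/behistun")` would return one dict per data row, assuming a metadata file exists at that location.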

References

Data pipeline

In the first stage, an OCR model converts Old Persian cuneiform into English transcription text. In the second stage, that transcription text is the input to an NLP model or large language model (LLM), which acts as a machine translation model and converts it into modern languages.

data pipeline
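
The two stages can be wired together as in the sketch below. It only shows the data flow: `run_pipeline`, `recognize`, and `translate` are hypothetical names, and the actual OCR and translation models are passed in as plain callables because this README does not specify them.

```python
from typing import Callable


def run_pipeline(
    image_path: str,
    recognize: Callable[[str], str],
    translate: Callable[[str], str],
) -> dict[str, str]:
    """Run both stages on one image and keep the intermediate text.

    recognize: stage 1, maps an image path to transcription text (OCR).
    translate: stage 2, maps transcription text to a modern language.
    """
    transcription = recognize(image_path)
    translation = translate(transcription)
    return {
        "image": image_path,
        "transcription": transcription,
        "translation": translation,
    }
```

Keeping the transcription in the result makes it possible to check each stage separately against the text data in this repository.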

Glossary

Behistun:بیستون

Susa:شوش

Persepolis:پرسپولیس(تخت جمشید)

Elamite:ایلامی

Babylonian:بابِلی

Cyrus:کوروش

Xerxes:خشایار

Artaxerxes:اردشیر

𐎠𐎢𐎼𐎶𐏀𐎡𐎠:اهورامزدا

LICENSE

This repository is licensed under CC-BY-NC; commercial use is prohibited.