Translating Akkadian signs to transcriptions using NLP techniques such as HMM, MEMM and BiLSTM neural networks.

Akkademia

Akkademia is a tool for automatically transliterating Unicode cuneiform glyphs. It is written in Python and uses hidden Markov models (HMM), maximum-entropy Markov models (MEMM) and BiLSTM neural networks to determine appropriate sign readings and segmentation.

We trained these algorithms on the RINAP corpora (Royal Inscriptions of the Neo-Assyrian Period), which are available here in JSON and XML/TEI formats thanks to the efforts of the Official Inscriptions of the Middle East in Antiquity (OIMEA) Munich Project of Karen Radner and Jamie Novotny, funded by the Alexander von Humboldt Foundation. On these corpora we achieve accuracy rates of 89.5% with HMM, 94% with MEMM, and 96.7% with BiLSTM. The models can also be used on texts from other periods and genres, with varying levels of success.

Getting Started

Akkademia can be accessed in three different ways:

  • Website
  • Python package
  • Github clone

The website and python package are meant to be accessible to people without advanced programming knowledge.

Website

Go to the Babylonian Engine website (under development)

Go to the "Akkademia" tab and follow the instructions there for transliterating your signs.

Python Package

Our python package "akkadian" will enable you to use Akkademia on your local machine.

Prerequisites

You will need Python 3.7.x installed. Our package currently does not work with other versions of Python. You can follow the installation instructions here, or go straight to Python's downloads page and pick an appropriate version.

Macs come with Python 2.7 preinstalled, and it may remain the default version even after you install 3.7.x. To check, type python --version into Terminal. If the running version is Python 2.7, the simplest short-term solution is to type python3 and pip3 in Terminal instead of python and pip throughout the instructions below.
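The same check can be made from inside Python, which is useful when you are not sure which interpreter an IDE or notebook is using. This is a minimal sketch; the helper name is_supported_python is hypothetical and not part of the akkadian package.

```python
import sys


def is_supported_python(version_info=sys.version_info):
    """Return True for Python 3.7.x, the only series this README lists as supported."""
    return tuple(version_info[:2]) == (3, 7)


# Show the version of the interpreter that is actually running this code.
print("Running Python", sys.version.split()[0])
if not is_supported_python():
    print("Warning: the akkadian package is documented to work with Python 3.7.x only.")
```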

Package Installation

You can install the package with pip install. If you do not have pip on your computer, or are not sure whether it is installed, follow the instructions here.

Before installing the package akkadian, you will need to install the torch package. For Windows, copy the following into Command Prompt (CMD):

pip install torch==1.0.0 torchvision==0.2.1 -f https://download.pytorch.org/whl/torch_stable.html

For Mac and Linux copy the following into Terminal:

pip install torch torchvision

Then, type the following in Command Prompt (Windows), or Terminal (Mac and Linux):

pip install akkadian

The installation will then run. This may take several minutes.
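To confirm that the installation succeeded, you can ask Python whether it can find the packages named above. This is a sketch using only the standard library; the helper name missing_packages is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package names taken from the installation steps above.
required = ["torch", "torchvision", "akkadian"]
missing = missing_packages(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages were found.")
```

find_spec only locates a package without importing it, so this check is quick and does not load any models.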

Running

Open a Python IDE (integrated development environment) in which Python code can be run. There are many IDEs to choose from; see realpython's guide or wiki python's list. For beginners, we recommend Jupyter Notebook: see the download instructions here, or the download instructions and beginners' tutorial here.

First, import akkadian.transliterate into your coding environment:

import akkadian.transliterate as akk

Then, you can use HMM, MEMM, or BiLSTM to transliterate the signs. The functions are:

akk.transliterate_hmm("Unicode_signs_here")
akk.transliterate_memm("Unicode_signs_here")
akk.transliterate_bilstm("Unicode_signs_here")
akk.transliterate_bilstm_top3("Unicode_signs_here")

akk.transliterate_bilstm_top3 gives the top three BiLSTM options, while akk.transliterate_bilstm gives only the top one.
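Because the three single-result functions take the same kind of argument, they can be run side by side on one input. The sketch below assumes only the function names listed above; the helper transliterate_with_all is hypothetical, and the import is guarded so the snippet does nothing harmful when the package is not installed.

```python
def transliterate_with_all(signs, models):
    """Apply every function in `models` to the same signs and return {name: output}."""
    return {name: func(signs) for name, func in models.items()}


try:
    import akkadian.transliterate as akk
except ImportError:
    akk = None  # package not installed; the helper still works with other callables

if akk is not None:
    models = {
        "HMM": akk.transliterate_hmm,
        "MEMM": akk.transliterate_memm,
        "BiLSTM": akk.transliterate_bilstm,
    }
    results = transliterate_with_all("𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾", models)
    for name, output in results.items():
        print(f"{name}: {output}")
```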

To display the results immediately, wrap any of the functions above in print(). Here are some examples with their output:

print(akk.transliterate_hmm("𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾"))
ša₂ nak-ba-i-mu-ru iš-di-ma-a-ti
print(akk.transliterate_memm("𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾"))
ša₂ SILIM ba-i-mu-ru-iš-di-ma-a-ti
print(akk.transliterate_bilstm("𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾"))
ša₂ nak-ba-i-mu-ru iš-di-ma-a-ti
print(akk.transliterate_bilstm_top3("𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾"))
('ša₂ nak-ba-i-mu-ru iš-di-ma-a-ti ', 'ša₂-di-ba i mu ru-iš di ma tukul-tu ', 'MUN kis BA še-MU-šub-šah-ṭi-nab-nu-ti-')

The signs are the first line of the Epic of Gilgamesh: ša₂ naq-ba i-mu-ru iš-di ma-a-ti, "He who saw the Deep, the foundation of the country" (George, A.R. 2003. The Babylonian Gilgamesh Epic: Introduction, Critical Edition and Cuneiform Texts. 2 vols. Oxford: Oxford University Press). Although the algorithms were not trained on this text genre, they show promising, useful results.
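One rough way to quantify "promising" is to count how many sign readings in a model's output match the published transliteration, ignoring word segmentation. The sketch below compares the HMM output quoted above with George's reading. It is an illustration only and is not the accuracy measure used by the project; the helper names are hypothetical.

```python
import re


def sign_readings(transliteration):
    """Split a transliteration into sign readings, treating spaces and hyphens alike."""
    return [r for r in re.split(r"[\s-]+", transliteration.strip()) if r]


def reading_agreement(predicted, reference):
    """Return (matching readings, total readings) for two equally long sequences."""
    pred, ref = sign_readings(predicted), sign_readings(reference)
    if len(pred) != len(ref):
        raise ValueError("the two transliterations contain different numbers of signs")
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches, len(ref)


hmm_output = "ša₂ nak-ba-i-mu-ru iš-di-ma-a-ti"
reference = "ša₂ naq-ba i-mu-ru iš-di ma-a-ti"
matches, total = reading_agreement(hmm_output, reference)
print(f"{matches}/{total} sign readings agree")  # prints: 10/11 sign readings agree
```

The single mismatch is nak versus naq. The remaining differences between the two lines are in word segmentation, which this count deliberately ignores.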

Github

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

You will need Python 3.7.x installed. Our package currently does not work with other versions of Python. Go to Python's downloads page and pick an appropriate version.

If you don't have git installed, install git here (Choose the appropriate operating system).

If you don't have a GitHub account, create one here.

Installing the python dependencies

In order to run the code, you will need the torch and allennlp libraries. If you have already installed the package akkadian, these were installed on your computer and you can skip to the next step.

Install torch. For Windows, copy the following into Command Prompt:

pip install torch===1.3.1 torchvision===0.4.2 -f https://download.pytorch.org/whl/torch_stable.html

For Mac and Linux, copy the following into Terminal:

pip install torch torchvision

Install allennlp. Copy the following into Command Prompt (Windows) or Terminal (Mac and Linux):

pip install allennlp==0.8.5

Cloning the project

Copy the following into Command Prompt (Windows) or Terminal (Mac and Linux) to clone the project:

git clone https://github.com/gaigutherz/Akkademia.git

Running

Now you can develop the Akkademia repository and add your improvements!

Training

Use train.py to train the models on the datasets. For each model there is a function that trains it, stores the result as a pickle, and tests its performance on the given corpora.

The functions are as follows:

hmm_train_and_test(corpora)
memm_train_and_test(corpora)
biLSTM_train_and_test(corpora)

Transliterating

Use transliterate.py to transliterate with the trained models. For each model there is a function that takes Unicode cuneiform signs as its parameter and returns their transliteration.

Example of usage:

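# Assumes the code is run from the repository root, so that akkadian/transliterate.py is importable.
from akkadian.transliterate import (transliterate, transliterate_bilstm,
                                    transliterate_bilstm_top3, transliterate_hmm,
                                    transliterate_memm)

cuneiform_signs = "𒃻𒅘𒁀𒄿𒈬𒊒𒅖𒁲𒈠𒀀𒋾"
print(transliterate(cuneiform_signs))
print(transliterate_bilstm(cuneiform_signs))
print(transliterate_bilstm_top3(cuneiform_signs))
print(transliterate_hmm(cuneiform_signs))
print(transliterate_memm(cuneiform_signs))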

Datasets

For training the algorithms, we used the RINAP corpora (Royal Inscriptions of the Neo-Assyrian Period), which are available in JSON and XML/TEI formats thanks to the efforts of the Humboldt Foundation-funded Official Inscriptions of the Middle East in Antiquity (OIMEA) Munich Project led by Karen Radner and Jamie Novotny, available here. The current output in our website, package and code is based on training done on these corpora alone.

For additional future training, we added the following corpora (in JSON file format) to the repository:

  • RIAO
  • RIBO
  • SAAO
  • SUHU

These corpora were all prepared by the Munich Open-access Cuneiform Corpus Initiative (MOCCI) and OIMEA project teams, both led by Karen Radner and Jamie Novotny. They can be downloaded in full in JSON or XML/TEI format from their respective project webpages (see the left side-panel on each project webpage and look for the project-name downloads).

We also included a separate dataset which includes all the corpora in XML/TEI format.

Datasets deployment

All the datasets are taken from their respective project webpages (see the left side-panel on each project webpage and look for the project-name downloads) and are fully accessible from there.

In our repository the datasets are located in the "raw_data" directory. They can also be downloaded from the Github repository using git clone or zip download.
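After cloning, a quick way to see which corpus files are present is to walk the raw_data directory. This sketch assumes it is run from the repository root and that the corpora are stored as .json files somewhere under raw_data; the exact sub-directory layout is not described here, so the search is recursive. The helper name list_corpus_files is hypothetical.

```python
from pathlib import Path


def list_corpus_files(raw_data_dir, suffix=".json"):
    """Return every file under raw_data_dir with the given suffix, sorted by path."""
    root = Path(raw_data_dir)
    if not root.is_dir():
        return []  # directory missing, e.g. when not run from the repository root
    return sorted(p for p in root.rglob(f"*{suffix}") if p.is_file())


# "raw_data" is the directory name given above; adjust the path if you run this elsewhere.
for path in list_corpus_files("raw_data"):
    print(path)
```

Passing suffix=".xml" would list XML files instead, assuming the XML/TEI copies use that extension.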

Project structure

BiLSTM_input:

Contains dictionaries used for transliteration by BiLSTM.

NMT_input:

Contains dictionaries used for neural machine translation.

akkadian.egg-info:

Information and settings for akkadian python package.

akkadian:

Source code and training output.

output: Training output for HMM, MEMM and BiLSTM, mostly pickles.
	
__init__.py: Init script for akkadian python package. Initializes global variables.

bilstm.py: Class for BiLSTM training and prediction using the AllenNLP implementation.

build_data.py: Code for organizing the data in dictionaries.

check_translation.py: Code for checking translation accuracy.

combine_algorithms.py: Code for prediction that combines HMM, MEMM and BiLSTM.

data.py: Utilities for accuracy checks and dictionary interpretation.

full_translation_build_data.py: Code for organizing the data for the full translation task.

get_texts_details.py: Utility for getting more information about a text.

hmm.py: Implementation of HMM training and prediction.

memm.py: Implementation of MEMM training and prediction.

parse_json.py: JSON parsing used for organizing the data.

parse_xml.py: XML parsing used for organizing the data.

train.py: API for training all 3 algorithms and storing the output.

translation_tokenize.py: Code for tokenization in the translation task.

transliterate.py: API for transliterating using all 3 algorithms.

build/lib/akkadian:

Information and settings for akkadian python package.

dist:

Akkadian python package - wheel and tar.

raw_data:

Databases used for training the models:

RINAP 1, 3-5

Additional databases for future training:
	
RIAO
	
RIBO
	
SAAO
	
SUHU
	
Miscellanea:

tei - the same databases (RINAP, RIAO, RIBO, SAAO, SUHU) in XML/TEI format.

random - 4 texts used for testing texts outside of the training corpora. They were randomly selected from RIAO and RIBO.

Licensing

This repository is made freely available under the Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license. This means you are free to share and adapt the code and datasets, under the conditions that you cite the project appropriately, note any changes you have made to the original code and datasets, and if you are redistributing the project or a part thereof, you must release it under the same license or a similar one.

For more information about the license, see here.

Issues and Bugs

If you are experiencing any issues with the website, the python package akkadian or the git repository, please contact us at dhl.arieluni@gmail.com, and we would gladly assist you. We would also much appreciate feedback about using the code via the website or the python package, or about the repository itself, so please send us any comments or suggestions.

Authors

  • Gai Gutherz
  • Ariel Elazary
  • Avital Romach
  • Shai Gordin

This research was supported by the Ministry of Science & Technology, Israel.