This is an English to Assyrian/Eastern Syriac machine translation model. It uses an English to Arabic model as its base model. The source code is well documented and was written so that it can be read by inexperienced developers.
Although the aim of the project is to build an English to Assyrian translator, specifically for the dialects that fall under Northeastern Neo-Aramaic, the current model mostly provides translations into Classical Syriac. This model is a good initial step, but I hope future work will bring it more in line with Assyrian dialects.
Please note that Assyrian and Eastern Syriac are used interchangeably.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
# Load the tokenizer and the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mt-empty/english-assyrian")
model = AutoModelForSeq2SeqLM.from_pretrained("mt-empty/english-assyrian")
# Build a translation pipeline and translate a short phrase
translator = pipeline("translation", model=model, tokenizer=tokenizer)
print("tomorrow morning", translator("tomorrow morning"))
test.py and test.ipynb contain examples of how to use the translation pipeline.
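The translation pipeline returns a list with one dictionary per input, and the translated string is stored under the `translation_text` key. A minimal sketch of pulling the plain strings out of that structure follows. `extract_translations` is a hypothetical helper, not part of this repository, and the sample value is a placeholder rather than real model output.

```python
from typing import Dict, List


def extract_translations(outputs: List[Dict[str, str]]) -> List[str]:
    """Return only the translated strings from translation-pipeline output."""
    return [item["translation_text"] for item in outputs]


# Placeholder standing in for the result of translator(["tomorrow morning"]);
# a real call returns the same shape with the model's translated text.
sample_outputs = [{"translation_text": "<translated text>"}]
print(extract_translations(sample_outputs))
```

With the real pipeline, this could be called as `extract_translations(translator(["tomorrow morning", "good night"]))`, since the pipeline also accepts a list of input strings.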
The datasets are sourced from:
SentencePiece was used for tokenization.
SacreBLEU was used for evaluation. Training the model for 50 epochs produced a score of 33.
Please make sure you have installed all the required dependencies, then run
python model.py
It will train for 50 epochs; this can be changed in the code.
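One way to avoid editing the code each time is to read the epoch count from the command line. The sketch below is an assumption about how that could look. The `--epochs` flag and the `parse_args` helper are hypothetical and are not part of the current `model.py`.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse training options; --epochs defaults to the 50 used by this project."""
    parser = argparse.ArgumentParser(description="Train the translation model.")
    parser.add_argument(
        "--epochs",
        type=int,
        default=50,
        help="number of training epochs (hypothetical flag)",
    )
    return parser.parse_args(argv)


# An empty argument list means "use the defaults".
args = parse_args([])
print("Training for", args.epochs, "epochs")
```

If `model.py` were adapted this way, a shorter run could be started with something like `python model.py --epochs 10`.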
This project utilizes pre-commit hooks, so please run the following before submitting a pull request:
- Install requirements,
pip install -r requirements.txt
- Configure pre-commit hooks,
pre-commit install
- (Optional) Run hooks manually,
pre-commit run --all-files
- Submit a pull request
datasets
transformers
sentencepiece
pandas
pytorch
sacrebleu
Please install the appropriate version of PyTorch for your machine. CUDA
is needed if you want to train on a GPU.
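A quick way to confirm whether the installed PyTorch build can see a GPU is to ask it directly. The snippet below is a small sketch using `torch.cuda.is_available()`. `pick_device` is a hypothetical helper and is not part of this repository.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet, so nothing can run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Training would run on:", pick_device())
```

If this prints `cpu` on a machine with an NVIDIA GPU, the installed PyTorch build is likely a CPU-only one or does not match the local CUDA setup.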