Machine Learning for Ancient Languages

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

Machine Learning for Ancient Languages: A Survey

Thea Sommerschield*, Yannis Assael*, John Pavlopoulos*, Vanessa Stefanak, Andrew Senior, Chris Dyer, John Bodel, Jonathan Prag, Ion Androutsopoulos, Nando de Freitas

Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in Artificial Intelligence and Machine Learning have enabled analyses on a scale and in a detail that are reshaping the field of Humanities, similarly to how microscopes and telescopes have contributed to the realm of Science.

This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script and medium, spanning over three and a half millennia of civilisations around the ancient world. To analyse the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitisation, restoration, attribution, linguistic analysis, textual criticism, translation and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the Humanities and Machine Learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, flagging promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the Humanities and Machine Learning.

* T.S., Y.A., J.P. contributed equally to this work.

This repository

This repository serves as a platform to host the taxonomy of the research works we have reviewed, as well as to maintain an up-to-date catalogue of active interdisciplinary Machine Learning projects focused on ancient languages.

Pull requests are highly encouraged. If you are working on or have come across new publications in the field, please feel free to submit a pull request to update the repository, and help us maintain a vibrant and useful catalogue for the community!

Please note that this repository only includes machine learning research for ancient texts written in ancient languages:

  • We consider all languages in use across the world, written on any medium and in any script, from the birth of writing systems in ancient Mesopotamia (3400 BCE) until the conventional end of 'ancient history' in the late first millennium CE. This remit therefore excludes, for example, modern printed texts in ancient languages and medieval scribal cultures.
  • Owing to space limitations, we do not consider works that did not employ machine learning models, nor those published before the 2000s.
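
For contributors preparing a pull request, the inclusion criteria above can be read as a simple check. The following is a minimal sketch, not part of this repository: the `Entry` record, its field names, and the `in_scope` function are hypothetical, and the upper bound of 1000 CE is an assumed stand-in for "late first millennium CE", which the criteria do not pin to an exact year.

```python
from dataclasses import dataclass

# Bounds taken from the criteria above. Years BCE are written as negative numbers.
EARLIEST_TEXT_YEAR = -3400   # birth of writing systems in ancient Mesopotamia
LATEST_TEXT_YEAR = 1000      # ASSUMPTION: stand-in for "late first millennium CE"
MIN_PUBLICATION_YEAR = 2000  # works published before the 2000s are not considered


@dataclass
class Entry:
    """Hypothetical record describing one candidate publication."""
    title: str
    text_year: int               # approximate date of the ancient text studied
    uses_machine_learning: bool  # whether the work employs a machine learning model
    publication_year: int        # year the research itself was published


def in_scope(entry: Entry) -> bool:
    """Return True if the entry satisfies all three inclusion criteria."""
    return (
        EARLIEST_TEXT_YEAR <= entry.text_year <= LATEST_TEXT_YEAR
        and entry.uses_machine_learning
        and entry.publication_year >= MIN_PUBLICATION_YEAR
    )


if __name__ == "__main__":
    candidate = Entry(
        title="Example study of a damaged inscription",
        text_year=-450,
        uses_machine_learning=True,
        publication_year=2019,
    )
    print(in_scope(candidate))
```

Borderline cases, such as texts dated near the end of the first millennium CE, still call for scholarly judgement rather than a fixed numeric cutoff.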

Citation

@article{sommerschield2023machine,
  author = {Sommerschield*, Thea and Assael*, Yannis and Pavlopoulos*, John and Stefanak, Vanessa and Senior, Andrew and Dyer, Chris and Bodel, John and Prag, Jonathan and Androutsopoulos, Ion and de Freitas, Nando},
  title = {Machine Learning for Ancient Languages: A Survey},
  journal = {Computational Linguistics},
  pages = {1-44},
  year = {2023},
  month = {05},
  issn = {0891-2017},
  doi = {10.1162/coli_a_00481}
}

License

Machine Learning for Ancient Languages by Thea Sommerschield*, Yannis Assael*, John Pavlopoulos*, Vanessa Stefanak, Andrew Senior, Chris Dyer, John Bodel, Jonathan Prag, Ion Androutsopoulos, Nando de Freitas is licensed under CC BY-SA 4.0