ETL repo for ancient Greek texts

LAGT: Lemmatized Ancient Greek Texts

Citation

Vojtěch Kaše. (2024). LAGT (v3.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10684841

Authors

(With contributions from many collaborators and colleagues: Tomáš Glomb, Vojtěch Linka, Viktor Zavřel, Nina Nikki, Zdeňka Špiclová etc.)

Description

This repository serves for the extraction, merging, cleaning, morphological analysis and aggregation of publicly available ancient Greek texts accessible via two GitHub repositories:

Concerning lemmatization, the dataset contains lemmatized sentences in the form of a list of lists, where each sublist stands for one sentence and its elements are the individual lemmata. Only nouns, proper names, verbs and adjectives are included. Wherever available, the lemmata are based on existing treebank data, such as the GLAUx corpus (see below). Where not, the GreCy model for spaCy is employed for automatic annotation.
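The list-of-lists layout can be illustrated with a small sketch. The lemmata below are a toy value invented for illustration, not an excerpt from the dataset, and the variable names are assumptions; the point is only how sentences nest and how they can be flattened and counted with the standard library.

```python
from collections import Counter
from itertools import chain

# Toy value invented for illustration: one work = a list of sentences,
# each sentence = a list of lemmata (nouns, proper names, verbs, adjectives).
lemmatized_sentences = [
    ["μῆνις", "ἀείδω", "θεά", "Ἀχιλλεύς"],
    ["θεά", "πολύς", "ψυχή"],
]

# Flatten the sentences into a single sequence of lemmata.
all_lemmata = list(chain.from_iterable(lemmatized_sentences))

# Count how often each lemma occurs in the work.
lemma_counts = Counter(all_lemmata)

# Keep only the sentences that contain a given lemma.
target = lemmatized_sentences[0][2]
matching_sentences = [s for s in lemmatized_sentences if target in s]

print(len(all_lemmata), lemma_counts.most_common(1), len(matching_sentences))
```

Keeping the sentence boundaries makes co-occurrence queries like the last one possible, while flattening is enough for plain frequency counts.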

Version v3.0 includes 1,710 works by more than 325 authors, covering 32,323,612 tokens of raw text. It covers only works dated between the 8th c. BCE and the 6th c. CE.

LAGT delivers all the data and metadata as one large tabular object, available for download as either a JSON or a Parquet file, which can be loaded directly into a Python environment as a dataframe using the pandas library.
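Loading might look like the following minimal sketch. The helper name `load_lagt` and the file name in the usage comment are hypothetical, and the sketch assumes the JSON export is in a layout that `pandas.read_json` reads with its default settings; the Parquet branch additionally needs a Parquet engine such as pyarrow to be installed.

```python
from pathlib import Path

import pandas as pd


def load_lagt(path):
    """Load a downloaded LAGT file into a DataFrame, picking the reader by extension.

    Hypothetical helper, not part of the repository.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".parquet":
        # Requires a Parquet engine (e.g. pyarrow) in the environment.
        return pd.read_parquet(path)
    if suffix == ".json":
        # Assumes a JSON layout that read_json handles with default settings.
        return pd.read_json(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")


# Usage (the file name is a placeholder for the file you downloaded):
# LAGT = load_lagt("LAGT.parquet")
# print(LAGT.shape)
```

Parquet is usually the more convenient choice for a table of this size, since it preserves column types and loads faster than JSON.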

Individual works are represented by rows, and columns represent attributes: the author ID (“author_id”, e.g. “tlg0086”) and document ID (“doc_id”, e.g. “tlg010”) inherited from the source corpora, the date of creation expressed as an interval (“not_before” and “not_after”), manually annotated religious provenience as either pagan, Jewish or Christian (“provenience”), and others. These attributes allow various forms of sorting and filtering. The dating is coded at the level of authors, with the precision of centuries.
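The kind of filtering these attributes allow can be sketched on a toy dataframe. The rows below are invented for illustration, and the sketch assumes that the dating columns hold numeric years with negative values for BCE and that provenience is stored as lowercase strings; check the actual column values before reusing it.

```python
import pandas as pd

# Toy stand-in for the LAGT dataframe; all rows and values are invented.
works = pd.DataFrame(
    {
        "doc_id": ["doc_a", "doc_b", "doc_c"],
        "not_before": [-800, -400, 100],
        "not_after": [-700, -300, 200],
        "provenience": ["pagan", "pagan", "christian"],
    }
)


def overlapping(df, start, end):
    """Return rows whose [not_before, not_after] interval overlaps [start, end]."""
    mask = (df["not_before"] <= end) & (df["not_after"] >= start)
    return df[mask]


# Works whose dating interval touches the 5th or 4th c. BCE (assumed -500 to -301).
classical = overlapping(works, -500, -301)

# Works annotated as Christian (assumed label spelling).
christian = works[works["provenience"] == "christian"]

print(classical["doc_id"].tolist(), christian["doc_id"].tolist())
```

An overlap test is used rather than strict containment because the dating is an interval with century precision, so a work may only partly fall within the period of interest.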


Lemmatization v3.0

The lemmata for individual documents come from several sources. See the column "lemmata_source":

Software

  1. Python 3
  2. JupyterLab, JupyterHub or Jupyter Notebook (for the Jupyter notebook files)

License

CC-BY-SA 4.0, see attached License.md

Footnotes

  1. Gorman, V. B. (2020). Dependency Treebanks of Ancient Greek Prose. Journal of Open Humanities Data, 6(1), 1. https://doi.org/10.5334/johd.13

  2. Keersmaekers, A. (2021). The GLAUx corpus: Methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek. Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021, 39–50. https://doi.org/10.18653/v1/2021.lchange-1.6