Cite this corpus using the following:
@software{thibault_clerice_2020_corpus_latin,
author = {Clérice, Thibault},
title = {Corpus Latin antiquité et antiquité tardive lemmatisé},
month = dec,
year = 2020,
publisher = {Zenodo},
version = {0.0.1},
doi = {10.5281/zenodo.4337146},
url = {https://doi.org/10.5281/zenodo.4337146}
}
This corpus contains the whole set of Capitains-compliant classical and late Latin texts currently available. The latest version of the corpus is based on the following corpora:
Name | Version | Project you need to cite |
---|---|---|
PerseusDL/canonical-latinLit | 0.0.557 | https://www.perseus.org |
OpenGreekAndLatin/csel-dev | 1.0.67 | https://www.perseus.org |
OpenGreekAndLatin/Latin | v1.10.0 | https://www.perseus.org |
ponteineptique/digiliblt | 0.0.32 | https://digiliblt.uniupo.it |
lascivaroma/priapeia | 1.1.18 | Lasciva Roma |
lascivaroma/additional-texts | 1.0.115 | Lasciva Roma |
The texts are distributed under the same licence as the originals; the annotations are CC-BY-SA 4.0.
They were tagged with the Pie-Extended LASLA model, using the 0.0.5a LASLA+ model (trained with the aligned PROIEL Vulgate as well as the Priapea and a Late Latin corpus to be published soon).
Note: the model is currently being fine-tuned in the context of my PhD. I will fill in this part once that is done.
- Enclitics duplicate the whole token (e.g. -que is not separated from its host). The duplicates are identifiable because their form starts with `{` and ends with `}`. Example:
<w lemma="breuis" msd="Case=Abl|Numb=Plur|Gend=Com|Deg=Pos" n="1.18" pos="ADJqua" rend="section">breuibusque</w>
<w lemma="que" msd="MORPH=empty" n="1.18" pos="CON" rend="section">{breuibusque}</w>
- Roman numerals are lemmatized as Arabic numerals.
- The model is highly likely to produce wrong annotations for malformed tokens such as `7AP`.
- Tokens ending with `?` are known to need disambiguation, but disambiguation was not possible (there seems to have been an issue with some of the automatic ones in this version of the corpus).
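The brace convention for enclitics can be used to separate host tokens from their duplicates when reading the files. The following is a minimal sketch, not part of the corpus tooling: the wrapping `<ab>` root, the `SAMPLE` string and the helper names are assumptions made for illustration, and element names are matched on their local name in case the real files declare a TEI namespace.

```python
import xml.etree.ElementTree as ET

# The two <w> tokens come from the example above; the surrounding <ab>
# element is a hypothetical wrapper so that the sample is well-formed.
SAMPLE = """<ab>
<w lemma="breuis" msd="Case=Abl|Numb=Plur|Gend=Com|Deg=Pos" n="1.18" pos="ADJqua" rend="section">breuibusque</w>
<w lemma="que" msd="MORPH=empty" n="1.18" pos="CON" rend="section">{breuibusque}</w>
</ab>"""


def local_name(element):
    """Return the tag without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def is_enclitic_duplicate(form):
    """True when the form is wrapped in braces, i.e. it repeats the host token."""
    return len(form) > 2 and form.startswith("{") and form.endswith("}")


def split_tokens(root):
    """Separate host tokens from enclitic duplicates, keeping document order."""
    hosts, enclitics = [], []
    for element in root.iter():
        if local_name(element) != "w":
            continue
        form = (element.text or "").strip()
        target = enclitics if is_enclitic_duplicate(form) else hosts
        target.append((form, element.get("lemma")))
    return hosts, enclitics


hosts, enclitics = split_tokens(ET.fromstring(SAMPLE))
```

Dropping the `enclitics` list gives one entry per surface token, while keeping it preserves the lemma of the enclitic itself.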
XML files are TEI-compliant (normally) and the text is separated into passages, their type being provided at the `ab` level.
<text n="urn:cts:latinLit:phi0119.phi009.perseus-lat2">
<body>
<ab n="urn:cts:latinLit:phi0119.phi009.perseus-lat2:1" type="line">
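As a rough illustration of how the passage structure might be read, the sketch below lists each `ab` element with its identifier and type. It assumes that the `n` attribute of an `ab` is the text identifier followed by `:` and the passage reference, as in the excerpt above; the closing tags in `SAMPLE` and the helper names are added for illustration only.

```python
import xml.etree.ElementTree as ET

# Excerpt from above, closed with hypothetical end tags so that it parses.
SAMPLE = """<text n="urn:cts:latinLit:phi0119.phi009.perseus-lat2">
<body>
<ab n="urn:cts:latinLit:phi0119.phi009.perseus-lat2:1" type="line"></ab>
</body>
</text>"""


def local_name(element):
    """Return the tag without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def list_passages(root):
    """Collect the text identifier, passage reference and type of every <ab>."""
    passages = []
    for element in root.iter():
        if local_name(element) != "ab":
            continue
        # Split on the last colon: everything before it identifies the text,
        # everything after it is the passage reference.
        text_id, _, reference = element.get("n", "").rpartition(":")
        passages.append(
            {"text": text_id, "reference": reference, "type": element.get("type")}
        )
    return passages


passages = list_passages(ET.fromstring(SAMPLE))
```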
Tokens use the standard TEI annotation attributes `@pos`, `@msd` and `@lemma`:
<w lemma="gaudeo" msd="Numb=Sing|Mood=Ind|Tense=Pres|Voice=Act|Person=1" n="7" pos="VER" rend="line">gaudeo</w>
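The `@msd` value is a `|`-separated list of `Feature=Value` pairs, so it can be turned into a dictionary with a few lines of code. This is only a sketch under that assumption; `parse_msd` and `read_token` are hypothetical helper names, not functions shipped with the corpus.

```python
import xml.etree.ElementTree as ET

# Token copied from the example above.
TOKEN = (
    '<w lemma="gaudeo" msd="Numb=Sing|Mood=Ind|Tense=Pres|Voice=Act|Person=1" '
    'n="7" pos="VER" rend="line">gaudeo</w>'
)


def parse_msd(msd):
    """Turn 'Feature=Value|Feature=Value' into a dictionary."""
    features = {}
    for pair in msd.split("|"):
        key, separator, value = pair.partition("=")
        if separator:  # skip fragments that carry no '='
            features[key] = value
    return features


def read_token(w):
    """Gather the form and the three annotation attributes of a <w> element."""
    return {
        "form": (w.text or "").strip(),
        "lemma": w.get("lemma"),
        "pos": w.get("pos"),
        "msd": parse_msd(w.get("msd", "")),
    }


token = read_token(ET.fromstring(TOKEN))
```

With this representation, a value such as `MORPH=empty` becomes `{"MORPH": "empty"}`, which makes tokens without morphology easy to filter.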