Repo for processing LGPSI. For official content, see seumasjeltzz/LinguaeGraecaePerSeIllustrata.
- Seumas Macdonald
- James Tauber
So far, the only data directories being manually modified are:
- orig (the original files from Seumas)
- manual-data (data needed for processing, e.g. lemma overrides from Seumas)
The main output directories are:
- text (the processed text in GLTP format)
- analysis (further analysis of the text in GLTP format)
Other directories are:
- cache (for storing the Morpheus cache)
- config (for storing configuration, e.g. for text validation)
- scripts (where all the code lives)
The scripts are run in this order (after dependencies in the Pipfile are installed):
- ./scripts/orig-to-para.py converts from orig files to para files under text
- ./scripts/para-to-sent.py converts those para files to sentence-based sent files
- ./scripts/add-norm.py produces the norm files in analysis from the sent files
- ./scripts/lemmatise.py produces the lemma files in analysis from the norm files using Morpheus and manual-data/lemma_overrides.yaml
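
The run order above could be wrapped in a small driver. The following is only a sketch, not part of the repo: it assumes the scripts take no arguments, are run from the repo root, and work with the current Python interpreter (e.g. inside the Pipfile environment). The names PIPELINE and run_pipeline are made up for illustration.

```python
import subprocess
import sys
from pathlib import Path

# Script order as listed in this README.
PIPELINE = [
    "scripts/orig-to-para.py",
    "scripts/para-to-sent.py",
    "scripts/add-norm.py",
    "scripts/lemmatise.py",
]


def run_pipeline(repo_root=".", runner=subprocess.run):
    """Run each step in order, stopping at the first failure.

    `runner` is injectable (hypothetical convenience) so the sequencing
    can be checked without executing the real scripts.
    """
    root = Path(repo_root)
    for script in PIPELINE:
        # The script path is relative to `cwd`, i.e. the repo root.
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later steps never run on the output of a failed step.
        runner([sys.executable, script], cwd=root, check=True)
```

Calling run_pipeline() from the repo root would then run all four steps in sequence; stopping on the first error matters here because each step reads the files written by the previous one.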
The following are modules not called from the command line:
- morpheus.py (Morphology API client)
- utils.py (common functions shared between scripts)
Other scripts include:
- sort-yaml.py <filename> sorts the given YAML file with top-level keys in alphabetical order
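
For illustration, sorting only the top-level keys of a YAML file might look like the sketch below. This is not the actual sort-yaml.py: it assumes PyYAML is available, and the function names sort_top_level and sort_yaml_file are hypothetical.

```python
def sort_top_level(data):
    """Return a copy of a mapping with its top-level keys in sorted order."""
    # dicts preserve insertion order, so rebuilding from the sorted items
    # is enough; nested mappings keep whatever order they already had.
    return dict(sorted(data.items()))


def sort_yaml_file(filename):
    """Rewrite `filename` in place with its top-level keys sorted."""
    import yaml  # PyYAML (assumed dependency), imported lazily

    with open(filename, encoding="utf-8") as f:
        data = yaml.safe_load(f)
    with open(filename, "w", encoding="utf-8") as f:
        # sort_keys=False keeps the ordering built above instead of letting
        # PyYAML re-sort every nesting level; allow_unicode=True writes
        # Greek characters as-is rather than as escape sequences.
        yaml.safe_dump(sort_top_level(data), f, sort_keys=False, allow_unicode=True)
```

Note that sorted() compares strings by code point, so accented or capitalised Greek keys may not come out in dictionary order. Round-tripping through safe_load also drops any comments in the file.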
The content is CC-BY-SA and the code is MIT.