
Extraction of ALTO 4 data for transformation into TEI P5 encoding

Primary language: HTML. License: MIT.

TEI transformation


A CLI that converts OCR data in XML-ALTO format to TEI P5 encoding, following a specific SegmOnto ontology.

It builds on and modifies the ALTO2TEI project developed by Kelly Christensen as part of the Gallic(orpor)a project.

The pipeline works as follows: order and group the files within input/ -> parse the ALTO files -> build the <teiHeader> (with database/) -> build the <sourceDesc> -> build the <text> according to the ontology
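The ALTO parsing step can be illustrated with a minimal sketch. It is not the code in src/opt/extract_alto.py: the function name `extract_lines` and the sample document are hypothetical, and it assumes ALTO 4 files in which zones carry a SegmOnto label through `TAGREFS` pointing at an `OtherTag` entry.

```python
import xml.etree.ElementTree as ET

# Namespace of the ALTO 4 schema.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical, hand-written sample; real files come from the OCR engine.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT1" LABEL="MainZone"/>
    <OtherTag ID="LT1" LABEL="DefaultLine"/>
  </Tags>
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1" TAGREFS="BT1">
          <TextLine ID="l1" TAGREFS="LT1">
            <String CONTENT="first line"/>
          </TextLine>
          <TextLine ID="l2" TAGREFS="LT1">
            <String CONTENT="second line"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def extract_lines(alto_xml):
    """Return (zone_label, text) pairs for every text line, in document order."""
    root = ET.fromstring(alto_xml)
    # Map tag IDs (e.g. "BT1") to their labels (e.g. "MainZone").
    labels = {
        tag.get("ID"): tag.get("LABEL")
        for tag in root.iterfind(".//a:OtherTag", ALTO_NS)
    }
    rows = []
    for block in root.iterfind(".//a:TextBlock", ALTO_NS):
        zone = labels.get(block.get("TAGREFS"), "unknown")
        for line in block.iterfind("a:TextLine", ALTO_NS):
            # A line may hold one String per word or a single String.
            text = " ".join(
                s.get("CONTENT", "") for s in line.iterfind("a:String", ALTO_NS)
            )
            rows.append((zone, text))
    return rows


if __name__ == "__main__":
    for zone, text in extract_lines(SAMPLE_ALTO):
        print(zone, "|", text)
```

Grouping lines by zone label in this way is what would let a later step decide which TEI element each zone maps to.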


With the -e option enabled, the XML-TEI files are then parsed with the NER model to detect the following entities:

  • <persName>
  • <orgName>
  • <date>
  • <placeName>
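As a rough illustration of how detected entities could end up as TEI elements, the sketch below wraps character spans in the four tags listed above. It is an assumption-laden example, not the project's implementation: the label names (`PER`, `ORG`, `DATE`, `LOC`), the mapping, and the function `tag_entities` are hypothetical, and the spans are passed in by hand instead of coming from the NER model.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from NER labels to the TEI elements listed above.
LABEL_TO_TEI = {
    "PER": "persName",
    "ORG": "orgName",
    "DATE": "date",
    "LOC": "placeName",
}


def tag_entities(text, spans):
    """Wrap each (start, end, label) span of `text` in its TEI element.

    Unknown labels and spans overlapping an earlier span are skipped.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        element = LABEL_TO_TEI.get(label)
        if element is None or start < cursor:
            continue
        parts.append(escape(text[cursor:start]))
        parts.append(f"<{element}>{escape(text[start:end])}</{element}>")
        cursor = end
    parts.append(escape(text[cursor:]))
    return "".join(parts)


if __name__ == "__main__":
    sentence = "Letter from Ana to Temuco"
    found = [(12, 15, "PER"), (19, 25, "LOC")]
    print(tag_entities(sentence, found))
```

Escaping the surrounding text keeps the output well-formed when a transcription contains characters such as `&` or `<`.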

An entity-fishing system, implemented in src/enrichment/, queries Wikidata via SPARQL for the places and people detected.
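For readers unfamiliar with querying Wikidata, here is a minimal sketch of a SPARQL lookup for a person by exact label. It does not reproduce src/enrichment/sparql.py: the function names, the query shape, the default language, and the User-Agent string are all assumptions made for illustration. Only the public Wikidata endpoint and the `P31`/`Q5` ("instance of" / "human") identifiers are taken as given.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def build_person_query(name, lang="en", limit=5):
    """Build a SPARQL query for humans whose label is exactly `name`."""
    # Escape backslashes and double quotes so the name stays a valid literal.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item rdfs:label "{safe}"@{lang} ;\n'
        "        wdt:P31 wd:Q5 .\n"
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}". }}\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def run_query(query):
    """Send the query to Wikidata and return the JSON result bindings."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        f"{WIKIDATA_SPARQL}?{params}",
        # Hypothetical identifier; Wikidata asks clients to send a User-Agent.
        headers={"User-Agent": "alto-tei-readme-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    # Printing the query needs no network access; run_query() does.
    print(build_person_query("Gabriela Mistral"))
```

A place lookup would follow the same pattern with a different class constraint in place of `wd:Q5`.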


├── data
│   ├── database
│   │   ├── database
│   │   │   ├── ODD_Letters.html
│   │   │   └── ODD_Letters.xml
│   │   ├── araucania_inventory.csv
│   │   ├── entities.csv
│   │   ├── logs.txt
│   │   ├── resp.json
│   │   └── schema_letter.rng
│   ├── models
│   │   └── araucania_NER_model
│   ├── input
│   └── output
├── requirements.txt
├── run.py
└── src
    ├── __init__.py
    ├── build_body.py
    ├── build_enrich.py
    ├── build_tei.py
    ├── enrichment
    │   ├── __init__.py
    │   ├── nlp.py
    │   ├── query.py
    │   └── sparql.py
    ├── opt
    │   ├── __init__.py
    │   ├── extract_alto.py
    │   ├── inventory.py
    │   ├── surface_and_desc.py
    │   └── utils.py
    ├── sourceDoc_attributes.py
    ├── sourceDoc.py
    ├── teiheader.py
    └── text_data.py