/grobid

Automatic XML TEI encoding of catalogues using GROBID technologies


Automatic encoding of manuscript catalogues with GROBID

GROBID-Dictionaries was originally designed for dictionaries; we are trying to use it with manuscript catalogues.

Credits

GROBID-Dictionaries is developed by Mohamed Khemakhem (GitHub).

More info on GROBID technologies can be found here.

Research on catalogues and training is carried out by Simon Gabay.

Corpus

Tests are carried out on scans of the Revue des autographes, directed by Gabriel Charavay (data.bnf).

Methodology

The PDFs are OCRed with Transkribus. Our model is available on request.

The GROBID model is trained on four excerpts (three pages each) of the corpus (toyData/dataset/dictionary-segmentation/corpus>PDF).

Files

Training data are available in ToyData.

Sample PDFs and tools to manipulate them (cpdf) are in TrainingTools.
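As an illustration of how cpdf can be used to cut excerpts such as the four three-page samples mentioned under Methodology, here is a minimal Python sketch. It only builds the command lines and does not run them. The even spacing of the excerpts, the function names, and the file names (catalogue.pdf, excerpt_N.pdf) are assumptions made for the example; this README does not state how the excerpts were actually chosen.

```python
import shlex


def excerpt_ranges(total_pages, n_excerpts=4, pages_each=3):
    """Return evenly spaced (first, last) 1-based page ranges.

    Even spacing is an assumption for this sketch, not the method
    used to build the training set.
    """
    if n_excerpts * pages_each > total_pages:
        raise ValueError("document too short for the requested excerpts")
    # Each excerpt starts at the beginning of one equal-sized slice.
    stride = total_pages // n_excerpts
    ranges = []
    for i in range(n_excerpts):
        first = i * stride + 1
        ranges.append((first, first + pages_each - 1))
    return ranges


def cpdf_command(src, first, last, dest):
    """Build the cpdf argument list that extracts pages first..last."""
    return ["cpdf", src, f"{first}-{last}", "-o", dest]


if __name__ == "__main__":
    # Hypothetical 48-page scan; file names are placeholders.
    for n, (first, last) in enumerate(excerpt_ranges(48), start=1):
        cmd = cpdf_command("catalogue.pdf", first, last, f"excerpt_{n}.pdf")
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell where cpdf is installed, or the argument list can be passed to subprocess.run to perform the extraction directly.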

Paper

A first paper was presented at TEI 2018 in Tokyo:

@inproceedings{khemakhem:hal-01819505,
  TITLE = {{Automatically Encoding Encyclopedic-like Resources in TEI}},
  AUTHOR = {Khemakhem, Mohamed and Romary, Laurent and Gabay, Simon and Bohbot, Herv{\'e} and Frontini, Francesca and Luxardo, Giancarlo},
  URL = {https://hal.archives-ouvertes.fr/hal-01819505},
  BOOKTITLE = {{TEI 2018}},
  ADDRESS = {Tokyo, Japan},
  YEAR = {2018},
  MONTH = sep,
  KEYWORDS = {Manuscripts auction catalogues, GROBID-Dictionaries, TEI, Dictionaries},
  PDF = {https://hal.inria.fr/hal-01819505/document},
  HAL_ID = {hal-01819505},
}

Licence

Regarding GROBID, cf. here.

Regarding the corpus: extracted data is CC-BY.
