/moonshines

A single-hand set of French GT based on texts copied from Alcools

Creative Commons Attribution 4.0 InternationalCC-BY-4.0

Moonshines

CC BY 4.0

characters badge regions badge lines badge files badge

License

This dataset is published under the CC-BY 4.0 License.

To cite this dataset:

Chagué, A. (2023). moonshines (Version 2.0.0) [Data set]. https://github.com/alix-tz/moonshines

Description

This dataset is composed of pages of text written in 2023 by a single person, copying texts taken from Guillaume Apollinaire's poems published in Alcools.

The dataset is divided into two parts:

  • data/ which is intended to train transcription models,
  • test/ which is intended for test.

Transcription guidelines

The transcription strictly follows what is written on the images, including accentuation or capitalization errors.

The segmentation follows the SegmOnto ontology and mostly relies on MainZone and DefaultLine.

Possible limitations

Since the text follows the structure of Alcools, there is almost no ponctuation in this ground truth. Besides, most of the lines start with a capital letter.