Adding dataset Belfort
starride-teklia opened this issue · 5 comments
Hi!
We would like to share a dataset from the Belfort City Council.
Transcriptions are in .txt format; is this acceptable to you? We have up to four transcriptions for each text-line (two from annotators, two from automatic models) and I am not sure if this is compatible with the PAGE XML format.
The aim of this dataset is to explore strategies for data selection and model training when multiple uncertain transcriptions are available (see our paper).
Here is our dataset YAML file:
schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Belfort
url: https://zenodo.org/record/8041668
authors:
  - name: Solène
    surname: Tarride
    orcid: 0000-0001-6174-9865
  - name: Tristan
    surname: Faine
  - name: Mélodie
    surname: Boillet
    orcid: 0000-0002-0618-7852
  - name: Harold
    surname: Mouchère
    orcid: 0000-0001-6220-7216
  - name: Christopher
    surname: Kermorvant
    orcid: 0000-0002-7508-4080
institutions: []
description: >-
  This dataset includes minutes of Belfort municipal council drawn up between
  1790 and 1946. Documents include deliberations, lists of councillors,
  convocations, and agendas. The dataset includes 24,105 text-line images that
  were automatically detected from pages. Up to 4 transcriptions are available
  for each line image: two from humans, and two from automatic models.
project-name: Handwritten Text Recognition from Crowdsourced Annotations
project-website: https://arxiv.org/abs/2306.10878
language:
  - fra
production-software: Callico
script:
  - iso: Latn
script-type: only-manuscript
time:
  notBefore: '1790'
  notAfter: '1946'
hands:
  count: more-than-10
  precision: estimated
license:
  - name: CC-BY 4.0
    url: https://creativecommons.org/licenses/by/4.0/
format: Page-XML
sources:
  - reference: >-
      Solène Tarride, Tristan Faine, Mélodie Boillet, Harold Mouchère, &
      Christopher Kermorvant. (2023). The Belfort dataset: Handwritten Text
      Recognition from Crowdsourced Annotations [Data set]. 7th International
      Workshop on Historical Document Imaging and Processing (HIP'23), San
      José, California, USA. Zenodo. https://doi.org/10.5281/zenodo.8041668
    link: ''
volume:
  - metric: lines
    count: 24105
Hi @starride-teklia!
I think we already accepted line-level datasets. I need to check why this is not proposed by the form.
Would you be so kind as to clarify in your description where the ground truth lies in the Transcriptions folder? That would allow people to use the dataset more easily, without being surprised by the structure of the zip.
Using this information, I will count the character volume and add your dataset to HTR-United.
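For anyone who wants to reproduce a character count locally, here is a minimal sketch in Python. It is not the script HTR-United uses; it assumes each transcription is a UTF-8 `.txt` file, and it strips leading and trailing whitespace before counting, which may differ from the official counting rules. The function name `count_characters` is made up for this example.

```python
from pathlib import Path


def count_characters(folder: Path) -> tuple[int, int]:
    """Return (number of .txt files, total characters) found under `folder`.

    Assumption: one transcription per UTF-8 text file; surrounding
    whitespace (including the trailing newline) is not counted.
    """
    n_files = 0
    n_chars = 0
    for txt in sorted(folder.rglob("*.txt")):
        text = txt.read_text(encoding="utf-8").strip()
        n_files += 1
        n_chars += len(text)
    return n_files, n_chars
```

It could be called as `count_characters(Path("Transcriptions"))` on the unzipped archive, or on a single subfolder if only one transcription source should be counted.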
Hi @PonteIneptique, thanks for your very quick reply!
Here is the YAML file with the updated description. I hope it is clearer this way:
schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Belfort
url: https://zenodo.org/record/8041668
authors:
  - name: Solène
    surname: Tarride
    orcid: 0000-0001-6174-9865
  - name: Tristan
    surname: Faine
  - name: Mélodie
    surname: Boillet
    orcid: 0000-0002-0618-7852
  - name: Harold
    surname: Mouchère
    orcid: 0000-0001-6220-7216
  - name: Christopher
    surname: Kermorvant
    orcid: 0000-0002-7508-4080
institutions: []
description: >
  This dataset includes minutes of Belfort municipal council drawn up between
  1790 and 1946. Documents include deliberations, lists of councillors,
  convocations, and agendas. The dataset includes 24,105 text-line images that
  were automatically detected from pages.
  Up to four transcriptions are available for each line image:
  * two from human annotators (in `Transcriptions/callico_1/` and
  `Transcriptions/callico_2/`)
  * two from automatic models (in `Transcriptions/dan/` and
  `Transcriptions/pylaia/`)
project-name: Handwritten Text Recognition from Crowdsourced Annotations
project-website: https://arxiv.org/abs/2306.10878
language:
  - fra
production-software: Callico
script:
  - iso: Latn
script-type: only-manuscript
time:
  notBefore: '1790'
  notAfter: '1946'
hands:
  count: more-than-10
  precision: estimated
license:
  - name: CC-BY 4.0
    url: https://creativecommons.org/licenses/by/4.0/
format: Page-XML
sources:
  - reference: >-
      Solène Tarride, Tristan Faine, Mélodie Boillet, Harold Mouchère, &
      Christopher Kermorvant. (2023). The Belfort dataset: Handwritten Text
      Recognition from Crowdsourced Annotations [Data set]. 7th International
      Workshop on Historical Document Imaging and Processing (HIP'23), San
      José, California, USA. Zenodo. https://doi.org/10.5281/zenodo.8041668
    link: ''
volume:
  - metric: lines
    count: 24105
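To illustrate how the four folders named in the description could be read together, here is a rough Python sketch. The folder names come from the description above; everything else is an assumption, in particular that each folder holds one UTF-8 `.txt` file per line and that the file name (without extension) identifies the same line image across folders. The function name `collect_transcriptions` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Folder names taken from the dataset description.
SOURCES = ("callico_1", "callico_2", "dan", "pylaia")


def collect_transcriptions(root: Path) -> dict[str, dict[str, str]]:
    """Map each line identifier to its available transcriptions.

    Assumption: a file's stem (name without ".txt") identifies the line
    image and is shared across the source folders.
    """
    per_line: dict[str, dict[str, str]] = defaultdict(dict)
    for source in SOURCES:
        folder = root / source
        if not folder.is_dir():
            continue  # a source may be missing; lines have "up to four"
        for txt in folder.rglob("*.txt"):
            per_line[txt.stem][source] = txt.read_text(encoding="utf-8").strip()
    return dict(per_line)
```

Calling `collect_transcriptions(Path("Transcriptions"))` would then give, for each line, a dictionary with between one and four entries, which makes it easy to check, for example, whether the two human annotators agree on a line.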
Hello! Thank you for your contribution!
We will have to change the value of the format field, since the ground truth is not Page XML but pairs of line images and text files.
@PonteIneptique : It will have an impact on the schema because in the current definition, we only allow these 2 values:
"format": {
  "description": "Format of the ground truth",
  "type": "string",
  "enum": ["Alto-XML", "Page-XML"]
},
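To make the consequence concrete, here is a small Python sketch of how an `enum` constraint like this one rejects any other value, and how adding a value would change that. It does not use the real HTR-United validation code; the check is written by hand, and the extra value `Line-Text-Pairs` is purely hypothetical, since the actual name would be decided in the schema issue.

```python
# The "format" definition quoted above, as a Python dict.
FORMAT_SCHEMA = {
    "description": "Format of the ground truth",
    "type": "string",
    "enum": ["Alto-XML", "Page-XML"],
}

# Hypothetical extension: "Line-Text-Pairs" is an invented placeholder name.
EXTENDED_SCHEMA = {
    **FORMAT_SCHEMA,
    "enum": FORMAT_SCHEMA["enum"] + ["Line-Text-Pairs"],
}


def is_valid_format(value, schema=FORMAT_SCHEMA) -> bool:
    """Hand-written check: the value must be a string listed in the enum."""
    return isinstance(value, str) and value in schema["enum"]
```

With the current definition, `is_valid_format("Page-XML")` holds while any description of line/text pairs fails; with the extended enum, the new value would pass.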
I think it's time to open a new issue in the schema!
It's now possible ;)
I'll make the PR