HTR-United/htr-united

Adding dataset Belfort

starride-teklia opened this issue · 5 comments

Hi!

We would like to share a dataset from the Belfort City Council.

Transcriptions are in .txt format; is this acceptable to you? We have up to four transcriptions for each text line (two from annotators, two from automatic models), and I am not sure whether this is compatible with the PAGE XML format.

The aim of this dataset is to explore strategies for data selection and model training when multiple uncertain transcriptions are available (see our paper).

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Belfort
url: https://zenodo.org/record/8041668
authors:
 - name: Solène
   surname: Tarride
   orcid: 0000-0001-6174-9865
 - name: Tristan
   surname: Faine
 - name: Mélodie
   surname: Boillet
   orcid: 0000-0002-0618-7852
 - name: Harold
   surname: Mouchère
   orcid: 0000-0001-6220-7216
 - name: Christopher
   surname: Kermorvant
   orcid: 0000-0002-7508-4080
institutions: []
description: >-
 This dataset includes minutes of Belfort municipal council drawn up between
 1790 and 1946. Documents include deliberations, lists of councillors,
 convocations, and agendas. The dataset includes 24,105 text-line images that
 were automatically detected from pages. Up to 4 transcriptions are available
 for each line image: two from humans, and two from automatic models.
project-name: Handwritten Text Recognition from Crowdsourced Annotations
project-website: https://arxiv.org/abs/2306.10878
language:
 - fra
production-software: Callico
script:
 - iso: Latn
script-type: only-manuscript
time:
 notBefore: '1790'
 notAfter: '1946'
hands:
 count: more-than-10
 precision: estimated
license:
 - name: CC-BY 4.0
   url: https://creativecommons.org/licenses/by/4.0/
format: Page-XML
sources:
 - reference: >-
     Solène Tarride, Tristan Faine, Mélodie Boillet, Harold Mouchère, &
     Christopher Kermorvant. (2023). The Belfort dataset: Handwritten Text
     Recognition from Crowdsourced Annotations [Data set]. 7th International
     Workshop on Historical Document Imaging and Processing (HIP'23), San
     José, California, USA. Zenodo. https://doi.org/10.5281/zenodo.8041668
   link: ''
volume:
 - metric: lines
   count: 24105

Hi @starride-teklia!
I think we have already accepted line-level datasets. I need to check why the form does not offer this option.

Would you be so kind as to clarify in your description where the ground truth lies in the Transcriptions folder? That would let people use the dataset more easily, without being surprised by the structure of the zip.

Using this information, I will count the character volume and add your dataset to HTR-United.

Hi @PonteIneptique, thanks for your very quick reply!

Here is the YAML file with the updated description. I hope it is clearer this way:

schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Belfort
url: https://zenodo.org/record/8041668
authors:
  - name: Solène
    surname: Tarride
    orcid: 0000-0001-6174-9865
  - name: Tristan
    surname: Faine
  - name: Mélodie
    surname: Boillet
    orcid: 0000-0002-0618-7852
  - name: Harold
    surname: Mouchère
    orcid: 0000-0001-6220-7216
  - name: Christopher
    surname: Kermorvant
    orcid: 0000-0002-7508-4080
institutions: []
description: >
  This dataset includes minutes of Belfort municipal council drawn up between
  1790 and 1946. Documents include deliberations, lists of councillors,
  convocations, and agendas. The dataset includes 24,105 text-line images that
  were automatically detected from pages. 

  Up to four transcriptions are available for each line image: 

  * two from human annotators (in `Transcriptions/callico_1/` and
  `Transcriptions/callico_2/`)

  * two from automatic models (in `Transcriptions/dan/` and
  `Transcriptions/pylaia/`) 
project-name: Handwritten Text Recognition from Crowdsourced Annotations
project-website: https://arxiv.org/abs/2306.10878
language:
  - fra
production-software: Callico
script:
  - iso: Latn
script-type: only-manuscript
time:
  notBefore: '1790'
  notAfter: '1946'
hands:
  count: more-than-10
  precision: estimated
license:
  - name: CC-BY 4.0
    url: https://creativecommons.org/licenses/by/4.0/
format: Page-XML
sources:
  - reference: >-
      Solène Tarride, Tristan Faine, Mélodie Boillet, Harold Mouchère, &
      Christopher Kermorvant. (2023). The Belfort dataset: Handwritten Text
      Recognition from Crowdsourced Annotations [Data set]. 7th International
      Workshop on Historical Document Imaging and Processing (HIP'23), San
      José, California, USA. Zenodo. https://doi.org/10.5281/zenodo.8041668
    link: ''
volume:
  - metric: lines
    count: 24105
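
For anyone who wants to reproduce the character count mentioned earlier in this thread, here is a minimal Python sketch. It assumes the Zenodo archive has been extracted locally, that each transcription is a UTF-8 `.txt` file holding the text of one line image, and that the files sit under the folders named in the description. The function name `count_volume` is hypothetical and not part of any HTR-United tooling.

```python
from pathlib import Path


def count_volume(folder: Path) -> tuple[int, int]:
    """Return (number of .txt files, total characters) found under `folder`."""
    n_files = 0
    n_chars = 0
    for txt in sorted(folder.rglob("*.txt")):
        # Assumption: one UTF-8 file per line image; surrounding whitespace is not counted.
        text = txt.read_text(encoding="utf-8").strip()
        n_files += 1
        n_chars += len(text)
    return n_files, n_chars


if __name__ == "__main__":
    # Folder name taken from the description above; adjust to the extraction path.
    root = Path("Transcriptions/callico_1")
    if root.is_dir():
        lines, chars = count_volume(root)
        print(f"{lines} lines, {chars} characters")
```

Running it once per folder (`callico_1`, `callico_2`, `dan`, `pylaia`) would give one count per transcription source; which of these should feed the `volume` field is a decision for the maintainers.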

Hello! Thank you for your contribution!

We will have to change the value of the format field, since the dataset is not PAGE XML but pairs of line images and text files.

@PonteIneptique: this will have an impact on the schema, because the current definition only allows these two values:

    "format": {
        "description": "Format of the ground truth",
        "type": "string",
        "enum": ["Alto-XML", "Page-XML"]
    },
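
To make the constraint concrete, here is a small Python sketch that checks a `format` value against the enum quoted above. It uses only the standard library rather than a full JSON Schema validator, and the value `line-text-pairs` is a made-up placeholder for illustration, not a value proposed for the schema.

```python
import json

# Excerpt of the schema quoted above, wrapped in braces so it parses as JSON.
SCHEMA_EXCERPT = json.loads("""
{
    "format": {
        "description": "Format of the ground truth",
        "type": "string",
        "enum": ["Alto-XML", "Page-XML"]
    }
}
""")


def format_is_allowed(value, schema=SCHEMA_EXCERPT):
    """Return True if `value` is a string listed in the `format` enum."""
    return isinstance(value, str) and value in schema["format"]["enum"]


if __name__ == "__main__":
    # "line-text-pairs" is a hypothetical value used only to show the rejection.
    for candidate in ("Page-XML", "line-text-pairs"):
        print(candidate, format_is_allowed(candidate))
```

With the current enum, any value other than `Alto-XML` or `Page-XML` is rejected, which is why a line-and-text dataset needs a schema change before its catalog entry can validate.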

I think it's time to open a new issue in the schema!

It's now possible ;)
I'll make the PR