Adding e-NDP dataset
Lucaterre opened this issue · 3 comments
Hi HTR-United team!
Thank you again for your open initiative!
There is a new submission for the e-NDP dataset that we have already referenced on Zenodo in the context of the e-NDP ANR project.
I hope my description is correct, let me know if I need to change anything.
Here is our dataset YAML file:
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: eNDP-ground-truth
url: https://zenodo.org/records/7575693
authors:
  - name: Julie
    surname: Claustre
    orcid: 0000-0001-8504-3920
    roles:
      - transcriber
      - project-manager
  - name: Darwin
    surname: Smith
    roles:
      - transcriber
      - project-manager
  - name: Sergio
    surname: Torres Aguilar
    orcid: 0000-0002-1801-3147
    roles:
      - aligner
      - quality-control
      - support
  - name: Isabelle
    surname: Bretthauer
    orcid: 0000-0002-1780-772X
    roles:
      - transcriber
  - name: Pierre
    surname: Brochard
    orcid: 0000-0003-1955-556X
    roles:
      - quality-control
  - name: Olivier
    surname: Canteaut
    orcid: 0000-0003-4586-1931
    roles:
      - transcriber
      - quality-control
  - name: Emilie
    surname: Cottereau
    orcid: 0000-0001-6880-2112
    roles:
      - transcriber
  - name: Fabrice
    surname: Delivré
    roles:
      - transcriber
  - name: Mathilde
    surname: Denglos
    roles:
      - transcriber
  - name: Vincent
    surname: Jolivet
    orcid: 0000-0003-0600-0362
    roles:
      - aligner
      - quality-control
      - support
  - name: Véronique
    surname: Julerot
    roles:
      - transcriber
  - name: Thierry
    surname: Kouamé
    orcid: 0000-0001-9728-2988
    roles:
      - transcriber
  - name: Elisabeth
    surname: Lusset
    orcid: 0000-0003-1572-1890
    roles:
      - transcriber
  - name: Anne
    surname: Massoni
    orcid: 0000-0002-1690-9804
    roles:
      - transcriber
  - name: Sebastien
    surname: Nadiras
    roles:
      - transcriber
  - name: Nicolas
    surname: Perreaux
    orcid: 0000-0002-0103-817X
    roles:
      - transcriber
  - name: Hugo
    surname: Regazzi
    orcid: 0000-0002-3059-2874
    roles:
      - transcriber
  - name: Mathilde
    surname: Treglia
    roles:
      - transcriber
institutions: []
description: >-
  The e-NDP project : collaborative digital edition of the Chapter registers of
  Notre-Dame of Paris (1326-1504). Ground-truth for handwriting text recognition
  (HTR) on late medieval manuscripts.
project-name: >-
  The e-NDP project : collaborative digital edition of the Chapter registers of
  Notre-Dame of Paris (1326-1504). Ground-truth for handwriting text recognition
  (HTR) on late medieval manuscripts.
project-website: https://endp.hypotheses.org/presentation
language:
  - fra
  - lat
production-software: eScriptorium + Kraken
automatically-aligned: true
script:
  - iso: Latn
    qualify: cursive
script-type: only-manuscript
time:
  notBefore: '1326'
  notAfter: '1504'
hands:
  count: more-than-10
  precision: estimated
license:
  name: CC-BY 4.0
  url: https://creativecommons.org/licenses/by/4.0/
format: Page-XML
volume:
  - metric: pages
    count: 512
  - metric: lines
    count: 34231
  - metric: characters
    count: 3320407
  - metric: files
    count: 512
  - metric: regions
    count: 2448
transcription-guidelines: >-
  - The abbreviations have been resolved, both those by suspension (facimꝰ -->
  facimus) and by contraction (dñi --> domini). Likewise, those using
  conventional signs (⁊ --> et ; ꝓ --> pro) have been resolved.
  - The named entities (names of persons, places and institutions) have been
  capitalized. The beginning of a block of text as well as the original capitals
  used by the notary are also capitalized.
  - The consonantal i and u characters have been transcribed as j and v in both
  French and Latin.
  - The punctuation marks used in the text (. and /) have been transcribed, but
  the transcription has not been standardized with modern punctuation.
  - Corrections and words that appear cancelled in the manuscript have been
  transcribed between two $ signs, one at the beginning and one at the end.
  - More specific transcription rules can be found in the file
  transcription_guidelines.pdf in the Zenodo repository.
Hello Lucas,
Thank you for the contribution!
I have 3 suggestions:
- for `project-name` you put:
project-name: >-
  The e-NDP project : collaborative digital edition of the Chapter registers of
  Notre-Dame of Paris (1326-1504). Ground-truth for handwriting text recognition
  (HTR) on late medieval manuscripts.
I think "e-NDP project" is enough, or at least this entry doesn't need the "Ground-truth for handwriting text recognition (HTR) on late medieval manuscripts" part.
- for `description`, you put:
description: >-
  The e-NDP project : collaborative digital edition of the Chapter registers of
  Notre-Dame of Paris (1326-1504). Ground-truth for handwriting text recognition
  (HTR) on late medieval manuscripts.
You could provide more details (consider someone browsing through the HTR-United catalog and trying to get a good understanding of the different datasets).
- for `title` you can keep "eNDP-ground-truth", but you could also consider giving it a more natural language form (even if it is just "eNDP Ground Truth").
I created a pull request (#153) so feel free to modify the yml file directly if you want to make any change!
Thank you again!
Hi @alix-tz,
Thank you very much for your reply and for PR #153!
Here is my YAML file with updated fields:
schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: ANR e-NDP Ground Truth
url: https://zenodo.org/records/7575693
authors:
  - name: Julie
    surname: Claustre
    orcid: 0000-0001-8504-3920
    roles:
      - transcriber
      - project-manager
  - name: Darwin
    surname: Smith
    roles:
      - transcriber
      - project-manager
  - name: Sergio
    surname: Torres Aguilar
    orcid: 0000-0002-1801-3147
    roles:
      - aligner
      - quality-control
      - support
  - name: Isabelle
    surname: Bretthauer
    orcid: 0000-0002-1780-772X
    roles:
      - transcriber
  - name: Pierre
    surname: Brochard
    orcid: 0000-0003-1955-556X
    roles:
      - quality-control
  - name: Olivier
    surname: Canteaut
    orcid: 0000-0003-4586-1931
    roles:
      - transcriber
      - quality-control
  - name: Emilie
    surname: Cottereau
    orcid: 0000-0001-6880-2112
    roles:
      - transcriber
  - name: Fabrice
    surname: Delivré
    roles:
      - transcriber
  - name: Mathilde
    surname: Denglos
    roles:
      - transcriber
  - name: Vincent
    surname: Jolivet
    orcid: 0000-0003-0600-0362
    roles:
      - aligner
      - quality-control
      - support
  - name: Véronique
    surname: Julerot
    roles:
      - transcriber
  - name: Thierry
    surname: Kouamé
    orcid: 0000-0001-9728-2988
    roles:
      - transcriber
  - name: Elisabeth
    surname: Lusset
    orcid: 0000-0003-1572-1890
    roles:
      - transcriber
  - name: Anne
    surname: Massoni
    orcid: 0000-0002-1690-9804
    roles:
      - transcriber
  - name: Sebastien
    surname: Nadiras
    roles:
      - transcriber
  - name: Nicolas
    surname: Perreaux
    orcid: 0000-0002-0103-817X
    roles:
      - transcriber
  - name: Hugo
    surname: Regazzi
    orcid: 0000-0002-3059-2874
    roles:
      - transcriber
  - name: Mathilde
    surname: Treglia
    roles:
      - transcriber
institutions: []
description: >-
  This repository hosts HTR ground truth created within the context of the ANR e-NDP project.
  This dataset is based on 512 pages from the 26 registers of the Notre-Dame de Paris cathedral chapter.
  The volumes containing the chapter conclusions were conceived to serve as memorial records, but above all as documents for regular use and consultation in the daily practice of administration and management.
  The registers were written in a cursive script (ca. late 13th to 16th century), mainly in Latin, with the
  rest in French. There are no fewer than 18 hands in these pages.
  The transcriptions were manually completed in two rounds by a group of 12 contributors, including historians and paleographers, over the course of 2021-2022 using eScriptorium.
project-name: >-
  ANR e-NDP
project-website: https://endp.hypotheses.org/presentation
language:
  - fra
  - lat
production-software: eScriptorium + Kraken
automatically-aligned: true
script:
  - iso: Latn
    qualify: cursive
script-type: only-manuscript
time:
  notBefore: '1326'
  notAfter: '1504'
hands:
  count: more-than-10
  precision: estimated
license:
  name: CC-BY 4.0
  url: https://creativecommons.org/licenses/by/4.0/
format: Page-XML
volume:
  - metric: pages
    count: 512
  - metric: lines
    count: 34231
  - metric: characters
    count: 3320407
  - metric: files
    count: 512
  - metric: regions
    count: 2448
transcription-guidelines: >-
  - The abbreviations have been resolved, both those by suspension (facimꝰ -->
  facimus) and by contraction (dñi --> domini). Likewise, those using
  conventional signs (⁊ --> et ; ꝓ --> pro) have been resolved.
  - The named entities (names of persons, places and institutions) have been
  capitalized. The beginning of a block of text as well as the original capitals
  used by the notary are also capitalized.
  - The consonantal i and u characters have been transcribed as j and v in both
  French and Latin.
  - The punctuation marks used in the text (. and /) have been transcribed, but
  the transcription has not been standardized with modern punctuation.
  - Corrections and words that appear cancelled in the manuscript have been
  transcribed between two $ signs, one at the beginning and one at the end.
  - More specific transcription rules can be found in the file
  transcription_guidelines.pdf in the Zenodo repository.
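
For anyone reusing the transcriptions, the `$` convention for cancelled text described in the guidelines can be handled with a small filter. This is only a minimal sketch: it assumes each cancelled passage sits between exactly one opening and one closing `$` on the same line, and the function names and the sample line are hypothetical, not taken from the dataset or its tooling.

```python
import re

# Assumption: a cancelled passage is wrapped in a pair of "$" signs on one line,
# e.g. "$capitulantes$". Unpaired "$" signs are left untouched.
CANCELLED = re.compile(r"\$([^$]*)\$")


def drop_cancelled(line: str) -> str:
    """Remove cancelled passages together with their "$" markers."""
    without = CANCELLED.sub("", line)
    # Collapse the whitespace runs left behind by the removal.
    return re.sub(r"\s{2,}", " ", without).strip()


def keep_cancelled(line: str) -> str:
    """Keep the cancelled text but drop the "$" markers around it."""
    return CANCELLED.sub(r"\1", line)


# Hypothetical transcription line, made up for illustration only.
sample = "Item domini $capitulantes$ concluserunt"

print(drop_cancelled(sample))
print(keep_cancelled(sample))
```

Which variant is appropriate depends on the use case: dropping the passages gives a reading text, while keeping them stays closer to what is visible on the page.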
Cool, thank you for the update! I added the modifications to the PR and am now merging it. :)