DiSSCo/SDR

Curate pinned insects dataset

benscott opened this issue · 25 comments

Training images of pinned insects, with:

  • Labelled objects
  • Labelled lines
  • Verbatim text

@mlbonhomme and @martinteklia

I have started annotating the Pinned Insects in ArkIndex (https://arkindex.teklia.com/element/f53b200c-8f15-4e5d-b4f5-6db047b95d71) and had a couple of questions:

Can you add some additional types?

  • Specimen
  • Label
  • Scale bar
  • Barcode

How many examples should I annotate?

For our DataMatrix barcodes do you want the exact barcode region labelled as in this image?
Pinned Insect ArkIndex Example

I added the additional types you asked for.

How many examples should I annotate?

I think 50 pages should be enough at first.

For our DataMatrix barcodes do you want the exact barcode region labelled as in this image?

Yes, that's good.
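Since the exact barcode regions are annotated, they could later be cropped and decoded automatically. This is just an illustration, not part of the agreed workflow: a minimal sketch assuming pylibdmtx and Pillow are installed, with a hypothetical file name and bounding box standing in for the ArkIndex polygon.

```python
# Minimal sketch: decode a DataMatrix barcode from an annotated region.
# The file name and crop box (left, top, right, bottom) are hypothetical;
# in practice they would come from the ArkIndex polygon coordinates.
from PIL import Image
from pylibdmtx.pylibdmtx import decode

image = Image.open("specimen_010266087.jpg")          # hypothetical file name
barcode_region = image.crop((1200, 300, 1450, 550))   # hypothetical bbox

for result in decode(barcode_region):
    print(result.data.decode("utf-8"))                # e.g. the specimen number
```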

To add transcriptions to text lines, you should click the button that looks like A+, instead of changing the name of the text line element.

@mlbonhomme wrote a doc about annotations at some point, and she wanted to discuss it with you.

Created repository for training datasets: https://github.com/DiSSCo/sdr-datasets

There is a now-outdated annotation guide here: https://gitlab.com/mlbonhomme/arkindex_annotation/-/tree/master/SYNTHESYS. I'll be updating it today and will share the link here when it's done!

Here are the updated annotation guidelines: https://notes.teklia.com/s/zO3O4i_-p. I am of course available to discuss them if there is anything strange or wrong about them, and if you need different element types they can always be added.

I took a look at the existing annotations and I think some of them will need to be deleted or corrected, as I saw that the element type "decoration" was used, and some transcriptions were put as the name of the text line elements. You should all have the necessary user rights to do this.

@mlbonhomme is Specimen 010266087 correctly annotated?

@llivermore it looks fine to me! Maybe Label 12 should be a barcode/qrcode if we want to try to identify those? Or maybe only the QR code itself should be annotated as such; I don't really know what would be best, or if it would make much of a difference when training later on.

@mlbonhomme Apologies - Label 12 was a duplicate of Barcode 13. I had accidentally renamed the types and duplicated it while tidying up >_< It should be fixed now. I'll go through and fix the rest and will ask one of my team to do some more next Monday (2021-05-10).

@mlbonhomme I have annotated the first 20 specimen images. Could you check them and give feedback/suggestions for improvement?

I had a few questions/notes:

  • I have not added text lines for the scale bar text (e.g. "mm" or the numbers).
  • On Specimen 010516659, are the shapes of Text line 1 and Text line 2 okay? On the same specimen, is a single space okay to represent the very large gap between text in Text line 12?
  • On Specimen 010517120 and others, how important is it to capture white space in general? See Text line 1, where there are gaps between the periods but a transcriber would normally interpret this as a date string without spaces.
  • On Specimen 010608284, is it best practice to have separate annotations for both the label and the text line on small labels, in this case Label 2 and Text line 9?
  • If we have any characters or words that we are unsure about, can we indicate or flag uncertainty in any way (e.g. for review by one of our handwriting specialists)?
  • There is no need to mark white space in any special way, and text lines with gaps in them are fine.
  • It is a little redundant (and annoying to annotate) to have both a label and a text line when there is only one text line in the label, but I think it will be simpler later on than having to deal with special cases (especially since the annotations are not hierarchical, so to see whether a text line is "inside" a label we would have to compare the polygons).
  • To flag lines where words/characters are uncertain, we could for example create a class, like to-review, that you would put on the relevant text lines. Whoever does the verification could then filter the text lines to see only the ones marked as to-review, and remove the class once the text is clarified/corrected (see the sketch after this list).
  • As for the text lines in scale bars, it's ok not to annotate them; if later automatic processes find text lines in those labels anyway and they are correctly identified, we can ignore them.
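To make the proposed to-review workflow concrete, here is a minimal sketch that filters exported text lines by class. The JSON export shape and the "to-review" class name are assumptions for illustration, not a documented ArkIndex format or API.

```python
# Sketch of the proposed review workflow: list text lines carrying a
# hypothetical "to-review" class so a handwriting specialist can check them.
import json

with open("text_lines.json") as fh:   # hypothetical export file
    lines = json.load(fh)

to_review = [l for l in lines if "to-review" in l.get("classes", [])]
for line in to_review:
    print(line["id"], line.get("transcription", ""))

# Once the specialist confirms the text, the "to-review" class would be
# removed from the element in ArkIndex.
```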

I had a look at the annotated slides, and it isn't really necessary to follow ascenders/descenders like in https://arkindex.teklia.com/element/4f28a97e-850d-4339-bf60-cdeb7d1b7d20?highlight=ee1d141c-3e54-4dbd-9484-6b284b17ed90. For HTR, what matters most is the "middle" part of the line, and whatever model we use later for text line detection will likely not create lines with shapes like this. Otherwise it all looks fine!

@martinteklia and @mlbonhomme between myself and Niki (one of my team) we have annotated the first 78 specimens in the Pinned Insects (NHMUK) dataset. We can annotate more when required.

Is it useful to you to indicate the language of label text? In the previously curated herbarium sheet dataset, language was one of the selection requirements. I noticed we had a couple of French labels (e.g., Specimen 010622079)

Do you need anything more from us on the pinned insects? Do you need more for initial testing on the other datasets (e.g. #3 and #7)?

Is it useful to you to indicate the language of label text? In the previously curated herbarium sheet dataset, language was one of the selection requirements. I noticed we had a couple of French labels (e.g., Specimen 010622079)

If it is possible, it would be much better to include additional language labels.

@martinteklia and @mlbonhomme between myself and Niki (one of my team) we have annotated the first 78 specimens in the Pinned Insects (NHMUK) dataset. We can annotate more when required.

Thanks! We'll try to train an initial model from the 78 examples, but most likely we'll need more annotated data to improve the model.

Is it useful to you to indicate the language of label text? In the previously curated herbarium sheet dataset, language was one of the selection requirements. I noticed we had a couple of French labels (e.g., Specimen 010622079)

Yes, indicating the language is useful, because it allows us to better analyze the errors. Maybe the model won't work well on French, because there are only a few examples of annotated data.
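To illustrate the kind of error analysis the language tags enable, here is a minimal sketch that computes a character error rate (CER) per language. The sample records and their layout are assumptions for illustration; only the 'Roanhead'/'Roauhead' pair comes from this thread.

```python
# Sketch: per-language character error rate over (truth, prediction) pairs.
from collections import defaultdict

def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Hypothetical evaluation records: language tag, ground truth, model output.
samples = [
    {"lang": "en", "truth": "Roanhead", "pred": "Roauhead"},
    {"lang": "fr", "truth": "Bords du Lez", "pred": "Bords du Lez"},
]

errors, chars = defaultdict(int), defaultdict(int)
for s in samples:
    errors[s["lang"]] += edit_distance(s["truth"], s["pred"])
    chars[s["lang"]] += len(s["truth"])

for lang in errors:
    print(lang, f"CER = {errors[lang] / chars[lang]:.2%}")
```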

Do you need anything more from us on the pinned insects? Do you need more for initial testing on the other datasets (e.g. #3 and #7)?

For the pinned insects, it should be ok for the initial training. On Herbarium and Microscope slides there are neither specimen nor label annotations, only text lines.

We're working on setting up LabelStudio for annotating the named entities of pinned insects. There's a possibility to split the data into tabs, so each annotator could choose their own tab and there would be no conflicts (two people annotating the same paragraph).

How many annotators will there be? (How many tabs should we create?) @llivermore

@martinteklia for the first 100 or so it will be three of us, but two of us are likely to do less. Are you proposing that each annotator gets assigned a fixed number of specimens or lines?
For example:

  • Person A Specimen (page) numbers 1-33
  • Person B Specimen (page) numbers 34-66
  • Person C Specimen (page) numbers 67-100

We are likely to have more annotators in the larger ~1,000 specimen/page dataset.

@llivermore Sorry for the delay.

In the end, instead of tabs there are 3 projects: one has 60% of the examples and the other two have 20% each. It's up to you to decide who will annotate which one.

Could you give me the emails of the annotators, so I could send the sign-up link?
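For concreteness, a small sketch of how a 60/20/20 assignment like the one above could be produced. The specimen IDs and the fixed seed are hypothetical; the actual project split was done in LabelStudio.

```python
# Illustration of a 60/20/20 split across three annotation projects.
import random

specimens = [f"specimen_{i:03d}" for i in range(1, 101)]
random.Random(0).shuffle(specimens)   # fixed seed for a reproducible split

project_a = specimens[:60]            # main project, 60% of examples
project_b = specimens[60:80]          # 20%
project_c = specimens[80:]            # 20%

print(len(project_a), len(project_b), len(project_c))  # 60 20 20
```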

@llivermore the annotation presentation is in the attachment

synthesys_annotation_guidelines.odp

@martinteklia we have finished the named entity tagging. I will have a few questions, and one of the digitisers has noticed some errors in the transcribed text.

From Pete Wing on Annotator 2 (I will need to make some decisions on some of these):

  • Task #3208 - unclear what 'G.' is, could also be 'b.', possibly for bred. I wouldn't say it relates to the location. Possibly 'other letters' could be a category?
  • Task #3215 - transcription error - should read 'Roanhead' not 'Roauhead'
  • Task #3216 - presence of Linnaeus after the taxonomy warrants an 'authority' category
  • Task #3217 - 'Dead in Road' and 'leg.' before Max as the collector, how do we categorise these? Dead in road could be 'notes' or similar but leg. really only denotes that the following name is a collector and doesn't work in the same way as det. for determination.
  • Task #3221 - transcription error - 'A.E.' not 'G.E.' Gibbs
  • Task #3224 - '*' denotes the specimen was bred, would that be an appropriate category?
  • Task #3225 - transcription error - the year is just 85, not 85'
  • Task #3227 - 'No.65828' is a collector/collection number and should probably have an 'other number' category - we've recorded this kind of thing under this heading in recent transcriptions
  • Task #3232 - transcription missing - 'Selys' also present and would be the authority
  • Task #3241 - 'Rehn & Rehn', also needs authority as a category
  • Task #3246 - this has two collector names present, so these should be treated separately as independent 'name' fields. This label similarly has 'Coll:', like Max's 'leg.', prior to the names, does collector need to be a category to encompass this?

@martinteklia A question both Pete and I had is, "can/should a single string/set of words have more than one category applied to it? E.g. a determination may encompass the taxonomic name, authority, determiner name and date of determination, so the main category for this data is determination but the constituent parts have their own, separate categories."

See example below where we have a collection/donation, indicated by the "from" prefix before the person's name "K. C. Liew".

Composite example

@llivermore The transcription errors will have to be fixed in Arkindex - it's not possible to do it in LabelStudio.
The errors must be fixed at text_line level. From the lines we will generate new paragraph-level transcriptions.
Those will be imported into LabelStudio again for NER annotation. The previous task with the incorrect transcription must be deleted first.
A link to the Arkindex page will be displayed in the LabelStudio task, to make it easier to find the element and fix the transcription there. (Unfortunately we're unable to make the link clickable.)
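As a rough illustration of the line-to-paragraph step described above: corrected line transcriptions are ordered and joined back into the paragraph-level text that gets re-imported for NER. The line records (text plus a bounding-box top coordinate) are an assumed shape, not the actual Arkindex export format; the texts echo examples from this thread.

```python
# Sketch: regenerate a paragraph-level transcription from corrected lines.
lines = [
    {"text": "Dead in Road", "top": 210},
    {"text": "leg. Max", "top": 240},
    {"text": "28.vi.1985", "top": 180},   # hypothetical date line
]

# Order lines top-to-bottom, then join into the paragraph text that would
# be re-imported into LabelStudio for NER annotation.
paragraph = "\n".join(l["text"] for l in sorted(lines, key=lambda l: l["top"]))
print(paragraph)
```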

@martinteklia A question both Pete and I had is, "can/should a single string/set of words have more than one category applied to it? E.g. a determination may encompass the taxonomic name, authority, determiner name and date of determination, so the main category for this data is determination but the constituent parts have their own, separate categories."

See example below where we have a collection/donation, indicated by the "from" prefix before the person's name "K. C. Liew".

The NER tool we use doesn't support nested entities (multiple categories). So, as we said earlier, for the MVP we won't support nested entities either. It can be a future development.
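For clarity, a tiny illustration of the flat-entity constraint: each span can carry at most one label, so a composite category and its constituent parts cannot both be kept. The label names and offsets below are hypothetical.

```python
# Illustration: flat NER schemes cannot represent overlapping spans, so the
# composite "determination" is dropped in favour of its constituent parts.
text = "Papilio machaon Linnaeus det. J. Smith 1985"   # hypothetical label text

nested = {
    "determination": (0, 43),   # the whole string
    "taxon": (0, 15),           # "Papilio machaon"
    "authority": (16, 24),      # "Linnaeus"
}

# Flat MVP scheme: keep only the finest-grained, non-overlapping entities.
flat = {k: v for k, v in nested.items() if k != "determination"}
print(flat)
```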

Task #3224 - '*' denotes the specimen was bred, would that be an appropriate category?

If that's always the case, it might be easier to have a process with custom logic like: if no taxon entity is found and the text contains '*', then it is bred.
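A minimal sketch of that rule, assuming entities arrive as a list of dicts with a "type" key (an assumption about the pipeline's data shape, for illustration only):

```python
# Sketch of the proposed custom rule: a specimen is flagged as bred when
# its label text contains "*" and no taxon entity was found.
def is_bred(text: str, entities: list[dict]) -> bool:
    has_taxon = any(e.get("type") == "taxon" for e in entities)
    return "*" in text and not has_taxon

print(is_bred("ex larva *", []))                        # True
print(is_bred("Papilio machaon", [{"type": "taxon"}]))  # False
```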

All nested entities have been removed.

  • Task #3208 - unclear what 'G.' is, could also be 'b.', possibly for bred. I wouldn't say it relates to the location. Possibly 'other letters' could be a category? [Apply custom logic?]
  • Task #3215 - transcription error - should read 'Roanhead' not 'Roauhead' [Corrected in AI]
  • Task #3216 - presence of Linnaeus after the taxonomy warrants an 'authority' category [Needs discussion]
  • Task #3217 - 'Dead in Road' and 'leg.' before Max as the collector, how do we categorise these? Dead in road could be 'notes' or similar but leg. really only denotes that the following name is a collector and doesn't work in the same way as det. for determination. [Needs discussion]
  • Task #3221 - transcription error - 'A.E.' not 'G.E.' Gibbs [Corrected in AI]
  • Task #3224 - '*' denotes the specimen was bred, would that be an appropriate category? [Apply custom logic]
  • Task #3225 - transcription error - the year is just 85, not 85'
  • Task #3227 - 'No.65828' is a collector/collection number and should probably have an 'other number' category - we've recorded this kind of thing under this heading in recent transcriptions [Classed as identifier - then apply custom logic]
  • Task #3232 - transcription missing - 'Selys' also present and would be the authority [Corrected in AI]
  • Task #3241 - 'Rehn & Rehn', also needs authority as a category [Apply custom logic?]
  • Task #3246 - this has two collector names present, so these should be treated separately as independent 'name' fields. This label similarly has 'Coll:', like Max's 'leg.', prior to the names, does collector need to be a category to encompass this? [Apply custom logic?]

@martinteklia The following need reloading from Arkindex:
https://arkindex.teklia.com/element/6ac2db87-6569-4290-8fd6-b2ee2fc6be7a
https://arkindex.teklia.com/element/03fe9029-3c3d-43bd-9ce8-b36e2cd94975
https://arkindex.teklia.com/element/00964e92-1f98-4ea7-91fd-40cdeec91b44
https://arkindex.teklia.com/element/162b73b0-ae36-45cf-94b9-1893f8768e4d
https://arkindex.teklia.com/element/8dfcfc38-33bd-41e6-89f7-92e0e5aa9926

The following need checking by Pete (and probably reloading):
https://arkindex.teklia.com/element/ec353ce8-a1d7-400a-817e-f79d54f17016
https://arkindex.teklia.com/element/162b73b0-ae36-45cf-94b9-1893f8768e4d
https://labelstudio.arkindex.org/projects/4/data?tab=94&task=3182

@martinteklia I have finished labelling all known entities - I think we are ready to train and evaluate! :)

Note for myself: the trained model doesn't seem to work very well with iCollections images. They have noisy backgrounds from pin holes but otherwise the labels are similar.

From Galaxy test: [screenshot]

Source specimen: BMNH(E)1851836