Complete training set for NER by collecting more samples

Question

Closed this issue 2 years ago · 2 comments

Actions

Based on the learning curves of #602, determine for each entity type the training set support needed to reach decent performance (~60% F1 score).
Using that estimate, determine how many more training samples are needed for each entity type.
Collect paragraphs to annotate so that each entity type reaches the required support. To find candidates, we can use the predictions of our current best NER model or run regex searches over our literature database.
Ask an expert to annotate the sentences, verify that we are close to the target supports, and re-compute the learning curves.
Confirm that our NER models reach a good enough F1 score and that we don't need more NER annotations.
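
The first two actions could be sketched roughly as follows. This is only an illustration, not the method used in #602: it assumes each learning curve is available as a list of (support, F1) points, and it reads off the required support by linear interpolation without extrapolating beyond the measured points. The function names, the entity type labels, and all numbers below are hypothetical.

```python
import math
from typing import Optional, Sequence

TARGET_F1 = 0.60  # the "~60% F1" target mentioned above


def support_for_target(
    supports: Sequence[float],
    f1_scores: Sequence[float],
    target: float = TARGET_F1,
) -> Optional[float]:
    """Estimate the support at which a learning curve first reaches `target`.

    Uses linear interpolation between measured points. Returns None when the
    curve never reaches the target (no extrapolation is attempted).
    """
    points = sorted(zip(supports, f1_scores))
    if not points:
        return None
    if points[0][1] >= target:
        # Already at or above the target with the smallest measured support.
        return float(points[0][0])
    for (s0, f0), (s1, f1) in zip(points, points[1:]):
        if f0 < target <= f1:
            return s0 + (target - f0) * (s1 - s0) / (f1 - f0)
    return None


def samples_still_needed(required: Optional[float], current: int) -> Optional[int]:
    """Number of extra annotated samples needed, or None if unknown."""
    if required is None:
        return None
    return max(0, math.ceil(required) - current)


if __name__ == "__main__":
    # Hypothetical curves and current supports, for illustration only.
    curves = {
        "ENTITY_A": ([50, 100, 200, 400], [0.30, 0.45, 0.58, 0.66]),
        "ENTITY_B": ([20, 40, 80], [0.10, 0.22, 0.35]),
    }
    current_support = {"ENTITY_A": 180, "ENTITY_B": 80}

    for entity, (supports, scores) in curves.items():
        required = support_for_target(supports, scores)
        missing = samples_still_needed(required, current_support[entity])
        if missing is None:
            print(f"{entity}: target not reached on the measured curve")
        else:
            print(f"{entity}: about {missing} more samples needed")
```

An entity type whose curve never reaches the target returns `None` rather than an extrapolated number, which flags it as needing a new learning curve after the next annotation round.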

Answer 1 · 2022-08-16T08:01:02.000Z

Let's give priority to completing a first version of the pipeline, and therefore let's first collect annotations for RE.
This means that this issue is blocked by #606.

Answer 2 · 2022-08-16T08:45:42.000Z

Closing; this issue is now being merged with #611.