BlueBrain/Search

Complete training set for NER by collecting more samples

Closed this issue · 2 comments

Actions

  • Based on the learning curves from #602, determine for each entity type the training set support needed to reach a decent performance (~60% F1 score).
  • From that estimate of the required support, determine how many more training samples are needed for each entity type.
  • Collect paragraphs to annotate so that each entity type reaches its required support. To find candidates in our literature database, we can use the predictions of our current best NER model, or a regex search.
  • Ask an expert to annotate the sentences, verify that we are close to the target supports, and re-compute the learning curves.
  • Confirm that our NER models reach a good enough F1 score and that we don't need more NER annotations.
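
The first two actions can be sketched as follows. This is only a rough illustration, not the method used in #602: it assumes each learning curve is a list of (support, F1) points and that F1 grows roughly like `a + b * ln(support)`, which is a modelling assumption. The function names and the entity types and numbers in the usage example are hypothetical placeholders.

```python
import math


def fit_log_curve(points):
    """Least-squares fit of f1 = a + b * ln(support) to (support, f1) points."""
    xs = [math.log(support) for support, _ in points]
    ys = [f1 for _, f1 in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def required_support(points, target_f1=0.60):
    """Estimate the support needed to reach target_f1, or None if no estimate."""
    a, b = fit_log_curve(points)
    if b <= 0:
        # Flat or decreasing curve: more samples alone would not help.
        return None
    return math.ceil(math.exp((target_f1 - a) / b))


def samples_to_collect(curves, current_support, target_f1=0.60):
    """Map each entity type to the number of extra samples to annotate."""
    missing = {}
    for entity_type, points in curves.items():
        needed = required_support(points, target_f1)
        if needed is None:
            missing[entity_type] = None
        else:
            missing[entity_type] = max(0, needed - current_support[entity_type])
    return missing


if __name__ == "__main__":
    # Placeholder learning curves, NOT real results from #602.
    curves = {
        "ENTITY_A": [(50, 0.30), (100, 0.38), (200, 0.45)],
        "ENTITY_B": [(50, 0.50), (100, 0.58), (200, 0.66)],
    }
    current_support = {"ENTITY_A": 200, "ENTITY_B": 200}
    print(samples_to_collect(curves, current_support))
```

Extrapolating far beyond the largest observed support is unreliable, so the output is best read as a rough target to re-check once the learning curves are re-computed.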

Dependencies

Update 2022-08-16

  • Let's prioritize completing a first version of the pipeline, and therefore collect annotations for RE first.
  • This means that this issue is blocked by #606.

Closing and merging this issue with #611.