Complete training set for NER by collecting more samples
Closed this issue · 2 comments
FrancescoCasalegno commented
Actions
- Based on the learning curves from #602, determine for each entity type the training set support needed to reach decent performance (~60% F1 score).
- Based on the previous estimate of the training set support, determine how many more training samples are needed for each entity type.
- Try to collect paragraphs to annotate so that each entity type reaches the required support. To find candidates, we can search our literature database using the predictions of our current best NER model or using regexes.
- Ask an expert to annotate the sentences, verify that we are close to the target supports, and re-compute the learning curves.
- Confirm that our NER models reach an F1 score that is good enough and that we don't need more NER annotations.
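
The first two actions above could be sketched roughly as follows. This is only an illustration, not the method used in #602: it assumes each learning curve is available as a list of `(support, f1)` points and that F1 grows roughly linearly in the log of the support, which is a modelling assumption. The function names, the entity type names, and the numbers in the example are all hypothetical.

```python
import math


def fit_log_linear(points):
    """Least-squares fit of f1 = a + b * ln(support) (assumed curve shape)."""
    xs = [math.log(n) for n, _ in points]
    ys = [f1 for _, f1 in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def required_support(points, target_f1=0.60):
    """Estimate the support at which the fitted curve reaches target_f1."""
    a, b = fit_log_linear(points)
    if b <= 0:
        # Flat or decreasing curve: under this model more data will not help.
        return None
    return math.ceil(math.exp((target_f1 - a) / b))


def additional_samples(points, target_f1=0.60):
    """How many more samples are needed beyond the largest support seen."""
    needed = required_support(points, target_f1)
    if needed is None:
        return None
    current = max(n for n, _ in points)
    return max(0, needed - current)


if __name__ == "__main__":
    # Hypothetical learning-curve points: (training set support, F1 score).
    curves = {
        "ENTITY_TYPE_A": [(25, 0.20), (50, 0.31), (100, 0.40)],
        "ENTITY_TYPE_B": [(50, 0.45), (100, 0.52), (200, 0.61)],
    }
    for entity_type, points in curves.items():
        print(entity_type, required_support(points), additional_samples(points))
```

Extrapolating far beyond the observed supports is unreliable, so these numbers are best treated as rough targets to be checked again once the learning curves are re-computed.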
Dependencies
- Requires #602.
FrancescoCasalegno commented
Update 2022-08-16
- Let's give priority to completing a first version of the pipeline, and therefore let's first collect annotations for RE.
- This means that this issue is blocked by #606.
FrancescoCasalegno commented
Closing, now merging this Issue with #611