Try improving NER model performance by using publicly available NER datasets
Context
- In #605 we tried to improve our NER model by leveraging the NER-annotated sentences generated from the Ontology.
- However, that didn't work because the quality of the annotations was too poor.
- Instead, we could use one of the publicly available NER datasets with biological entity types.
- For instance (but not limited to), we can look at the corpora used by SciSpaCy to train their models: CRAFT, JNLPBA, BC5CDR, BIONLP13CG.
Actions
- Find publicly available annotated NER datasets that cover some of the entity types we also want.
- Think about how to handle the fact that some of those datasets may not have annotations for all our entity types, while at the same time they may have annotations for entity types we do not care about.
  - use label `-100` to mask tokens so the `torch` loss function ignores them? (see the sketch at the end of this comment)
  - train one NER model per entity type? but then how to resolve conflicts?
- See if learning curves show higher performance (e.g. better intercept + same slope) than what we got in #601 and #602. To have comparable results, we could do the following (?):
- train on 1/8 of our data + all external datasets
- train on 2/8 of our data + all external datasets
- train on 4/8 of our data + all external datasets
- train on 8/8 of our data + all external datasets
- List of relevant datasets: https://corposaurus.github.io/corpora/
- Literature on approaches to handling partially annotated datasets in NER:
  - They basically treat the problem as multi-label classification (each entity type gets a separate binary classifier): arXiv, ResearchGate
  - More theoretical; they assume that we have a CRF model (which we don't): https://arxiv.org/abs/2005.00502
We tried to summarize the different approaches in a sketch.
It would be good to hear your thoughts.
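To make the `-100` idea above concrete, here is a minimal, self-contained illustration (dummy tensors, not our real label set) of how `torch`'s cross-entropy loss skips tokens labelled `-100` by default:

```python
import torch
from torch.nn import CrossEntropyLoss

# dummy example: 1 sentence, 4 tokens, 3 possible labels
logits = torch.randn(1, 4, 3)
labels = torch.tensor([[0, 2, -100, 1]])  # third token is masked out

loss_fn = CrossEntropyLoss()  # ignore_index defaults to -100
loss = loss_fn(logits.view(-1, 3), labels.view(-1))
# the masked token contributes nothing to the loss or its gradient
```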
`bionlp13cg` has 16 entity types: AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE.
We tried a simple 1:1 correspondence between our entity types and the entity types of the model:
Our entity type | Model entity type |
---|---|
GENE | GENE_OR_GENE_PRODUCT |
CELL_TYPE | CELL |
BRAIN_REGION | ANATOMICAL_SYSTEM |
CELL_COMPARTMENT | CELLULAR_COMPONENT |
ORGANISM | ORGANISM |
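A minimal sketch of how such a 1:1 mapping could be applied to the model's predictions, assuming the off-the-shelf SciSpaCy model `en_ner_bionlp13cg_md` is installed (variable and function names are illustrative):

```python
import spacy

# mapping from the model's entity types to ours (from the table above)
LABEL_MAP = {
    "GENE_OR_GENE_PRODUCT": "GENE",
    "CELL": "CELL_TYPE",
    "ANATOMICAL_SYSTEM": "BRAIN_REGION",
    "CELLULAR_COMPONENT": "CELL_COMPARTMENT",
    "ORGANISM": "ORGANISM",
}

nlp = spacy.load("en_ner_bionlp13cg_md")

def predict_entities(text):
    """Run the off-the-shelf model and keep only entities we can map to our types."""
    doc = nlp(text)
    return [
        (ent.text, ent.start_char, ent.end_char, LABEL_MAP[ent.label_])
        for ent in doc.ents
        if ent.label_ in LABEL_MAP
    ]
```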
Here are the results we obtain without any fine-tuning:
 | precision | recall | f1-score | support |
---|---|---|---|---|
BRAIN_REGION | 0.23 | 0.18 | 0.20 | 345 |
CELL_COMPARTMENT | 0.17 | 0.45 | 0.25 | 177 |
CELL_TYPE | 0.28 | 0.48 | 0.35 | 677 |
GENE | 0.55 | 0.67 | 0.60 | 1469 |
ORGANISM | 0.28 | 0.46 | 0.34 | 279 |
Note: I tried to find the raw dataset (in NER format), but it seems hard to find.
We experimented with the M1 approach (replacing `O` in partially annotated datasets with `IGNORE`) and used the following external datasets:
- https://huggingface.co/datasets/bc2gm_corpus - `GENE`
- https://huggingface.co/datasets/jnlpba - `CELL_TYPE` + `GENE`
- https://huggingface.co/datasets/species_800 - `ORGANISM`
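A sketch of the M1 relabelling, under the assumption that labels are already integer-encoded and aligned to tokens (the label id and helper name are hypothetical):

```python
IGNORE = -100  # ignored by torch.nn.CrossEntropyLoss by default

def apply_m1(labels, o_id):
    """In a partially annotated corpus an `O` token may in fact be an entity of a
    type that corpus does not annotate, so we mask it instead of treating it as `O`."""
    return [IGNORE if label == o_id else label for label in labels]

# e.g. for bc2gm_corpus (GENE only), every `O` becomes IGNORE,
# while the B-GENE / I-GENE labels are kept as-is
```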
We took a random stratified split of our fully-annotated dataset. See below the definition of each of the datasets/models:
- `internal` - only trained on `train` samples of our internal fully-annotated dataset
- `external_2` - `train` samples of our internal fully-annotated dataset + `bc2gm` + `jnlpba`
- `external_3` - `train` samples of our internal fully-annotated dataset + `bc2gm` + `jnlpba` + `species_800`
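Roughly, the three training sets could be assembled like this (hypothetical variable names; the external corpora are assumed to be already converted to our label space with the M1 masking applied):

```python
from datasets import concatenate_datasets

# "internal": only our fully annotated train split
train_internal = internal_ds["train"]

# "external_2": internal train split + bc2gm + jnlpba
train_external_2 = concatenate_datasets([train_internal, bc2gm_m1, jnlpba_m1])

# "external_3": internal train split + bc2gm + jnlpba + species_800
train_external_3 = concatenate_datasets(
    [train_internal, bc2gm_m1, jnlpba_m1, species_800_m1]
)
```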
Train set (internal) performance
Tricky points/issues
- It seems like overfitting the internal training set is actually not a terrible strategy for getting good results on the internal test set. This IMO suggests that it is really hard to draw any conclusions about generalization.
- The M1 scheme effectively introduces a huge class imbalance.
- Our internal fully-annotated dataset (~200 training samples) is tiny compared to the external ones (50,000+ samples). We did not assign bigger sample weights to our internal samples, and IMO the model might not care about them that much during training.
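For completeness, one possible (untried) way to counteract that imbalance would be to upsample the internal samples before concatenation; a sketch with hypothetical names and values:

```python
from datasets import concatenate_datasets

UPSAMPLE_FACTOR = 50  # hypothetical value, would need tuning

# repeat the internal train split so it is not drowned out by the external corpora
train_upsampled = concatenate_datasets(
    [internal_train] * UPSAMPLE_FACTOR + [external_train]
).shuffle(seed=0)
```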
Discussed during 26-07 meeting
TO DOs:
- k-fold on the internal dataset and compute means of experiments
- Train on the external (partially annotated) dataset and then "fine-tune" on the internal (fully annotated) dataset
- (Less important) Training phase 1 with M2 approach
Planning 2022-08-02
- Look at results of train + eval (k-fold cross-validation) after #607 fixes the annotations in the "ground truth"
- Try to "pre-train" on the external (partially annotated) NER dataset and then "fine-tune" on the internally (fully annotated) NER datasets
K-fold cross-validation with 5 folds using the original (not corrected) annotations.
external data = `bc2gm_corpus` and `jnlpba`
- `internal` - only trained on fully annotated data
- `external_simul` - fully annotated data + external data were concatenated and the network was trained on this dataset
  - The reason why the performance is worse than what was shown in the previous post is that this time the validation set consisted of both external data and internal data (before it was just the internal data)
- `external_seq` - we first trained on external data and then trained on fully annotated internal data (sequential logic)
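The 5-fold protocol could look roughly like this (hypothetical names; `evaluate_variants` stands in for training and evaluating the three variants above on a given fold):

```python
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kfold.split(list(range(len(internal_ds)))):
    train_fold = internal_ds.select(train_idx)  # internal training samples for this fold
    test_fold = internal_ds.select(test_idx)    # held-out internal samples
    # train each variant (internal / external_simul / external_seq) and evaluate it
    fold_scores.append(evaluate_variants(train_fold, test_fold))
# report the mean of each metric over the 5 folds
```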
[F1-score plots on the test and train sets]
Update 2022-08-16
- Based on the results shown in #608 (comment), it seems that merging the partially annotated NER samples with the fully annotated ones from BBP gives bad results. Possibly, this is because the partially annotated samples outnumber the fully annotated, high-quality ones.
- Based on the results shown in #608 (comment), pre-training on the partially annotated NER samples does not decrease the accuracy of the final NER model, but neither does it significantly improve it.
Decision
- For the time being, it does not seem like we can leverage (partially annotated) publicly available NER datasets to improve the performance of our NER models.