BlueBrain/Search

Try improving NER model performance by using publicly available NER datasets


Context

  • In #605 we tried to improve our NER model by leveraging the NER-annotated sentences generated from the Ontology.
  • However, that didn't work because the quality of the annotations was too poor.
  • Instead, we can consider using one of the publicly available NER datasets with biological entity types.
  • For instance (but not limited to), we can look at the corpora used by SciSpaCy to train their models: CRAFT, JNLPBA, BC5CDR, BIONLP13CG.

Actions

  • Find publicly available annotated NER datasets that cover some of the entity types we also want.
  • Think about how to handle the fact that some of those datasets may not have annotations for all our entity types, while at the same time they may have annotations for entity types we do not care about.
    • use label -100 to mask tokens so the torch loss function ignores them? (see the sketch after this list)
    • train one NER model per entity type? but then how to resolve conflicts?
  • See if learning curves show higher performance (e.g. better intercept + same slope) than what we got in #601 and #602. To have comparable results, we could do the following (?):
    • train on 1/8 of our data + all external datasets
    • train on 2/8 of our data + all external datasets
    • train on 4/8 of our data + all external datasets
    • train on 8/8 of our data + all external datasets
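To make the -100 masking idea above concrete, here is a minimal sketch (toy tensors, not our actual model or data) showing that torch's cross-entropy loss skips tokens labelled -100 by default:

```python
import torch
from torch.nn import CrossEntropyLoss

# Toy logits for 4 tokens and 3 labels (e.g. O, B-GENE, I-GENE).
logits = torch.randn(4, 3)

# Tokens labelled -100 are ignored by the loss: only tokens 0 and 2 contribute,
# the other two are masked out (e.g. entity types the external dataset does not annotate).
labels = torch.tensor([1, -100, 0, -100])

loss_fn = CrossEntropyLoss()  # ignore_index defaults to -100
loss = loss_fn(logits, labels)
print(loss)
```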

We tried to summarize different approaches in a sketch
Screenshot 2022-07-13 at 14 45 30

Here are some examples
Screenshot 2022-07-13 at 14 48 26

It would be good to hear your thoughts.

BIONLP13CG has 16 entity types:
AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE

We tried a simple 1:1 correspondence between our entity types and the entity types of the model (a sketch of the remapping follows the table):

Our entity type     Model entity type
GENE                GENE_OR_GENE_PRODUCT
CELL_TYPE           CELL
BRAIN_REGION        ANATOMICAL_SYSTEM
CELL_COMPARTMENT    CELLULAR_COMPONENT
ORGANISM            ORGANISM
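A minimal sketch of how such a remapping could be applied to the model's predictions (names like `LABEL_MAP` and `remap` are illustrative, not our actual code; it assumes IOB2 string tags like "B-GENE_OR_GENE_PRODUCT"):

```python
# Hypothetical mapping from the model's entity types to ours.
LABEL_MAP = {
    "GENE_OR_GENE_PRODUCT": "GENE",
    "CELL": "CELL_TYPE",
    "ANATOMICAL_SYSTEM": "BRAIN_REGION",
    "CELLULAR_COMPONENT": "CELL_COMPARTMENT",
    "ORGANISM": "ORGANISM",
}

def remap(tag: str) -> str:
    """Map a predicted IOB2 tag to our schema; unknown entity types become O."""
    if tag == "O":
        return "O"
    prefix, _, entity = tag.partition("-")  # e.g. "B", "GENE_OR_GENE_PRODUCT"
    mapped = LABEL_MAP.get(entity)
    return f"{prefix}-{mapped}" if mapped else "O"

print(remap("B-GENE_OR_GENE_PRODUCT"))  # B-GENE
print(remap("I-SIMPLE_CHEMICAL"))       # O (entity type we do not use)
```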

Here are the results we obtained without any fine-tuning:

                  precision  recall  f1-score  support
BRAIN_REGION           0.23    0.18      0.20      345
CELL_COMPARTMENT       0.17    0.45      0.25      177
CELL_TYPE              0.28    0.48      0.35      677
GENE                   0.55    0.67      0.60     1469
ORGANISM               0.28    0.46      0.34      279

Note: I tried to find the raw dataset (in NER format), but it does not seem to be easily available.

We experimented with the M1 approach (replacing O in partially annotated datasets with IGNORE; see the sketch below the list) and used the following external datasets.

We took a random stratified split of our fully-annotated dataset. See below for the definition of each of the datasets/models:

  • internal - only trained on train samples of our internal fully-annotated dataset
  • external_2 - train samples of our internal fully-annotated dataset + bc2gm + jnlpba
  • external_3 - train samples of our internal fully-annotated dataset + bc2gm + jnlpba + species_800
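For reference, a minimal sketch of the M1 relabeling (illustrative names, not our actual code): in an external dataset that only annotates, say, GENE, an O token might in fact be an unannotated entity of another type, so we replace O with the ignore index instead of treating it as a true negative.

```python
IGNORE = -100

def m1_relabel(label_ids, o_id):
    """Replace the id of the O label with IGNORE for a partially annotated sample."""
    return [IGNORE if lid == o_id else lid for lid in label_ids]

# Example with O=0, B-GENE=1, I-GENE=2
print(m1_relabel([0, 1, 2, 0, 0], o_id=0))  # [-100, 1, 2, -100, -100]
```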

Test set performance
Screenshot 2022-07-22 at 09 42 15

Train set (internal) performance

Screenshot 2022-07-22 at 09 22 18

Tricky points/issues

  • It seems like overfitting the internal training set is actually not a terrible strategy for getting good results on the internal test set. IMO this suggests that it is really hard to draw any conclusions about generalization.
  • The M1 scheme effectively introduces a huge class imbalance.
  • Our internal fully-annotated dataset (~200 training samples) is tiny compared to the external ones (50,000+ samples). We did not assign bigger sample weights to our internal samples, and IMO the model might not care about them that much during training (see the sketch below).
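One possible way to address the last point would be to oversample the internal samples when building the merged training set. A minimal sketch (hypothetical helper, assuming samples are plain Python lists; not our actual pipeline):

```python
import random

def merge_with_oversampling(internal, external, target_ratio=0.5, seed=0):
    """Oversample the ~200 internal samples so they make up roughly `target_ratio`
    of the merged training set, instead of being drowned out by the ~50,000
    external samples."""
    rng = random.Random(seed)
    n_internal = int(target_ratio * len(external) / (1 - target_ratio))
    oversampled = [rng.choice(internal) for _ in range(n_internal)]
    merged = external + oversampled
    rng.shuffle(merged)
    return merged
```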

Discussed during 26-07 meeting
TO DOs:

  • k-fold on the internal dataset and compute means over the experiments (see the sketch after this list)
  • Train on the external (partially annotated) dataset and then "fine-tune" on the internally (fully annotated) datasets
  • (Less important) Training phase 1 with M2 approach
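For the first TO DO, a possible sketch of the k-fold loop (here `train_and_eval` is a placeholder for our actual training/evaluation routine and is assumed to return a test F1 score):

```python
from statistics import mean
from sklearn.model_selection import KFold

def cross_validate(samples, train_and_eval, n_splits=5, seed=0):
    """Run k-fold CV on the internal dataset and average the per-fold F1."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kfold.split(samples):
        train = [samples[i] for i in train_idx]
        test = [samples[i] for i in test_idx]
        scores.append(train_and_eval(train, test))
    return mean(scores), scores
```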

K-fold cross-validation with 5 folds

F1 score

Test

Screenshot 2022-08-02 at 09 38 32

Train

Screenshot 2022-08-02 at 09 38 51

Planning 2022-08-02

  • Look at results of train + eval (k-fold cross-validation) after #607 fixes the annotations in the "ground truth"
  • Try to "pre-train" on the external (partially annotated) NER dataset and then "fine-tune" on the internal (fully annotated) NER datasets (see the sketch below)
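A rough sketch of the pre-train/fine-tune recipe with the Hugging Face Trainer (dataset objects, collator and epoch counts are placeholders, not our actual configuration):

```python
from transformers import Trainer, TrainingArguments

def pretrain_then_finetune(model, external_ds, internal_ds, data_collator, output_dir="ner_seq"):
    """Two-stage recipe: the same weights are first trained on the external
    (partially annotated) data, then fine-tuned on the internal (fully annotated) data."""
    stage1 = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"{output_dir}/stage1", num_train_epochs=3),
        train_dataset=external_ds,
        data_collator=data_collator,
    )
    stage1.train()

    stage2 = Trainer(
        model=model,  # same weights, now fine-tuned on the internal data
        args=TrainingArguments(output_dir=f"{output_dir}/stage2", num_train_epochs=10),
        train_dataset=internal_ds,
        data_collator=data_collator,
    )
    stage2.train()
    return model
```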

K-fold cross-validation with 5 folds using the original (not corrected) annotations.

external data = bc2gm_corpus and jnlpba

  • internal - only trained on fully annotated data
  • external_simul - fully annotated data + external data were concatenated and the network was trained on this merged dataset
    • The reason why the performance is worse than in the previous post is that this time the validation set consisted of both external and internal data (before, it was just the internal data)
  • external_seq - we first trained on the external data and then trained on the fully annotated internal data (sequential logic)

F1 score

Test

Screenshot 2022-08-12 at 11 19 05

Train

Screenshot 2022-08-12 at 11 23 05

K-fold cross-validation with 5 folds using the corrected annotations. The rest is the same as above.

F1-score

Train

Screenshot 2022-08-15 at 13 35 51

Test

Screenshot 2022-08-15 at 13 35 43

Update 2022-08-16

  • Based on the results shown in #608 (comment), it seems that merging the partially annotated NER samples with the fully annotated ones from BBP gives bad results. Possibly, this is because the partially annotated samples vastly outnumber the fully annotated, high-quality ones.
  • Based on the results shown in #608 (comment), pre-training on the partially annotated NER samples does not decrease the accuracy of the final NER model, but neither does it significantly improve it.

Decision

  • For the time being, it does not seem like we can leverage (partially annotated) publicly available NER datasets to improve the performance of our NER models.