BlueBrain/Search

Try improving NER model performance by leveraging ontology-generated text

Closed this issue · 4 comments

Context

  • DKE has generated texts from the BRAIN_REGION ontology and Wikipedia. These sentences also come with annotations which should give us information on the entity type.
  • It is desirable that our NLP models become "ontology aware". We can do like the authors of KELM and enrich our training data with these texts generated from the ontology, so that the resulting NLP model will (hopefully) learn some of the information contained in the ontology.
  • We have two possibilities.
    1. Pre-train the backbone on a (self-supervised) language-model task using a literature database enriched with the generated texts.¹
    2. Train the head on a (supervised) NER task using the set of training samples annotated by an expert, enriched with the generated texts.²
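Approach 2 amounts to merging the expert-annotated samples and the ontology-generated sentences into a single spaCy training set. A minimal sketch is below; the annotation format (plain text plus character-offset spans) and the example sentences are assumptions for illustration, not the actual DKE output schema.

```python
# Sketch: build one spaCy training set (DocBin) from expert-annotated
# samples plus ontology-generated sentences (approach 2).
# The (text, [(start, end, label), ...]) format is an assumption.
import spacy
from spacy.tokens import DocBin

expert_samples = [
    ("The thalamus relays sensory signals.", [(4, 12, "BRAIN_REGION")]),
]
ontology_samples = [
    ("The midbrain is part of the brainstem.", [(4, 12, "BRAIN_REGION")]),
]

nlp = spacy.blank("en")
db = DocBin()
for text, spans in expert_samples + ontology_samples:
    doc = nlp.make_doc(text)
    ents = [doc.char_span(s, e, label=label) for s, e, label in spans]
    doc.ents = [e for e in ents if e is not None]  # drop misaligned spans
    db.add(doc)
db.to_disk("train_enriched.spacy")  # hypothetical output path
```

The resulting `.spacy` file can then be passed to `spacy train` as the training corpus in place of the non-enriched one.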

Actions

  • Since we have entity type annotations, it makes sense to try approach 2, as shown in the example below. Compare the performance curves obtained with this enriched training set against those obtained without enriching the training set with the generated sentences.
  • The generated sentences are about BRAIN_REGION, but they may also contain mentions of other entity types. We can detect those entities either with our NER model or by human inspection. Are these other entities annotated as well? Are there many of them?
  • If the generated sentences include many entities of other types without annotations, or simply because we enrich our training set with many unbalanced sentences (i.e. focusing exclusively on BRAIN_REGION), we may worsen the NER model's performance on the other entity types. Is this really the case? Is the performance for BRAIN_REGION at least any better?
  • Based on these results, decide whether using ontology-generated sentences can help us. If yes, then ask DKE to also generate sentences for the other entity types.
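Comparing the two training runs means computing precision, recall, and F1 per entity type, so that a drop on the non-BRAIN_REGION types is visible. A small self-contained sketch of that scoring, using exact span matching (an assumption; spaCy's own `Scorer` could be used instead):

```python
# Sketch: per-entity-type precision/recall/F1 from gold vs. predicted
# spans, to compare the enriched and non-enriched runs per type.
# Spans are (start, end, label) tuples; exact-match scoring is assumed.
from collections import defaultdict

def per_type_scores(gold, pred):
    """Return {label: (precision, recall, f1)} using exact span matching."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    gold_set, pred_set = set(gold), set(pred)
    for span in pred_set:
        (tp if span in gold_set else fp)[span[2]] += 1
    for span in gold_set - pred_set:
        fn[span[2]] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores

# Toy example with one correct and one spurious prediction
gold = [(4, 12, "BRAIN_REGION"), (20, 28, "CELL_TYPE")]
pred = [(4, 12, "BRAIN_REGION"), (30, 35, "CELL_TYPE")]
print(per_type_scores(gold, pred))
```

Running this per test document and per experiment (with/without enrichment) yields the per-type curves discussed below.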

Dependencies

Footnotes

  1. This is the approach described in the KELM paper.

  2. Unlike what is described in the KELM paper, we can take this approach because our generated sentences come with entity type annotations.

Here are the results of the first experiment using the ontology to improve NER model performance.

Context

  • A spaCy model (with the default config.cfg) is used
  • The same train/validation/test split is used across the different experiments:
    • 220/54/65 if the ontology is not integrated
    • 1681/54/65 if the ontology sentences are added to the training set
  • Different experiments are:
    • Ontology - Full: training with the addition of ontology sentences (the entire synthetic_text.json)
    • Ontology - One Sentence: training with the addition of ontology sentences (only the first sentence of every paragraph found in synthetic_text.json)
    • No Ontology: training without any ontology sentences added to the training set
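The "Ontology - One Sentence" variant needs a preprocessing step that keeps only the first sentence of each generated paragraph, together with the annotations that fall inside it. A sketch is below; the JSON structure (a list of `{"text": ..., "spans": [[start, end, label], ...]}` records) is an assumption about synthetic_text.json, not its actual schema.

```python
# Sketch: keep only the first sentence of each generated paragraph,
# dropping annotations whose spans fall outside it.
# The record format is an assumed stand-in for synthetic_text.json.

def first_sentence_samples(paragraphs):
    samples = []
    for para in paragraphs:
        text = para["text"]
        cut = text.find(". ")  # naive sentence boundary; assumption
        end = cut + 1 if cut != -1 else len(text)
        sentence = text[:end]
        spans = [(s, e, label) for s, e, label in para["spans"] if e <= end]
        samples.append({"text": sentence, "spans": spans})
    return samples

paragraphs = [{
    "text": "The midbrain is a brain region. It lies below the thalamus.",
    "spans": [[4, 12, "BRAIN_REGION"], [50, 58, "BRAIN_REGION"]],
}]
print(first_sentence_samples(paragraphs))
```

A real implementation would use a proper sentence segmenter (e.g. spaCy's `sentencizer`) instead of the naive `". "` split.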

F1-score

(image: F1-score curves)

Recall

(image: recall curves)

Precision

(image: precision curves)

Analysis

Currently, the trends are:

  • The addition of ontology sentences is not helping the BRAIN_REGION entity type that we are trying to enrich. It is also important to note that both precision and recall dropped.
  • The addition of ontology sentences makes the results for the other entity types slightly, but not significantly, worse (except for CELL_COMPARTMENT, where performance dropped).

Ontology analysis

(images: examples of ontology-generated sentences with annotations)

Here are some questions I have after looking more closely at these ontology sentences:

  • Does it make sense to have these entire paragraphs always showing the same entity and keeping the same structure? Shouldn't we at least split the sentences into separate samples?
  • I wonder whether those entities are really realistic (e.g. Midbrain, behavioral state related, Visceral area, layer XX, Cortical amygdalar area, posterior part, lateral zone, layers 1-2, ...). Do they appear in papers as they are referred to in the ontology? Shouldn't we at least remove the text after the first comma appearing in the entity?
  • (Low occurrence) Why is part of the first sentence not annotated at all in the second image?
  • To answer one of the questions in this ticket ("The generated sentences are about BRAIN_REGION but they may also contain mentions of other entity types."), I don't think synthetic_text.json contains entities from the other entity types of interest.
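The comma-truncation idea mentioned above is cheap to try. A minimal sketch, using entity names quoted from this ticket as examples:

```python
# Sketch: truncate ontology entity names at the first comma, so
# "Cortical amygdalar area, posterior part" becomes
# "Cortical amygdalar area".
def truncate_at_comma(name):
    return name.split(",", 1)[0].strip()

labels = [
    "Midbrain, behavioral state related",
    "Cortical amygdalar area, posterior part",
    "Thalamus",  # names without commas are left unchanged
]
print([truncate_at_comma(n) for n in labels])
```

If applied to the generated sentences themselves (not just the label list), the annotation spans would need to be shifted accordingly.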

One of the problems seems to be that sometimes the first part of the sentence is annotated as a BRAIN_REGION.
(image)
This is the case for 11 out of the 65 test paragraphs.

wiki_text.json (the second source provided by DKE) does not seem to help either.
(images)
Question: are all the brain regions correctly annotated in those texts?

  • Sentences from the Allen Brain ontology do not seem realistic enough to improve the model's performance (too many entities, i.e. very dense; unrealistic entity names (XXX, dorsal part, layer XX); ...)
  • Wiki texts are a bit better, but it feels like many annotations are missing.
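The "missing annotations" suspicion could be checked automatically by scanning the wiki texts for ontology labels that appear without an annotation. A sketch under assumed formats (text plus annotated character spans, and the ontology label list as a plain lexicon):

```python
# Sketch: flag brain-region mentions left unannotated in a text, using
# the ontology's label list as a lexicon. The span format is an assumed
# stand-in for the wiki_text.json annotations.
import re

def unannotated_mentions(text, spans, lexicon):
    annotated = {(s, e) for s, e, _ in spans}
    missing = []
    for term in lexicon:
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            if (m.start(), m.end()) not in annotated:
                missing.append((term, m.start()))
    return missing

text = "The thalamus projects to the cortex; the thalamus also gates input."
spans = [(4, 12, "BRAIN_REGION")]  # only the first mention is annotated
print(unannotated_mentions(text, spans, ["thalamus"]))
```

Exact string matching over multi-word ontology names will miss paraphrases, so this gives a lower bound on the number of missing annotations.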