BlueBrain/Search

Try improving NER model performance by leveraging ontology-generated text

Closed this issue · 4 comments

Context

  • DKE has generated texts from the BRAIN_REGION ontology and Wikipedia. These sentences also come with annotations which should give us information on the entity type.
  • It is desirable that our NLP models become "ontology aware". We can do like the authors of KELM and enrich our training data with these texts generated from the ontology, so that the resulting NLP model will (hopefully) learn some of the information contained in the ontology.
  • We have two possibilities.
    1. Pre-train the backbone on a (self-supervised) language-model task using a literature database enriched with the generated texts.¹
    2. Train the head on a (supervised) NER task using the set of training samples annotated by an expert, enriched with the generated texts.²
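Approach 2 amounts to merging the expert-annotated samples and the ontology-generated sentences into a single spaCy training set. A minimal sketch is below; the annotation format (plain text plus character-offset spans) and the example sentences are assumptions for illustration, not the actual DKE output schema.

```python
# Sketch: build one spaCy training set (DocBin) from expert-annotated
# samples plus ontology-generated sentences (approach 2).
# The (text, [(start, end, label), ...]) format is an assumption.
import spacy
from spacy.tokens import DocBin

expert_samples = [
    ("The thalamus relays sensory signals.", [(4, 12, "BRAIN_REGION")]),
]
ontology_samples = [
    ("The midbrain is part of the brainstem.", [(4, 12, "BRAIN_REGION")]),
]

nlp = spacy.blank("en")
db = DocBin()
for text, spans in expert_samples + ontology_samples:
    doc = nlp.make_doc(text)
    ents = [doc.char_span(s, e, label=label) for s, e, label in spans]
    doc.ents = [e for e in ents if e is not None]  # drop misaligned spans
    db.add(doc)
db.to_disk("train_enriched.spacy")  # hypothetical output path
```

The resulting `.spacy` file can then be passed to `spacy train` as the training corpus in place of the non-enriched one.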

Actions

  • Since we have entity type annotations, it makes sense to try approach 2, as shown in the example below. Compare the performance curves obtained with this enriched training set against those obtained without enriching the training set with the generated sentences.
  • The generated sentences are about BRAIN_REGION, but they may also contain mentions of other entity types. We can detect those entities either with our NER model or by human inspection. Are these other entities annotated as well? Are there many of them?
  • If the generated sentences include many entities of other types without annotations, or simply because we enrich our training set with many unbalanced sentences (i.e. focusing exclusively on BRAIN_REGION), we may worsen the NER model's performance on the other entity types. Is this really the case? Is the performance for BRAIN_REGION at least any better?
  • Based on these results, decide whether using ontology-generated sentences can help us. If yes, then ask DKE to also generate sentences for the other entity types.
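Comparing the two training runs means computing precision, recall, and F1 per entity type, so that a drop on the non-BRAIN_REGION types is visible. A small self-contained sketch of that scoring, using exact span matching (an assumption; spaCy's own `Scorer` could be used instead):

```python
# Sketch: per-entity-type precision/recall/F1 from gold vs. predicted
# spans, to compare the enriched and non-enriched runs per type.
# Spans are (start, end, label) tuples; exact-match scoring is assumed.
from collections import defaultdict

def per_type_scores(gold, pred):
    """Return {label: (precision, recall, f1)} using exact span matching."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    gold_set, pred_set = set(gold), set(pred)
    for span in pred_set:
        (tp if span in gold_set else fp)[span[2]] += 1
    for span in gold_set - pred_set:
        fn[span[2]] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores

# Toy example with one correct and one spurious prediction
gold = [(4, 12, "BRAIN_REGION"), (20, 28, "CELL_TYPE")]
pred = [(4, 12, "BRAIN_REGION"), (30, 35, "CELL_TYPE")]
print(per_type_scores(gold, pred))
```

Running this per test document and per experiment (with/without enrichment) yields the per-type curves discussed below.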

Dependencies

Footnotes

  1. This is the approach described in the KELM paper.

  2. Unlike what is described in the KELM paper, we can take this approach because our generated sentences come with entity type annotations.

Here are the results of the first experiment using the ontology to improve NER model performance.

Context

  • A spaCy model (with the default config.cfg) is used
  • The same train/validation/test split is used across the different experiments:
    • 220/54/65 if the ontology is not integrated
    • 1681/54/65 if the ontology sentences are added to the training set
  • Different experiments are:
    • Ontology - Full: training with the addition of ontology sentences (the entire synthetic_text.json)
    • Ontology - One Sentence: training with the addition of ontology sentences (only the first sentence of every paragraph found in synthetic_text.json)
    • No Ontology: training without any ontology sentences added to the training set
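The "Ontology - One Sentence" variant needs a preprocessing step that keeps only the first sentence of each generated paragraph, together with the annotations that fall inside it. A sketch is below; the JSON structure (a list of `{"text": ..., "spans": [[start, end, label], ...]}` records) is an assumption about synthetic_text.json, not its actual schema.

```python
# Sketch: keep only the first sentence of each generated paragraph,
# dropping annotations whose spans fall outside it.
# The record format is an assumed stand-in for synthetic_text.json.

def first_sentence_samples(paragraphs):
    samples = []
    for para in paragraphs:
        text = para["text"]
        cut = text.find(". ")  # naive sentence boundary; assumption
        end = cut + 1 if cut != -1 else len(text)
        sentence = text[:end]
        spans = [(s, e, label) for s, e, label in para["spans"] if e <= end]
        samples.append({"text": sentence, "spans": spans})
    return samples

paragraphs = [{
    "text": "The midbrain is a brain region. It lies below the thalamus.",
    "spans": [[4, 12, "BRAIN_REGION"], [50, 58, "BRAIN_REGION"]],
}]
print(first_sentence_samples(paragraphs))
```

A real implementation would use a proper sentence segmenter (e.g. spaCy's `sentencizer`) instead of the naive `". "` split.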

F1-score

(image: F1-score curves)

Recall

(image: recall curves)

Precision

(image: precision curves)

Analysis

Currently, the trends are:

  • The addition of ontology sentences is not helping the BRAIN_REGION entity type that we are trying to enrich. It is also important to note that both precision and recall dropped.
  • The addition of ontology sentences makes the results for the other entity types slightly, but not significantly, worse (except for CELL_COMPARTMENT, where performance dropped).

Ontology analysis

(images: examples of ontology-generated sentences with annotations)

Here are some questions I have after looking more closely at these ontology sentences:

  • Does it make sense to have these entire paragraphs always showing the same entity and keeping the same structure? Shouldn't we at least split the sentences into separate samples?
  • I wonder whether those entities are really realistic (e.g. Midbrain, behavioral state related, Visceral area, layer XX, Cortical amygdalar area, posterior part, lateral zone, layers 1-2, ...). Do they appear in papers as they are referred to in the ontology? Shouldn't we at least remove the text after the first comma appearing in the entity?
  • (Low occurrence) Why is part of the first sentence not annotated at all in the second image?
  • To answer one of the questions in this ticket ("The generated sentences are about BRAIN_REGION but they may also contain mentions of other entity types."), I don't think synthetic_text.json contains entities from the other entity types of interest.
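The comma-truncation idea mentioned above is cheap to try. A minimal sketch, using entity names quoted from this ticket as examples:

```python
# Sketch: truncate ontology entity names at the first comma, so
# "Cortical amygdalar area, posterior part" becomes
# "Cortical amygdalar area".
def truncate_at_comma(name):
    return name.split(",", 1)[0].strip()

labels = [
    "Midbrain, behavioral state related",
    "Cortical amygdalar area, posterior part",
    "Thalamus",  # names without commas are left unchanged
]
print([truncate_at_comma(n) for n in labels])
```

If applied to the generated sentences themselves (not just the label list), the annotation spans would need to be shifted accordingly.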

One of the problems seems to be that sometimes the first part of the sentence is annotated as a BRAIN_REGION.
(image)
This is the case for 11 out of the 65 test paragraphs.

wiki_text.json (the second source provided by DKE) does not seem to help either.
(images)
Question: are all the brain regions correctly annotated in those texts?

  • Sentences from the Allen Brain ontology do not seem realistic enough to improve the model's performance (too many entities, i.e. very dense; unrealistic entity names (XXX, dorsal part, layer XX); ...)
  • Wiki texts are a bit better, but it feels like many annotations are missing.
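The "missing annotations" suspicion could be checked automatically by scanning the wiki texts for ontology labels that appear without an annotation. A sketch under assumed formats (text plus annotated character spans, and the ontology label list as a plain lexicon):

```python
# Sketch: flag brain-region mentions left unannotated in a text, using
# the ontology's label list as a lexicon. The span format is an assumed
# stand-in for the wiki_text.json annotations.
import re

def unannotated_mentions(text, spans, lexicon):
    annotated = {(s, e) for s, e, _ in spans}
    missing = []
    for term in lexicon:
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            if (m.start(), m.end()) not in annotated:
                missing.append((term, m.start()))
    return missing

text = "The thalamus projects to the cortex; the thalamus also gates input."
spans = [(4, 12, "BRAIN_REGION")]  # only the first mention is annotated
print(unannotated_mentions(text, spans, ["thalamus"]))
```

Exact string matching over multi-word ontology names will miss paraphrases, so this gives a lower bound on the number of missing annotations.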