Try improving NER model performance by leveraging ontology-generated text
FrancescoCasalegno commented
Context
- DKE has generated texts from the `BRAIN_REGION` ontology and Wikipedia. These sentences also come with annotations which should give us information on the entity type.
- It is desirable that our NLP models become "ontology aware". We can follow the authors of KELM and enrich our training data with these texts generated from the ontology, so that the resulting NLP model will (hopefully) learn some of the information contained in the ontology.
- We have two possibilities.
Actions
- Since we have entity type annotations, it makes sense to try approach 2. Compare the performance curves obtained with this enriched training set against those obtained without enriching the training set with these generated sentences (see the sketch after this list).
- The generated sentences are about `BRAIN_REGION`, but they may also contain mentions of other entity types. We can detect those entities either using our NER model or by human inspection. Are these other entities also annotated or not? Are there many of them?
- If the generated sentences include many entities of other types without annotations, or even just because we enrich our training set with many unbalanced sentences (i.e. sentences exclusively focused on `BRAIN_REGION`), we may worsen the performance of the NER model on the other entity types. Is this really the case? Is at least the performance for `BRAIN_REGION` any better?
- Based on these results, decide whether using ontology-generated sentences can help us. If yes, ask DKE to also generate sentences for the other entity types.
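As a concrete starting point, here is a minimal sketch (not the actual pipeline) of how the generated sentences could be converted into spaCy training docs and scanned for mentions of other entity types. The JSON schema, output file name, and model path are assumptions, not the actual DKE format:

```python
import json

import spacy
from spacy.tokens import DocBin

# Assumed record format -- the real schema of the DKE output may differ:
# [{"text": "...", "entities": [[start, end, "BRAIN_REGION"], ...]}, ...]
with open("synthetic_text.json") as f:
    records = json.load(f)

nlp = spacy.blank("en")
db = DocBin()
for rec in records:
    doc = nlp.make_doc(rec["text"])
    spans = [
        doc.char_span(start, end, label=label)
        for start, end, label in rec["entities"]
    ]
    # char_span() returns None when offsets do not align with token
    # boundaries; drop those annotations instead of crashing.
    doc.ents = [span for span in spans if span is not None]
    db.add(doc)

# Merge this file with the existing training DocBin before `spacy train`,
# then compare the resulting curves against the non-enriched baseline.
db.to_disk("ontology_train.spacy")

# To check for mentions of other entity types, run the current NER model
# over the generated texts (the model path here is a placeholder):
ner = spacy.load("path/to/current_ner_model")
other_entities = [
    (ent.text, ent.label_)
    for rec in records
    for ent in ner(rec["text"]).ents
    if ent.label_ != "BRAIN_REGION"
]
print(f"{len(other_entities)} non-BRAIN_REGION mentions detected")
```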
Dependencies
- Requires #604
EmilieDel commented
Here are the results of the first experiment using the ontology to improve NER model performance.
Context
- Spacy model (using default `config.cfg`) is used
- The same train/validation/test split is used across the different experiments:
  - 220/54/65 if the ontology is not integrated
  - 1681/54/65 if the ontology is added into the training set
- The different experiments are:
  - `Ontology - Full`: training with the addition of ontology sentences (the entire `synthetic_text.json`)
  - `Ontology - One Sentence`: training with the addition of ontology sentences (only the first sentence of every paragraph found in `synthetic_text.json`; a sketch of this variant follows the list)
  - `No Ontology`: training without any addition of ontology sentences into the training set
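For reference, the `Ontology - One Sentence` variant could be built along these lines. This is a sketch assuming the same hypothetical `synthetic_text.json` schema as in the issue description (one record per generated paragraph, with character-offset annotations):

```python
import json

import spacy

# Schema assumption: [{"text": ..., "entities": [[start, end, label], ...]}]
with open("synthetic_text.json") as f:
    records = json.load(f)

# A blank pipeline with only the rule-based sentencizer is enough here.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

one_sentence_records = []
for rec in records:
    doc = nlp(rec["text"])
    first = next(doc.sents)  # first sentence of the paragraph
    # Keep only the annotations that fall entirely inside that sentence.
    kept = [
        [start, end, label]
        for start, end, label in rec["entities"]
        if start >= first.start_char and end <= first.end_char
    ]
    one_sentence_records.append({"text": first.text, "entities": kept})
```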
[Plots: F1-score, Recall, and Precision per entity type for each of the three experiments]
Analysis
Currently, the trends are:
- The addition of the ontology sentences is not helping the `BRAIN_REGION` entity type that we are trying to enrich. It is also important to notice that both precision and recall dropped.
- The addition of ontology sentences makes the results for the other entity types slightly, but not significantly, worse (except for `CELL_COMPARTMENT`, where performance dropped).
Ontology analysis
Here are some questions I have after looking more closely at these ontology sentences:
- Does it make sense to have those entire paragraphs always showing the same entity and keeping the same structure? Shouldn't we at least split the sentences into different samples? (See the sketch after this list.)
- I am wondering if those entities are really realistic (e.g. `Midbrain, behavioral state related`, `Visceral area, layer XX`, `Cortical amygdalar area, posterior part, lateral zone, layers 1-2`, ...). Do they appear in papers as they are referred to in the ontology? Shouldn't we at least remove the text after the first comma appearing in the entity?
- (Low occurrence) Why is part of the first sentence not annotated at all in the second image?
- To answer one of the questions of this ticket ("The generated sentences are about BRAIN_REGION but they may contain also mentions of other entity types."), I don't think `synthetic_text.json` contains any entities from the other entity types of interest.
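A rough sketch of the two fixes suggested above: splitting each generated paragraph into per-sentence samples, and trimming ontology labels at the first comma. The record schema and offset handling are assumptions:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def split_into_samples(text, entities):
    """Turn one generated paragraph into one sample per sentence,
    re-basing the character offsets so each sample stands alone."""
    samples = []
    for sent in nlp(text).sents:
        kept = [
            [start - sent.start_char, end - sent.start_char, label]
            for start, end, label in entities
            if start >= sent.start_char and end <= sent.end_char
        ]
        samples.append({"text": sent.text, "entities": kept})
    return samples

def trim_entity_name(name: str) -> str:
    """Drop the qualifiers after the first comma of an ontology label,
    e.g. 'Visceral area, layer XX' -> 'Visceral area'."""
    return name.split(",", 1)[0].strip()

print(trim_entity_name("Midbrain, behavioral state related"))  # Midbrain
print(trim_entity_name(
    "Cortical amygdalar area, posterior part, lateral zone, layers 1-2"
))  # Cortical amygdalar area
```

Whether the trimmed names still match valid ontology concepts would need to be checked with DKE; the split-by-sentence step at least removes the repetitive paragraph structure noted above.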
EmilieDel commented
- Sentences from the Allen Brain ontology do not seem realistic enough to improve the performance of the model (too many entities - very dense; unrealistic entity names (`XXX, dorsal part, layer XX`); ...).
- Wiki texts are a bit better, but it feels like a lot of annotations are missing.