vmenger/deduce

Can obtain nested tags

Closed this issue · 2 comments

If you input this text into Deduce:
'ADHD Adres: Naamlaan 100 Woonplaats: 3512AB Apeldoorn Tel: 088-1234567'
and run deduce.annotate_text, you will obtain:
'ADHD <PERSOON Adres: <LOCATIE Naamlaan 100> Woonplaats: 3512AB Apeldoorn Tel>: <TELEFOONNUMMER 088-1234567>'
which includes nested tags. Obviously there is a problem in that the entire string from "Adres" to "Tel" is detected as a person's name. However, the problem I'm pointing out here is a different one: having detected that span, Deduce then finds a LOCATIE tag within the PERSOON tag, so the final output contains nested tags, which should not be allowed.

This should be fixed easily by moving the flatten_text call, currently happening within the "names" deidentification, to the very end of the annotate_text method, right before returning the final text. Do you agree with this?
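To make the idea concrete, here is a minimal sketch of what a final flattening pass could look like. It is not Deduce's actual flatten_text: the function name flatten_nested_tags, the TAG_OPEN pattern, and the choice to keep only the outermost label are all assumptions for illustration. It also assumes tags always look like "<LABEL text>" and that no literal ">" occurs inside tagged text.

```python
import re

# Hypothetical pattern: an opening tag is "<", an uppercase label, then a space.
TAG_OPEN = re.compile(r"<[A-Z]+ ")


def flatten_nested_tags(text: str) -> str:
    """Keep only outermost tags; drop the markup of any tag nested inside one."""
    out = []
    depth = 0  # number of tags currently open
    i = 0
    while i < len(text):
        match = TAG_OPEN.match(text, i)
        if match:
            if depth == 0:
                out.append(match.group(0))  # keep the outermost opening tag
            depth += 1
            i = match.end()
        elif text[i] == ">" and depth > 0:
            depth -= 1
            if depth == 0:
                out.append(">")  # keep the outermost closing bracket
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


nested = (
    "ADHD <PERSOON Adres: <LOCATIE Naamlaan 100> Woonplaats: "
    "3512AB Apeldoorn Tel>: <TELEFOONNUMMER 088-1234567>"
)
print(flatten_nested_tags(nested))
```

Run as the last step before returning, a pass like this would turn the output above into a string with a single PERSOON tag and a single TELEFOONNUMMER tag. It does not address the false PERSOON match itself, only the nesting.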

Thanks, I'll let you know once I have found the time to understand exactly what is happening in this issue/PR.

The main question is why this:
"Adres: Naamlaan 100 Woonplaats: 3512AB Apeldoorn Tel"
gets annotated as a single PERSOON by annotate_names

However, once you accept that this is the case, what happens next is quite simple: "Naamlaan 100", which now sits inside the PERSOON annotation, is recognized as an address by a later step and annotated as well, so we end up with nested tags.
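The mechanism can be reproduced with a toy example. This is only a sketch and does not use Deduce's real annotators: annotate_street is a hypothetical second pass with a made-up regular expression, and max_tag_depth is a hypothetical helper that could serve as a regression check. Both assume tags of the form "<LABEL text>".

```python
import re

# Hypothetical pattern: an opening tag is "<", an uppercase label, then a space.
TAG_OPEN = re.compile(r"<[A-Z]+ ")


def annotate_street(text: str) -> str:
    """Toy second pass: tags '<Word>laan <number>' and ignores existing tags."""
    return re.sub(
        r"\b[A-Z][a-z]+laan \d+",
        lambda m: f"<LOCATIE {m.group(0)}>",
        text,
    )


def max_tag_depth(text: str) -> int:
    """Return the deepest tag nesting level; anything above 1 means nested tags."""
    depth = 0
    deepest = 0
    i = 0
    while i < len(text):
        match = TAG_OPEN.match(text, i)
        if match:
            depth += 1
            deepest = max(deepest, depth)
            i = match.end()
        else:
            if text[i] == ">" and depth > 0:
                depth -= 1
            i += 1
    return deepest


# State after a first pass has (wrongly) tagged the whole span as a person.
after_names = (
    "ADHD <PERSOON Adres: Naamlaan 100 Woonplaats: 3512AB Apeldoorn Tel>: 088-1234567"
)
after_streets = annotate_street(after_names)

print(after_streets)
print(max_tag_depth(after_names), max_tag_depth(after_streets))
```

Because the second pass runs on text that already contains tags and never checks whether a match lies inside one, the nesting depth goes from 1 to 2.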