Sentencizer splits codes into separate sentences even though they are a single token
etienneguevel opened this issue · 1 comments
etienneguevel commented
Description
Currently, the sentencizer starts a new sentence whenever a "." character is followed by a capital letter.
This is problematic for codes and acronyms that follow this pattern (for example "V.I.H"), because they get split across several sentences.
The ADICAP codes analysed by the eds.adicap pipeline can appear in text in the form "code ADICAP : B.H.HP.A7A0", in which case the underlying eds.contextual-matcher will not capture the code.
A solution would be to start a new sentence only when a "." is followed by a separator (space, newline, or similar) and then a capital letter.
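To illustrate the proposed rule, here is a minimal standalone sketch using a regular expression. It is not how eds.sentences is implemented; the names BOUNDARY and split_sentences are hypothetical, and the uppercase character class (ASCII plus Latin-1 accented capitals) is an assumption made for French text.

```python
import re

# Hypothetical boundary rule: a period, then at least one whitespace
# character (space, tab, newline), then an uppercase letter.
# The lookbehind/lookahead keep the period and the capital in the output.
BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-ZÀ-ÖØ-Þ])")


def split_sentences(text: str) -> list[str]:
    """Split text at boundaries matching the proposed rule."""
    return [part for part in BOUNDARY.split(text) if part]


# No whitespace follows the periods, so the code stays in one sentence.
print(split_sentences("code ADICAP : B.H.HP.A7A0"))
# → ['code ADICAP : B.H.HP.A7A0']

# A period followed by a space and a capital still ends the sentence.
print(split_sentences("Le patient va bien. Il sort demain."))
# → ['Le patient va bien.', 'Il sort demain.']
```

With this rule, dotted codes and acronyms such as "V.I.H" are never split, while ordinary sentence endings are still detected. The trade-off is that sentences written without a space after the period would no longer be separated.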
How to reproduce the bug
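import spacy

# Blank EDS-NLP pipeline with the normalizer and the sentence splitter
nlp = spacy.blank("eds")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")

code = "B.H.HP.A7A0"
# Each printed line is one detected sentence; the code should yield exactly one
for sent in nlp(code).sents:
    print(sent.text)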
B.
H.
HP.
A7A0
Your Environment
- Operating System: Ubuntu 22.04.1 LTS
- Python Version Used: 3.10.6
- spaCy Version Used: 3.4.1
- EDS-NLP Version Used: 0.7.4
- Environment Information: