Cleaning the text before segmentation
databill86 opened this issue · 1 comment
First, thank you for this great tool!
Describe the bug
Can't pass `clean=True` to clean the text before segmentation when using the `PySBDFactory` class. As indicated, `char_span` should be `False` when `clean=True`, but `PySBDFactory(nlp, language='es', clean=True, char_span=False)` is not working.
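For reference, cleaning does work when pysbd is called directly rather than through the spaCy component; a minimal sketch, assuming the documented `Segmenter` keyword arguments (`language`, `clean`, `char_span`):

```python
import pysbd

# Cleaning outside spaCy: with char_span=False the segmenter returns plain
# strings, and clean=True cleans the text (e.g. stray newlines) before segmenting.
seg = pysbd.Segmenter(language='es', clean=True, char_span=False)
for sent in seg.segment("1- mi primera oración\nii- mi segunda oración"):
    print(sent)
```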
To Reproduce
pysbd==0.3.0rc0
python 3.7

```python
import spacy
from pysbd.utils import PySBDFactory

nlp = spacy.blank('es')
nlp.add_pipe(PySBDFactory(nlp, language='es', clean=True, char_span=False))

text4 = """
1- mi primera oración
ii- mi segunda oración
yo. mi tercera oración
"""
doc = nlp(text4)
```
This is the complete traceback:
```
AttributeError                            Traceback (most recent call last)
<ipython-input-216-0b0fdebddef6> in <module>
----> 1 doc = nlp(text4)

~\anaconda3\envs....\lib\site-packages\spacy\language.py in __call__(self, text, disable, component_cfg)
    437             if not hasattr(proc, "__call__"):
    438                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 439             doc = proc(doc, **component_cfg.get(name, {}))
    440             if doc is None:
    441                 raise ValueError(Errors.E005.format(name=name))

~\anaconda3\envs\......\lib\site-packages\pysbd\utils.py in __call__(self, doc)
     78         sents_char_spans = self.seg.segment(doc.text_with_ws)
---> 79         start_token_ids = [sent.start for sent in sents_char_spans]
     80         for token in doc:
     81             token.is_sent_start = (True if token.idx
     82                                    in start_token_ids else False)

~\anaconda3\envs\.....\lib\site-packages\pysbd\utils.py in <listcomp>(.0)
     78         sents_char_spans = self.seg.segment(doc.text_with_ws)
---> 79         start_token_ids = [sent.start for sent in sents_char_spans]
     80         for token in doc:
     81             token.is_sent_start = (True if token.idx
     82                                    in start_token_ids else False)

AttributeError: 'str' object has no attribute 'start'
```
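For context, the error seems to come from `segment()` returning plain strings when `char_span=False`, so `sent.start` inside the factory has nothing to read; a small sketch of the two return types, assuming pysbd's documented behaviour:

```python
import pysbd

text = "Mi primera oración. Mi segunda oración."

# char_span=False -> list of plain strings; a str has no .start attribute,
# which matches the AttributeError above.
print(pysbd.Segmenter(language='es', char_span=False).segment(text))

# char_span=True -> TextSpan objects with .sent, .start and .end,
# which is what start_token_ids = [sent.start ...] expects.
for span in pysbd.Segmenter(language='es', char_span=True).segment(text):
    print(span.sent, span.start, span.end)
```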
Additional context
I have two other questions:

1- In `class Processor(object)`, is there any reason why `nlp` is not a class attribute? `nlp = spacy.blank('en')` (line 10) sits outside the class, especially given that there is a `lang` attribute in there (`self.lang = lang`).
2- Could you please explain why, in numbered lists like:

```python
text4 = """1- mi primera oración
ii- mi segunda oración
i. mi tercera oración
"""
```

the last item is split into two sentences, unlike the first two?

Result:

```
---- sentence : 1- mi primera oración
---- sentence : ii- mi segunda oración
---- sentence : i.
---- sentence : mi tercera oración
```
Thank you.
Hey @databill86, thanks for pointing out the `PySBDFactory` doc bug in your 1st point. If you are using `PySBDFactory`, then only the `language` parameter should be available, and `char_span` should be `True` by default and immutable. The reason is that spaCy's `doc.sents` works by setting `tok.is_sent_start` to `True`/`False`, where those tokens are the few falling on the sentence-start boundary character indices obtained from pysbd.

`clean` strictly has to be `False`, since with `clean=True` the `original_text != cleaned_text`, so the character indices from pysbd would no longer line up with the spaCy `Doc`.

Will push the required changes in scripts and docs in the upcoming release.
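In the meantime the component can be used with just the `language` parameter and the sentences read back through `doc.sents`; a sketch following the README-style usage (assuming the default `clean`/`char_span` values):

```python
import spacy
from pysbd.utils import PySBDFactory

nlp = spacy.blank('es')
# Only pass the language; clean/char_span are meant to be handled internally.
nlp.add_pipe(PySBDFactory(nlp, language='es'))

doc = nlp("1- mi primera oración\nii- mi segunda oración")
for sent in doc.sents:
    print(sent.text)
```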
`nlp = spacy.blank('en')` on line 10 is on purpose, since I only need a spaCy `Doc` object in return (irrespective of the language), which is the collection of tokens upon which the `tok.is_sent_start` attribute needs to be set.
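Roughly what the factory does internally, reconstructed from the traceback above (assuming `char_span=True`, so `TextSpan` objects carrying `.start` come back):

```python
import pysbd
import spacy

nlp = spacy.blank('en')  # only used to get a Doc, i.e. a collection of tokens
doc = nlp("My name is Jonas E. Smith. Please turn to p. 55.")

seg = pysbd.Segmenter(language='en', clean=False, char_span=True)
sents_char_spans = seg.segment(doc.text_with_ws)
start_char_ids = [span.start for span in sents_char_spans]

# A token starts a sentence when its character offset matches one of the
# sentence-start indices reported by pysbd.
for token in doc:
    token.is_sent_start = token.idx in start_char_ids

print(list(doc.sents))
```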
In the case of list items, pysbd is performing as expected with roman-numbered lists: roman-numbered items would be separated out.
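The split can be reproduced with the `Segmenter` directly; a sketch (same engine as the spaCy component, so it should match the output reported above):

```python
import pysbd

text = """1- mi primera oración
ii- mi segunda oración
i. mi tercera oración
"""

seg = pysbd.Segmenter(language='es', clean=False)
for sent in seg.segment(text):
    print('---- sentence :', sent)
# Per the report above, "i." comes out as its own sentence, separate from
# "mi tercera oración", since the roman-numeral marker is treated as a
# list-item boundary.
```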