Cleaning the text before segmentation

Question

databill86 opened this issue 5 years ago · 1 comments

First, thank you for this great tool!

Describe the bug
I can't pass clean=True to clean the text before segmentation when using the PySBDFactory class.
As indicated, char_span should be False when clean=True, but this

PySBDFactory(nlp, language='es', clean=True, char_span=False)

is not working.

To Reproduce
pysbd==0.3.0rc0
python 3.7

import spacy
from pysbd.utils import PySBDFactory

nlp = spacy.blank('es')
nlp.add_pipe(PySBDFactory(nlp, language='es', clean=True, char_span=False))

text4="""
1- mi primera oración
ii- mi segunda oración
yo. mi tercera oración
"""
doc = nlp(text4)

This is the complete traceback:

AttributeError                            Traceback (most recent call last)
<ipython-input-216-0b0fdebddef6> in <module>
----> 1 doc = nlp(text4)

~\anaconda3\envs....\lib\site-packages\spacy\language.py in __call__(self, text, disable, component_cfg)
    437             if not hasattr(proc, "__call__"):
    438                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 439             doc = proc(doc, **component_cfg.get(name, {}))
    440             if doc is None:
    441                 raise ValueError(Errors.E005.format(name=name))

~\anaconda3\envs\......\lib\site-packages\pysbd\utils.py in __call__(self, doc)
     78         sents_char_spans = self.seg.segment(doc.text_with_ws)
     79         start_token_ids = [sent.start for sent in sents_char_spans]
---> 80         for token in doc:
     81             token.is_sent_start = (True if token.idx
     82                                    in start_token_ids else False)

~\anaconda3\envs\.....\lib\site-packages\pysbd\utils.py in <listcomp>(.0)
     78         sents_char_spans = self.seg.segment(doc.text_with_ws)
     79         start_token_ids = [sent.start for sent in sents_char_spans]
---> 80         for token in doc:
     81             token.is_sent_start = (True if token.idx
     82                                    in start_token_ids else False)

AttributeError: 'str' object has no attribute 'start'
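
For anyone reading the traceback, the failure can be reproduced without pysbd or spaCy. The sketch below is only an illustration: Span and sentence_starts are hypothetical stand-ins, not pysbd's actual classes or functions. It shows that reading .start works when the segments are span-like objects carrying character offsets, and fails with the same AttributeError when the segments are plain strings.

```python
from collections import namedtuple

# Hypothetical stand-in for a span object that carries character offsets.
Span = namedtuple("Span", ["sent", "start", "end"])


def sentence_starts(segments):
    """Collect the start offset of every segment.

    Raises AttributeError when the segments are plain strings,
    because str has no 'start' attribute.
    """
    return [seg.start for seg in segments]


spans = [Span("Hola.", 0, 5), Span("Adiós.", 6, 12)]
strings = ["Hola.", "Adiós."]

# Span-like segments expose their offsets.
starts_from_spans = sentence_starts(spans)

# Plain strings do not, so the same list comprehension fails.
try:
    sentence_starts(strings)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(starts_from_spans)
print(error_message)
```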

Additional context

I have two other questions:

1- In the class Processor(object), is there any reason why nlp is not a class attribute?

nlp = spacy.blank('en') (line 10) is defined outside the class, even though the class already has a lang attribute:
self.lang = lang

2- Could you please explain why, in numbered lists like:

text4="""1- mi primera oración
ii- mi segunda oración
i. mi tercera oración
"""
the last item is not kept as a single sentence the way the first two are:

Result:

---- sentence :  1- mi primera oración
---- sentence :  ii- mi segunda oración
---- sentence :  i.
---- sentence :  mi tercera oración

Thank you.

Answer 1 · 2020-06-11T17:53:02.000Z

Hey @databill86, thanks for pointing out the PySBDFactory doc bug in your first point. When using PySBDFactory, language should be the only available parameter, and char_span should default to True and be immutable. The reason is that spaCy's doc.sents works by setting tok.is_sent_start to True/False, where the tokens set to True are the ones that fall on the sentence-start character indices obtained from pysbd.
clean must strictly be set to False, because with clean=True the cleaned text no longer matches the original text (original_text != cleaned_text), so the character indices would not line up with the tokens.
I will push the required changes to the scripts and docs in the upcoming release.
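
To make the offset argument concrete, here is a minimal sketch in plain Python. It does not use pysbd or spaCy: tokenize_with_offsets and mark_sentence_starts are hypothetical helpers, and the "cleaning" step is just a whitespace collapse standing in for whatever a real cleaner does. It only illustrates why sentence-start offsets computed on a modified text stop matching the token offsets of the original text.

```python
import re


def tokenize_with_offsets(text):
    """Whitespace tokenizer returning (token, start_offset) pairs."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def mark_sentence_starts(text, sentence_start_offsets):
    """Flag each token whose start offset coincides with a sentence start."""
    starts = set(sentence_start_offsets)
    return [(tok, idx in starts) for tok, idx in tokenize_with_offsets(text)]


original = "Hola  mundo.   Adiós mundo."

# Assumed stand-in for cleaning: collapse runs of whitespace.
cleaned = re.sub(r"\s+", " ", original)

# Sentence-start offsets measured on each version of the text.
starts_in_original = [0, original.index("Adiós")]
starts_in_cleaned = [0, cleaned.index("Adiós")]

# Offsets from the original text line up with the original tokens.
aligned = mark_sentence_starts(original, starts_in_original)

# Offsets from the cleaned text do not: the second sentence start is lost.
misaligned = mark_sentence_starts(original, starts_in_cleaned)

print(aligned)
print(misaligned)
```

With offsets taken from the original text, both "Hola" and "Adiós" are flagged as sentence starts. With offsets taken from the cleaned text, only "Hola" is flagged, which is the kind of mismatch that makes clean=True incompatible with setting is_sent_start by character index.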

nlp = spacy.blank('en') on line 10 is intentional: I only need a spaCy Doc object in return (irrespective of the language), which is a collection of tokens on which the tok.is_sent_start attribute is set.

In the case of list items, pysbd is performing as expected with a Roman-numbered list. Roman-numbered items would be separated