nipunsadvilkar/pySBD

Handle irregularities between pySBD & pySBD + spaCy sentence output

nipunsadvilkar opened this issue · 1 comment

The pySBD spaCy pipeline component takes a token-based approach: it sets is_sent_start to True or False on each token, based on the Spans obtained from pySBD's character offsets. For each sentence we build a Span object with doc.char_span(start, end), where doc.text[start:end] is the sentence text. The first Token of that Span must have its is_sent_start attribute set to True. However, if the character indices do not map to a valid span (that is, they do not fall on token boundaries), doc.char_span returns None and no sentence start can be set for that sentence. Hence we get irregularities between pySBD and pySBD + spaCy sentence output.
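To make the failure mode concrete, here is a minimal pure-Python sketch of the strict alignment rule described above. It does not call spaCy; strict_char_span, the sample text, and the tokenization are hypothetical and only model the behaviour of returning None when an offset does not coincide with a token boundary.

```python
from typing import List, Optional, Tuple

Offsets = List[Tuple[int, int]]


def strict_char_span(token_offsets: Offsets, start: int, end: int) -> Optional[Tuple[int, int]]:
    """Return (first_token_index, last_token_index_exclusive) for the
    character range [start, end), or None if either offset does not
    coincide with a token boundary."""
    token_starts = {s: i for i, (s, _) in enumerate(token_offsets)}
    token_ends = {e: i for i, (_, e) in enumerate(token_offsets)}
    if start not in token_starts or end not in token_ends:
        return None
    return token_starts[start], token_ends[end] + 1


text = "Hello.World is here."
# Hypothetical tokenization that keeps "Hello.World" as a single token.
token_offsets = [(0, 11), (12, 14), (15, 19), (19, 20)]

# Hypothetical sentence offsets from a segmenter: "Hello." and "World is here."
sentence_spans = [(0, 6), (6, 20)]

for start, end in sentence_spans:
    # Both sentences cut through the "Hello.World" token, so neither aligns.
    print(repr(text[start:end]), strict_char_span(token_offsets, start, end))
```

In this sketch both sentence spans come back as None because offset 6 falls inside a token, so neither sentence would get a first token to mark as a sentence start.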

The inability to get a Span object from pySBD character offsets can be tackled by deconstructing the Doc object, the way the authors of PKSHATechnology-Research/camphr do in get_doc_char_span, which uses destruct_token.
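One way to picture the idea is sketched below in pure Python, under the assumption that "deconstruction" means splitting any token that straddles a sentence boundary so the boundary becomes a token boundary. This is not camphr's actual implementation; split_tokens_at and sentence_start_flags are hypothetical helpers, and in spaCy the split itself would presumably go through the retokenizer.

```python
from typing import List, Tuple

Offsets = List[Tuple[int, int]]


def split_tokens_at(token_offsets: Offsets, boundaries: List[int]) -> Offsets:
    """Split every token that strictly contains a boundary offset, so that
    each boundary coincides with a token boundary afterwards."""
    cuts = set(boundaries)
    result: Offsets = []
    for start, end in token_offsets:
        inner = sorted(c for c in cuts if start < c < end)
        pieces = [start, *inner, end]
        result.extend(zip(pieces, pieces[1:]))
    return result


def sentence_start_flags(token_offsets: Offsets, sentence_spans: Offsets) -> List[bool]:
    """Model is_sent_start: True for tokens that begin a sentence span."""
    sentence_starts = {start for start, _ in sentence_spans}
    return [start in sentence_starts for start, _ in token_offsets]


# Hypothetical tokenization of "Hello.World is here." with "Hello.World" as one token.
token_offsets = [(0, 11), (12, 14), (15, 19), (19, 20)]
# Hypothetical sentence offsets: "Hello." and "World is here."
sentence_spans = [(0, 6), (6, 20)]

boundaries = [offset for span in sentence_spans for offset in span]
aligned = split_tokens_at(token_offsets, boundaries)
flags = sentence_start_flags(aligned, sentence_spans)
print(aligned)
print(flags)
```

After the split, every sentence offset lands on a token boundary, so each sentence has a first token that can be flagged and the token-level output can match the character-level segmentation.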