Handle irregularities between pySBD & pySBD + spaCy sentence output
nipunsadvilkar opened this issue · 1 comments
The pySBD spaCy pipeline component uses a token-based approach: it sets is_sent_start to True or False on each token, depending on the Span objects obtained from pySBD character offsets. We create each Span with the doc.char_span method, using the offsets of a sentence slice doc.text[start:end]; the first Token of that span needs its is_sent_start attribute set to True. However, if the character indices don't map to a valid span, doc.char_span returns None and the sentence start cannot be marked. Hence we get irregularities between pySBD and pySBD + spaCy sentence output.
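To make the failure mode concrete, here is a minimal pure-Python sketch of strict character-to-token alignment. It is only a simplified model of the behaviour described above, not spaCy's implementation: the function name strict_char_span, the hand-written token offsets, and the example text are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Offsets = List[Tuple[int, int]]  # (start_char, end_char) for each token


def strict_char_span(token_offsets: Offsets, start: int, end: int) -> Optional[Tuple[int, int]]:
    """Map a character range to a token range, or return None if the
    range does not begin and end exactly on token boundaries."""
    starts = {s: i for i, (s, _) in enumerate(token_offsets)}
    ends = {e: i for i, (_, e) in enumerate(token_offsets)}
    if start not in starts or end not in ends:
        return None  # misaligned offsets: no valid span
    first, last = starts[start], ends[end]
    if last < first:
        return None
    return first, last + 1  # token indices, end-exclusive


# Hypothetical tokenization of "Hello world. Bye."
tokens = [(0, 5), (6, 11), (11, 12), (13, 16), (16, 17)]

aligned = strict_char_span(tokens, 0, 12)     # "Hello world."
misaligned = strict_char_span(tokens, 0, 13)  # "Hello world. " with trailing space
```

In this model the second range ends on whitespace rather than on a token boundary, so no span can be built and the first token of that sentence would never be flagged.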
The inability to get a Span object from pySBD character offsets can be tackled by deconstructing the Doc object, the way the authors of PKSHATechnology-Research/camphr have written get_doc_char_span, which uses destruct_token.
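As a rough illustration of a more forgiving alignment, the sketch below snaps each sentence's character range to the tokens it overlaps and flags the first of them as a sentence start. This is a different, simpler technique than camphr's get_doc_char_span and is not taken from that project; the names expand_to_tokens and sentence_starts, along with the token offsets, are hypothetical. Recent spaCy releases also accept an alignment_mode argument on Doc.char_span, which may cover similar cases.

```python
from typing import List, Optional, Tuple

Offsets = List[Tuple[int, int]]  # (start_char, end_char) for each token


def expand_to_tokens(token_offsets: Offsets, start: int, end: int) -> Optional[Tuple[int, int]]:
    """Return the end-exclusive token range overlapping [start, end), or None."""
    covered = [i for i, (s, e) in enumerate(token_offsets) if s < end and e > start]
    if not covered:
        return None
    return covered[0], covered[-1] + 1


def sentence_starts(token_offsets: Offsets, sentence_spans: Offsets) -> List[bool]:
    """Compute an is_sent_start-style flag for every token."""
    flags = [False] * len(token_offsets)
    for start, end in sentence_spans:
        span = expand_to_tokens(token_offsets, start, end)
        if span is not None:
            flags[span[0]] = True  # first token of the sentence
    return flags


# Hypothetical tokenization of "Hello world. Bye." and sentence offsets
# where the first sentence includes its trailing space.
tokens = [(0, 5), (6, 11), (11, 12), (13, 16), (16, 17)]
flags = sentence_starts(tokens, [(0, 13), (13, 17)])
```

The trade-off is that snapping can silently absorb a token that only partly belongs to a sentence, so a stricter approach may still be preferable when offsets cut through tokens.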
Fixed #63