pySBD - Python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detection module that works out-of-the-box.
This project is a direct port of the Ruby gem Pragmatic Segmenter, which provides rule-based sentence boundary detection.
Python
pip install pysbd
- Currently pySBD supports only the English language. Support for more languages will be released soon.
import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment(text))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
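To illustrate why a rule-based approach matters for text like the example above, here is a minimal standard-library sketch. It is not pySBD's implementation and does not use its rule set: `naive_split`, `rule_based_split`, and the tiny `ABBREVIATIONS` set are hypothetical names made up for this illustration. It contrasts splitting on every sentence-ending punctuation mark with a version that applies two simple rules (known abbreviations and single-letter initials).

```python
import re

# Hypothetical, deliberately tiny abbreviation list (illustration only;
# this is NOT the rule set pySBD ships with).
ABBREVIATIONS = {"p", "mr", "mrs", "dr"}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def rule_based_split(text):
    """Re-join naive fragments that end in an abbreviation or an initial."""
    sentences = []
    buffer = ""
    for fragment in naive_split(text):
        if not fragment:
            continue  # empty input yields one empty fragment
        buffer = f"{buffer} {fragment}".strip()
        last_word = buffer.split()[-1].rstrip(".!?").lower()
        is_abbreviation = last_word in ABBREVIATIONS
        is_initial = len(last_word) == 1 and last_word.isalpha()
        if buffer.endswith(".") and (is_abbreviation or is_initial):
            continue  # not a real boundary; keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


text = "My name is Jonas E. Smith. Please turn to p. 55."
print(naive_split(text))
# ['My name is Jonas E.', 'Smith.', 'Please turn to p.', '55.']
print(rule_based_split(text))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
```

Two hand-written rules are enough for this one sentence pair, but they would still wrongly merge a sentence that genuinely ends in a single letter. Handling such cases is the reason to use a tested rule set rather than an ad hoc one.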
- Use pysbd as a spaCy pipeline component (recommended). Please refer to the example pysbd_as_spacy_component.py.
- Use pysbd through entrypoints
import spacy
from pysbd.utils import PySBDFactory
nlp = spacy.blank('en')
# explicitly adding component to pipeline
# (recommended - makes it more readable to tell what's going on)
nlp.add_pipe(PySBDFactory(nlp))
# or you can use it implicitly with keyword
# pysbd = nlp.create_pipe('pysbd')
# nlp.add_pipe(pysbd)
doc = nlp('My name is Jonas E. Smith. Please turn to p. 55.')
print(list(doc.sents))
# [My name is Jonas E. Smith., Please turn to p. 55.]
If you find a text that is incorrectly segmented using pySBD, please submit an issue.
- Fork it ( https://github.com/nipunsadvilkar/pySBD/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
This project wouldn't be possible without the great work done by the Pragmatic Segmenter team.