Word and sentence tokenization in Python.
Use this package to split up strings according to sentence and word boundaries. For instance, to simply break up strings into tokens:
tokenize("Joey was a great sailor.")
#=> ["Joey ", "was ", "a ", "great ", "sailor ", "."]
To also detect sentence boundaries:
sent_tokenize("Cat sat mat. Cat's named Cool.", keep_whitespace=True)
#=> [["Cat ", "sat ", "mat", ". "], ["Cat ", "'s ", "named ", "Cool", "."]]
sent_tokenize can keep the whitespace as-is with the flags keep_whitespace=True and normalize_ascii=False.
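For example, here is a minimal sketch of iterating over the nested result (one inner list of word tokens per sentence); the exact tokens are whatever the library produces:

from ciseau import sent_tokenize

sentences = sent_tokenize(
    "Cat sat mat. Cat's named Cool.",
    keep_whitespace=True,
    normalize_ascii=False,
)

# The result is a list of sentences, each a list of word tokens that
# still carry their original whitespace.
for i, sentence in enumerate(sentences):
    print(i, "".join(sentence).strip())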
To install the package:
pip3 install ciseau
To run the test suite, run nose2.
If you find this project useful for your work or research, here's how you can cite it:
@misc{RaimanCiseau2017,
  author = {Raiman, Jonathan},
  title = {Ciseau},
  year = {2017},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/jonathanraiman/ciseau}},
  commit = {fe88b9d7f131b88bcdd2ff361df60b6d1cc64c04}
}