
NLP Preprocessing Pipeline Wrappers


🍺IPA: import, preprocess, accelerate


How to use


Install the library from PyPI:

pip install ipa-core


IPA is a Python library that provides preprocessing wrappers for Stanza and spaCy behind a unified API, making the two libraries interchangeable.

Let's start with a simple example. Here we are using the SpacyTokenizer wrapper to preprocess a text:

from ipa import SpacyTokenizer

spacy_tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .

You can load any spaCy model either by its canonical name, such as en_core_web_sm, or by a simple alias, such as en, as we did here. By default, an alias loads the smaller version of the model. For a complete list of available models, see the spaCy documentation.

In the very same way, you can load any model from Stanza using the StanzaTokenizer wrapper:

from ipa import StanzaTokenizer

stanza_tokenizer = StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
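
Because both wrappers are called the same way and yield words with the same attributes, downstream code can stay backend-agnostic. The following is a minimal sketch, not part of IPA: format_rows and the stand-in fake_tokenizer are hypothetical names, and the only assumption about a tokenizer is what the examples above show, namely that calling it on a string yields items with index, text, pos and lemma.

```python
from types import SimpleNamespace


def format_rows(tokenizer, text):
    """Render one aligned row per word, for any tokenizer with the shared API."""
    return [
        "{:<5} {:<10} {:<10} {:<10}".format(w.index, w.text, w.pos, w.lemma)
        for w in tokenizer(text)
    ]


def fake_tokenizer(text):
    """Hypothetical stand-in so the sketch runs without spaCy or Stanza installed."""
    return [
        SimpleNamespace(index=i, text=t, pos="X", lemma=t.lower())
        for i, t in enumerate(text.split())
    ]


rows = format_rows(fake_tokenizer, "Mary sold the car")
print("\n".join(rows))
```

In real use, a SpacyTokenizer or StanzaTokenizer instance built as in the examples above would take the place of fake_tokenizer.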

For simpler scenarios, you can use the WhitespaceTokenizer wrapper, which just splits the text on whitespace:

from ipa import WhitespaceTokenizer

whitespace_tokenizer = WhitespaceTokenizer()
tokenized = whitespace_tokenizer("Mary sold the car to John .")
for word in tokenized:
    print("{:<5} {:<10}".format(word.index, word.text))

0    Mary
1    sold
2    the
3    car
4    to
5    John
6    .
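
Conceptually, whitespace tokenization amounts to splitting the string and numbering the pieces. The sketch below illustrates that idea with the standard library only; SimpleToken and whitespace_tokenize are hypothetical names, and this is not IPA's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleToken:
    index: int
    text: str


def whitespace_tokenize(text: str) -> List[SimpleToken]:
    # str.split() with no arguments splits on runs of whitespace
    # and ignores leading and trailing whitespace.
    return [SimpleToken(i, piece) for i, piece in enumerate(text.split())]


tokens = whitespace_tokenize("Mary sold the car to John .")
for token in tokens:
    print("{:<5} {:<10}".format(token.index, token.text))
```

Note that punctuation is only separated if the input already contains a space before it, as in "John ." above; with "John." the period would stay attached to the word.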


Complete preprocessing pipeline

SpacyTokenizer and StanzaTokenizer provide a unified API for both libraries, exposing most of their features: tokenization, Part-of-Speech tagging, lemmatization and dependency parsing. You can switch the annotation steps on and off with return_pos_tags, return_lemmas and return_deps. For example,

StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)

will return a list of Token objects, with the pos and lemma fields filled.



By contrast, with all three flags left at their default of False, the tokenizer returns a list of Token objects with only the text field filled.
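
One way to picture the effect of these flags is a token type whose optional fields stay empty unless requested. The sketch is illustrative only: SketchToken and make_tokens are hypothetical, the annotations are passed in by hand rather than produced by a model, and IPA's real Token class may differ (for example, in how unfilled fields are represented).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class SketchToken:
    index: int
    text: str
    pos: Optional[str] = None
    lemma: Optional[str] = None


def make_tokens(
    annotated: Sequence[Tuple[str, str, str]],
    return_pos_tags: bool = False,
    return_lemmas: bool = False,
) -> List[SketchToken]:
    # Each input item is (text, pos, lemma); a field is copied only if its flag is on.
    return [
        SketchToken(
            index=i,
            text=text,
            pos=pos if return_pos_tags else None,
            lemma=lemma if return_lemmas else None,
        )
        for i, (text, pos, lemma) in enumerate(annotated)
    ]


words = [("sold", "VERB", "sell"), ("car", "NOUN", "car")]
full = make_tokens(words, return_pos_tags=True, return_lemmas=True)
bare = make_tokens(words)
print(full[0])
print(bare[0])
```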

GPU support

With use_gpu=True, the library will use the GPU if it is available. To set up the environment for the GPU, refer to the Stanza documentation and the spaCy documentation.
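
A common pattern is to decide the value of use_gpu at runtime. The helper below is a sketch under the assumption that PyTorch is the backend of interest; gpu_available is a hypothetical helper, not part of IPA, and it falls back to False when PyTorch is not installed.

```python
def gpu_available() -> bool:
    """Return True only if PyTorch is importable and reports a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


use_gpu = gpu_available()
print("use_gpu =", use_gpu)

# The flag can then be forwarded to a wrapper, e.g.:
# tokenizer = StanzaTokenizer(language="en", use_gpu=use_gpu)
```

A visible CUDA device does not by itself guarantee that spaCy or Stanza is configured to use it, so the setup steps in their documentation still apply.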




class SpacyTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):


class StanzaTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):


class WhitespaceTokenizer(BaseTokenizer):
    def __init__(self):

Sentence Splitter


class SpacySentenceSplitter(BaseSentenceSplitter):
    def __init__(self, language: str = "en", model_type: str = "statistical"):