spacy-cleaner

Easily clean text with spaCy!

Key Features

spacy-cleaner utilises spaCy Language models to replace, remove, and mutate spaCy tokens. Cleaning actions available are:

  • Remove/replace stopwords.
  • Remove/replace punctuation.
  • Remove/replace numbers.
  • Remove/replace emails.
  • Remove/replace URLs.
  • Perform lemmatisation.

See our docs for more information.

Installation

pip install -U spacy-cleaner

Or install with Poetry:

poetry add spacy-cleaner

📖 Example

spacy-cleaner can clean text written in any language spaCy has a model for:

import spacy
import spacy_cleaner
from spacy_cleaner.processing import removers, replacers, mutators

# The model must be installed first: python -m spacy download en_core_web_sm
model = spacy.load("en_core_web_sm")

The Pipeline class allows for configurable cleaning of text using spaCy. A Pipeline is initialised with a model and one or more functions that transform spaCy tokens:

pipeline = spacy_cleaner.Pipeline(
    model,
    removers.remove_stopword_token,
    replacers.replace_punctuation_token,
    mutators.mutate_lemma_token,
)

Next, call the pipeline's clean method to clean a list of texts:

texts = ["Hello, my name is Cellan! I love to swim!"]

pipeline.clean(texts)

About the method clean:

The method clean is a wrapper around the spaCy Language class method pipe. Check the docs for more information:

https://spacy.io/api/language#pipe

Giving the output:

['hello _IS_PUNCT_ Cellan _IS_PUNCT_ love swim _IS_PUNCT_']
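
To make the output above easier to follow, here is a toy sketch in plain Python of how three token functions could combine to produce it. It does not use spaCy or spacy-cleaner: ToyToken, clean_tokens and the three functions are hypothetical stand-ins, the stop-word, punctuation and lemma values are set by hand to match the example, and the rule "the first function that returns a string decides the token's text" is an assumption made for illustration, not a description of the library's internals.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class ToyToken:
    """Hypothetical stand-in for a spaCy token (not part of any library)."""

    text: str
    lemma: str
    is_stop: bool = False
    is_punct: bool = False


# A transformer returns either the final string for a token, or the token
# itself to signal "unchanged, let the next function look at it".
Transformer = Callable[[ToyToken], Union[str, ToyToken]]


def remove_stopword(tok: ToyToken) -> Union[str, ToyToken]:
    # An empty string means the token is dropped from the output.
    return "" if tok.is_stop else tok


def replace_punctuation(tok: ToyToken) -> Union[str, ToyToken]:
    return "_IS_PUNCT_" if tok.is_punct else tok


def mutate_lemma(tok: ToyToken) -> Union[str, ToyToken]:
    return tok.lemma


def clean_tokens(tokens: List[ToyToken], *transformers: Transformer) -> str:
    """Apply the transformers to each token in order and join the results."""
    cleaned = []
    for tok in tokens:
        outcome: Union[str, ToyToken] = tok
        for transform in transformers:
            outcome = transform(tok)
            if isinstance(outcome, str):
                break  # assumed rule: the first string result wins
        text = outcome if isinstance(outcome, str) else outcome.text
        if text:
            cleaned.append(text)
    return " ".join(cleaned)


# Hand-built tokens for "Hello, my name is Cellan! I love to swim!"
tokens = [
    ToyToken("Hello", "hello"),
    ToyToken(",", ",", is_punct=True),
    ToyToken("my", "my", is_stop=True),
    ToyToken("name", "name", is_stop=True),
    ToyToken("is", "be", is_stop=True),
    ToyToken("Cellan", "Cellan"),
    ToyToken("!", "!", is_punct=True),
    ToyToken("I", "I", is_stop=True),
    ToyToken("love", "love"),
    ToyToken("to", "to", is_stop=True),
    ToyToken("swim", "swim"),
    ToyToken("!", "!", is_punct=True),
]

cleaned_text = clean_tokens(tokens, remove_stopword, replace_punctuation, mutate_lemma)
print(cleaned_text)
```

In this sketch the order of the functions matters: a stop word is dropped before the lemma function ever sees it, and only tokens that survive removal and replacement are lemmatised.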

Makefile usage

The Makefile contains many targets for faster development.

1. Download and remove Poetry

To download and install Poetry run:

make poetry-download

To uninstall Poetry, run:

make poetry-remove

2. Install all dependencies and pre-commit hooks

Install requirements:

make install

Pre-commit hooks can be installed after git init via:

make pre-commit-install

3. Codestyle

Automatic formatting uses pyupgrade, isort and black.

make codestyle

# or use synonym
make formatting

Codestyle checks only, without rewriting files:

make check-codestyle

Note: check-codestyle uses the isort, black and darglint libraries.

Update all dev libraries to the latest version with one command:

make update-dev-deps

4. Type checks

Run the mypy static type checker:

make mypy

5. Tests with coverage badges

Run pytest

make test

6. All linters

Of course, there is a command to run all linters in one go:

make lint

This is the same as:

make test && make check-codestyle && make mypy

7. Cleanup

Delete pycache files

make pycache-remove

Remove package build

make build-remove

Delete .DS_STORE files

make dsstore-remove

Remove .mypy_cache

make mypycache-remove

Or, to remove all of the above, run:

make cleanup

📈 Releases

You can see the list of available releases on the GitHub Releases page.

We follow the Semantic Versions specification.

We use Release Drafter. As pull requests are merged, a draft release is kept up-to-date listing the changes, ready to publish when you're ready. With the categories option, you can categorize pull requests in release notes using labels.

List of labels and corresponding titles

Label                          | Title in Releases
enhancement, feature           | 🚀 Features
bug, refactoring, bugfix, fix  | 🔧 Fixes & Refactoring
build, ci, testing             | 📦 Build System & CI/CD
breaking                       | 💥 Breaking Changes
documentation                  | 📝 Documentation
dependencies                   | ⬆️ Dependencies updates

You can update it in release-drafter.yml.

GitHub creates the bug, enhancement, and documentation labels for you. Dependabot creates the dependencies label. Create the remaining labels on the Issues tab of your GitHub repository, when you need them.

🛡 License

This project is licensed under the terms of the MIT license. See LICENSE for more details.

📃 Citation

@misc{spacy-cleaner,
  author = {spacy-cleaner},
  title = {Easily clean text with spaCy!},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Ce11an/spacy-cleaner}}
}

🚀 Credits

This project was generated with python-package-template

This project was built using IntelliJ IDEA