spacy-cleaner utilises spaCy Language models to replace, remove, and mutate spaCy tokens. Cleaning actions available are:
- Remove/replace stopwords.
- Remove/replace punctuation.
- Remove/replace numbers.
- Remove/replace emails.
- Remove/replace URLs.
- Perform lemmatisation.
See our docs for more information.
pip install -U spacy-cleaner
Or install with Poetry:
poetry add spacy-cleaner
spacy-cleaner can clean text written in any language spaCy has a model for:
import spacy
import spacy_cleaner
from spacy_cleaner.processing import removers, replacers, mutators
model = spacy.load("en_core_web_sm")
Class Pipeline allows for configurable cleaning of text using spaCy. The Pipeline is initialised with a model and functions that transform spaCy tokens:
pipeline = spacy_cleaner.Pipeline(
    model,
    removers.remove_stopword_token,
    replacers.replace_punctuation_token,
    mutators.mutate_lemma_token,
)
Next, the pipeline can be called with the method clean to clean a list of texts:
texts = ["Hello, my name is Cellan! I love to swim!"]
pipeline.clean(texts)
About the method clean: it is a wrapper around the spaCy Language class method pipe. Check the spaCy docs for more information.
Giving the output:
['hello _IS_PUNCT_ Cellan _IS_PUNCT_ love swim _IS_PUNCT_']
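To make the remove, replace, and mutate steps concrete, here is a minimal pure-Python sketch of how a chain of token functions could produce the output above. It does not use spaCy or spacy-cleaner: the Token dataclass, the three functions, the clean helper, and the hand-annotated stop-word, punctuation, and lemma values are all assumptions made for illustration, not the library's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a spaCy token (not the real class)."""
    text: str
    lemma: str
    is_stop: bool = False
    is_punct: bool = False


Transformer = Callable[[Token], Union[str, Token]]


def remove_stopword(tok: Token) -> Union[str, Token]:
    # A remover returns an empty string when the token should be dropped.
    return "" if tok.is_stop else tok


def replace_punctuation(tok: Token) -> Union[str, Token]:
    # A replacer swaps the token for a placeholder string.
    return "_IS_PUNCT_" if tok.is_punct else tok


def mutate_lemma(tok: Token) -> Union[str, Token]:
    # A mutator returns a transformed form of the token, here its lemma.
    return tok.lemma


def clean(tokens: List[Token], *transformers: Transformer) -> str:
    cleaned = []
    for tok in tokens:
        result: Union[str, Token] = tok
        for transform in transformers:
            result = transform(result)
            if isinstance(result, str):  # token already handled, stop here
                break
        if isinstance(result, Token):
            result = result.text
        if result:  # skip tokens that were removed
            cleaned.append(result)
    return " ".join(cleaned)


# Hand-annotated tokens for "Hello, my name is Cellan! I love to swim!"
TOKENS = [
    Token("Hello", "hello"), Token(",", ",", is_punct=True),
    Token("my", "my", is_stop=True), Token("name", "name", is_stop=True),
    Token("is", "be", is_stop=True), Token("Cellan", "Cellan"),
    Token("!", "!", is_punct=True), Token("I", "I", is_stop=True),
    Token("love", "love"), Token("to", "to", is_stop=True),
    Token("swim", "swim"), Token("!", "!", is_punct=True),
]

cleaned = clean(TOKENS, remove_stopword, replace_punctuation, mutate_lemma)
print(cleaned)
```

In this sketch the order of the functions matters: the first function that returns a string decides the fate of the token, so a stop word is removed before the lemma step ever sees it.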
The Makefile contains many targets for faster development.
1. Download and remove Poetry
To download and install Poetry run:
make poetry-download
To uninstall Poetry run:
make poetry-remove
2. Install all dependencies and pre-commit hooks
Install requirements:
make install
Pre-commit hooks can be installed after git init via:
make pre-commit-install
3. Codestyle
Automatic formatting uses pyupgrade, isort and black.
make codestyle
# or use synonym
make formatting
Codestyle checks only, without rewriting files:
make check-codestyle
Note: check-codestyle uses the isort, black and darglint libraries.
Update all dev libraries to the latest version using one command
make update-dev-deps
4. Type checks
Run the mypy static type checker:
make mypy
5. Tests with coverage badges
Run pytest
make test
6. All linters
Of course, there is a command to run all linters in one:
make lint
This is the same as:
make test && make check-codestyle && make mypy
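The && chaining means each step runs only if the previous one succeeded. A rough Python sketch of that short-circuit behaviour follows. run_chain is a hypothetical helper, and the stand-in commands call the Python interpreter rather than the real make targets, so the sketch runs anywhere.

```python
import subprocess
import sys
from typing import Sequence


def run_chain(commands: Sequence[Sequence[str]]) -> int:
    """Run commands in order and stop at the first non-zero exit code."""
    for command in commands:
        code = subprocess.run(list(command)).returncode
        if code != 0:
            return code  # like `&&`, later commands are skipped
    return 0


# Stand-ins for `make test`, `make check-codestyle`, `make mypy`.
ok = [sys.executable, "-c", "pass"]
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]

all_pass = run_chain([ok, ok, ok])
stops_early = run_chain([ok, failing, ok])
print(all_pass, stops_early)
```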
7. Cleanup
Delete pycache files
make pycache-remove
Remove package build
make build-remove
Delete .DS_Store files
make dsstore-remove
Remove .mypy_cache
make mypycache-remove
Or to remove all above run:
make cleanup
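For readers without make, the effect of the cache and .DS_Store targets can be approximated in Python. This is a sketch under assumptions: the cleanup function and its constants are hypothetical, it does not reproduce the Makefile's actual commands, and it leaves the package build directory alone.

```python
import shutil
from pathlib import Path
from typing import List

CACHE_DIRS = ("__pycache__", ".mypy_cache")  # directories to delete
JUNK_FILES = (".DS_Store",)                  # files to delete


def cleanup(root: Path) -> List[Path]:
    """Delete cache directories and junk files below `root`."""
    removed = []
    # Collect paths first, then walk deepest-first so that deleting a
    # directory never invalidates a path we still need to visit.
    for path in sorted(root.rglob("*"), reverse=True):
        if not path.exists():
            continue  # already removed together with a parent directory
        if path.is_dir() and path.name in CACHE_DIRS:
            shutil.rmtree(path)
            removed.append(path)
        elif path.is_file() and path.name in JUNK_FILES:
            path.unlink()
            removed.append(path)
    return removed
```

Calling cleanup(Path(".")) from the repository root would remove the matching files and directories and return the list of deleted paths.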
You can see the list of available releases on the GitHub Releases page.
We follow Semantic Versions specification.
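As a quick reminder of what Semantic Versioning implies for version numbers, the sketch below bumps a MAJOR.MINOR.PATCH string. The bump function is a hypothetical helper written for illustration and is not part of spacy-cleaner. It ignores pre-release and build metadata.

```python
def bump(version: str, part: str) -> str:
    """Return `version` with the given part incremented."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":  # incompatible API changes
        return f"{major + 1}.0.0"
    if part == "minor":  # backwards-compatible features
        return f"{major}.{minor + 1}.0"
    if part == "patch":  # backwards-compatible bug fixes
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("1.2.3", "patch"))  # 1.2.4
print(bump("1.2.3", "minor"))  # 1.3.0
print(bump("1.2.3", "major"))  # 2.0.0
```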
We use Release Drafter. As pull requests are merged, a draft release is kept up-to-date listing the changes, ready to publish when you're ready. With the categories option, you can categorize pull requests in release notes using labels.
Label | Title in Releases |
---|---|
enhancement, feature | 🚀 Features |
bug, refactoring, bugfix, fix | 🔧 Fixes & Refactoring |
build, ci, testing | 📦 Build System & CI/CD |
breaking | 💥 Breaking Changes |
documentation | 📝 Documentation |
dependencies | ⬆️ Dependencies updates |
You can update it in release-drafter.yml.
GitHub creates the bug, enhancement, and documentation labels for you. Dependabot creates the dependencies label. Create the remaining labels on the Issues tab of your GitHub repository, when you need them.
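The label-to-title mapping in the table can be expressed as a small lookup. The sketch below is only an illustration of the table: CATEGORIES and sections_for are hypothetical names, the emoji prefixes of the titles are omitted, and the real grouping is done by Release Drafter from release-drafter.yml, whose handling of pull requests with several labels may differ.

```python
from typing import Iterable, List, Tuple

# (title in releases, labels that select it), in the order of the table.
CATEGORIES: List[Tuple[str, Tuple[str, ...]]] = [
    ("Features", ("enhancement", "feature")),
    ("Fixes & Refactoring", ("bug", "refactoring", "bugfix", "fix")),
    ("Build System & CI/CD", ("build", "ci", "testing")),
    ("Breaking Changes", ("breaking",)),
    ("Documentation", ("documentation",)),
    ("Dependencies updates", ("dependencies",)),
]


def sections_for(labels: Iterable[str]) -> List[str]:
    """Return every release title whose labels overlap the given labels."""
    wanted = set(labels)
    return [title for title, names in CATEGORIES if wanted.intersection(names)]


print(sections_for(["bugfix"]))         # ['Fixes & Refactoring']
print(sections_for(["feature", "ci"]))  # ['Features', 'Build System & CI/CD']
```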
This project is licensed under the terms of the MIT license. See LICENSE for more details.
@misc{spacy-cleaner,
author = {spacy-cleaner},
title = {Easily clean text with spaCy!},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Ce11an/spacy-cleaner}}
}
This project was generated with python-package-template
This project was built using IntelliJ IDEA