Keyword-Similarity-Check
A very simple experiment to implement spacy
to remove contextually similar keywords from a list.
Install
- Install spacy
pip install spacy
- Download a trained spacy model. Spacy currently has several models for several languages, as can be seen here - https://spacy.io/models/en
The code currently uses en_core_web_lg - https://spacy.io/models/en#en_core_web_lg - a 382 MB large language model.
To install python -m spacy download en_core_web_lg
If you decide to go with a different model, be sure to load it appropriately in line 9 of the script.
-
Add your keywords in a single column CSV file called
keywords.csv
-
Let the script run, the output will be stores in
unique.csv
How it works
-
Uses spacy's
token.lemma_
to lemmatize the keywords (this is a trainable metric, so I need to play around with this - in it's present form it uses the default lemmatization) -
Uses the
combination
function fromitertools
to pair keywords - then makes them run a check against each others with similarity set to > 0.95 -
Reads file
keywords.csv
- removes contextually duplicate keywords and writes to fileunique.csv