- Preprocess articles (word tokenize, remove stop words, remove punctuation, conduct stemming*)
- Calculate tf-idf for each term
- Calculate pairwise cosine similarity for the documents
*The Porter stemmer was used for stemming
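
The tf-idf and cosine similarity steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code in cosine_similarity_tfidf_nltk.py. It assumes each document has already been preprocessed into a list of stemmed tokens, and it uses one common weighting (tf = term count / document length, idf = ln(N / document frequency)); the actual script may weight terms differently. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document.

    docs is a list of token lists (assumed to be already preprocessed).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def pairwise_similarity(docs):
    """Return the full document-by-document similarity matrix."""
    vectors = tf_idf_vectors(docs)
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical, already-stemmed sample documents.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "sat", "hat"],
    ["dog", "bark"],
]
sim = pairwise_similarity(docs)
```

With this idf definition, a term that appears in every document gets a weight of zero, so it does not contribute to the similarity between any pair of documents.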
- place cosine_similarity_tfidf_nltk.py in a directory at the same level as inputdata/
- run
python cosine_similarity_tfidf_nltk.py
NOTE: you may need to install NLTK and download some of its packages. You can do this by running a Python script that imports nltk and then calls nltk.download(), which will open a GUI. This script is not intended for many or large files.
- main source file can be found at /cosine_similarity_tfidf_nltk.py
- step-by-step Jupyter notebook
- input files were assigned and can be found at /inputdata