JackBurdick/cosine_similarity_tfidf_nltk

Calculate TF-IDF and cosine similarity using NLTK

Jupyter Notebook, Apache-2.0

Calculate TFIDF and Cosine Similarity

Overview

Preprocess articles (word tokenize, remove stop words, remove punctuation, conduct stemming*)
Calculate tf-idf for each term
Calculate pairwise cosine similarity for the documents

*Stemming uses the Porter stemmer.
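The tf-idf and cosine similarity steps above can be sketched in plain Python. This is a minimal illustration, not the repository's code: it assumes each document has already been preprocessed into a list of tokens, and it assumes one common weighting (term frequency normalized by document length, multiplied by log(N / df)). The function names, the weighting variant, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Map each document name to a {term: weight} dict.

    `docs` maps a document name to its list of preprocessed tokens.
    Assumed weighting: (count / doc length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


def pairwise_similarity(weights):
    """Cosine similarity for every unordered pair of documents."""
    names = sorted(weights)
    return {
        (x, y): cosine_similarity(weights[x], weights[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


# Toy, already-tokenized documents (hypothetical data).
docs = {
    "a": ["cat", "sat", "mat"],
    "b": ["cat", "sat", "hat"],
    "c": ["dog", "ran", "park"],
}
weights = tf_idf(docs)
sims = pairwise_similarity(weights)
for pair, score in sims.items():
    print(pair, round(score, 4))
```

With this weighting, a term that appears in every document gets a weight of zero, so documents that share only such terms score 0. Other tf-idf variants (smoothed idf, raw counts) would give different numbers.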

How to use

Place cosine_similarity_tfidf_nltk.py in a directory at the same level as inputdata/
Run python cosine_similarity_tfidf_nltk.py

NOTE: You may need to install NLTK and download some of its packages. You can do this by running a Python script that imports nltk and then calls nltk.download(), which opens a GUI. This script is not intended for many or large files.
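If a GUI is not available, the NLTK data can also be fetched by name. The sketch below is an assumption about what this script needs: the helper name download_nltk_resources is hypothetical, and the resource identifiers "punkt" (tokenizer models) and "stopwords" (stop word lists) are guesses based on the preprocessing steps listed above. Recent NLTK releases may ask for "punkt_tab" instead of "punkt".

```python
def download_nltk_resources(resources=("punkt", "stopwords")):
    """Download the named NLTK data packages without opening the GUI.

    The default resource names are an assumption about what the script
    needs; adjust them if NLTK reports a missing resource.
    Returns a {resource name: success flag} dict.
    """
    import nltk  # imported here so the helper can be defined before NLTK is installed

    results = {}
    for name in resources:
        # nltk.download returns True on success and False on failure.
        results[name] = nltk.download(name)
    return results
```

Calling download_nltk_resources() once (after pip install nltk) should be enough; NLTK caches the data locally and skips packages that are already up to date.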

Source Code

The main source file can be found at /cosine_similarity_tfidf_nltk.py
Step-by-step Jupyter notebook

Input information

Input files were assigned and can be found in /inputdata

Results

Results can be viewed in /results
stepwise preprocessing results
tf-idf results
pairwise cosine_similarity results