/cosine_similarity_tfidf_nltk

Calculate TF-IDF and cosine similarity using NLTK

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

Calculate TF-IDF and Cosine Similarity

Overview

  1. Preprocess articles (word tokenize, remove stop words, remove punctuation, conduct stemming*)
  2. Calculate tf-idf for each term
  3. Calculate pairwise cosine similarity for the documents

*Stemming uses the Porter stemmer.
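
The three steps above can be sketched roughly as follows. This is an illustrative outline, not the code in cosine_similarity_tfidf_nltk.py: the function names are hypothetical, lowercasing is an added assumption, and the weighting shown (term frequency divided by document length, times the natural log of N/df) is only one common TF-IDF variant, so the script's numbers may differ. The preprocess function assumes NLTK and its tokenizer and stop word data are installed; the other functions use only the standard library.

```python
import math
from collections import Counter


def preprocess(text):
    """Step 1 (assumed order): tokenize, drop stop words and punctuation, stem."""
    # Imported lazily so the TF-IDF functions below work without NLTK installed.
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    tokens = word_tokenize(text.lower())  # lowercasing is an assumption
    return [
        stemmer.stem(tok)
        for tok in tokens
        # keep tokens that contain a letter or digit and are not stop words
        if any(ch.isalnum() for ch in tok) and tok not in stop_words
    ]


def tfidf_vectors(docs):
    """Step 2: docs is a list of token lists; returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a vector with no weight is treated as similar to nothing
    return dot / (norm_u * norm_v)


def pairwise_similarity(docs):
    """Step 3: matrix of cosine similarities between every pair of documents."""
    vectors = tfidf_vectors(docs)
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]


# Example with tokens that are already preprocessed:
docs = [["cat", "sat", "mat"], ["cat", "sat", "hat"], ["dog", "ran"]]
matrix = pairwise_similarity(docs)
```

In this variant a term that appears in every document gets a weight of zero, because log(N/df) is zero, so shared boilerplate words do not raise the similarity between documents.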

How to use

  1. Place cosine_similarity_tfidf_nltk.py at the same level as inputdata/ (that is, in the directory that contains it)
  2. Run python cosine_similarity_tfidf_nltk.py

NOTE: You may need to install NLTK and download some of its packages. To do so, start Python, import nltk, and call nltk.download(), which opens a GUI for selecting packages. This script is not intended for many files or for large files.
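
As an alternative to the GUI, the NLTK data can be fetched from a short script. The sketch below is a minimal example under stated assumptions: the package names "punkt" and "stopwords" are guesses based on the preprocessing steps in the overview, not a list taken from the script, and the helper name is hypothetical. Recent NLTK releases may ask for "punkt_tab" instead of "punkt"; NLTK's error message names the missing resource if so.

```python
# Assumed resource names; the script may need different or additional ones.
ASSUMED_PACKAGES = ("punkt", "stopwords")


def download_nltk_packages(packages=ASSUMED_PACKAGES):
    """Download each named NLTK resource without opening the GUI.

    Returns a dict mapping each package name to True or False,
    the success flag reported by nltk.download().
    """
    # Imported here so this file can be loaded before NLTK is installed.
    import nltk

    return {name: nltk.download(name, quiet=True) for name in packages}


# Usage (requires NLTK and network access):
#   status = download_nltk_packages()
#   print(status)
```

NLTK itself is installed with pip install nltk; the download step is only needed once per machine.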

Source Code

Input information

  • Input files were assigned and can be found in /inputdata

Results