
Text Topic Classifier

Primary Language: Python

Text file topic classifier using tf-idf similarity and an inverted index

  • This code was written for a class assignment from COSE471, 2018-1, Korea University.
  • I do not own any rights to the text content in the sample data.

What this code does

  • There are topic-labelled datasets in the "Data" folder.
  • There is a document named "input_document".
  • Label the input document with the most relevant topic.

How it's done

  • Read text data from the text files in the "Data" folder.
  • Tokenize and stem the text, then remove stopwords.
  • Build an inverted index for the documents, and keep only the files that share at least one word token with the input document.
  • Compute tf-idf similarity scores between the input document and each of the remaining documents, and find the one with the highest score.
  • Label the input document with the topic of the sample document with the highest similarity score.
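
The steps above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the function names, the tiny stopword list, the smoothed idf formula, and the use of cosine similarity as the "tf-idf similarity" are all assumptions, and stemming is left out so the example needs only the standard library.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (assumption; a real run would use a fuller list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords (stemming omitted here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def tfidf_vector(tokens, index, n_docs):
    """Sparse tf-idf vector; the smoothed idf is an assumed variant."""
    counts = Counter(tokens)
    return {
        t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + len(index.get(t, ())))) + 1)
        for t, c in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def classify(input_text, labelled_texts):
    """labelled_texts maps doc id -> (topic, text). Returns the best topic, or None."""
    docs = {doc_id: tokenize(text) for doc_id, (_, text) in labelled_texts.items()}
    index = build_inverted_index(docs)
    query = tokenize(input_text)

    # Use the inverted index to keep only documents sharing a token with the input.
    candidates = set()
    for token in set(query):
        candidates |= index.get(token, set())
    if not candidates:
        return None

    qvec = tfidf_vector(query, index, len(docs))
    best = max(candidates, key=lambda d: cosine(qvec, tfidf_vector(docs[d], index, len(docs))))
    return labelled_texts[best][0]


# Hypothetical sample data, standing in for the files in the "Data" folder.
samples = {
    "s1": ("sports", "The team won the football match after a late goal."),
    "s2": ("cooking", "Simmer the tomato sauce and season the pasta with basil."),
}
print(classify("A late goal decided the football final.", samples))
```

The inverted index only narrows the candidate set; the ranking itself comes from the similarity scores, so documents with no token in common with the input are never scored.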