/weka_text_classification

Naive Bayes and SVM classification implemented in WEKA

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Weka document classification

[Figure: process overview]

Convert .txt files into WEKA-accepted .arff files, then classify the documents into four categories.

Results

[Figure: classification results comparison]

Dataset Information

The data set consists of 2803 training documents and 1396 test documents. The documents are text-based web pages collected from the computer science departments of various universities, and each is classified into one of four categories (student, faculty, project, and course). The data set has already been preprocessed: stop words have been removed and stemming has been performed.

Strategy

Project information and a guided walkthrough can be found in ./report.pdf.

Use

  • The Python conversion script is hard-coded to read specific files in ./input/, convert them, and write the output to ./output/.
  • To use WEKA, download and install it, then follow the highlighted instructions included in ./report.pdf.

Note:

The conversion script expects its input in a specific format. This can easily be changed in ./convert_txt_to_arff_specific.py. Notes about the input format are included in ./report.pdf.
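
As a rough illustration of the .txt to .arff conversion step, here is a minimal sketch. It is not the code in ./convert_txt_to_arff_specific.py. It assumes each input line has the form `label<TAB>text`, which may differ from the format described in ./report.pdf. The function names, the attribute names `text` and `category`, and the file names in the usage note are all hypothetical.

```python
from pathlib import Path


def escape_arff_string(text: str) -> str:
    """Quote a value for an ARFF string attribute, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def lines_to_arff(lines, relation="documents"):
    """Convert 'label<TAB>text' lines (assumed format) into ARFF file contents."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, _, text = line.partition("\t")
        rows.append((label, text))

    # The nominal class attribute lists every label seen in the input.
    labels = sorted({label for label, _ in rows})
    out = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute category {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    # One data row per document: quoted text first, class label last.
    out += [f"{escape_arff_string(text)},{label}" for label, text in rows]
    return "\n".join(out) + "\n"


def convert_file(src, dst):
    """Read one .txt file and write the corresponding .arff file."""
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    with src.open(encoding="utf-8") as f:
        arff = lines_to_arff(f, relation=src.stem)
    dst.write_text(arff, encoding="utf-8")
```

A call such as `convert_file("input/train.txt", "output/train.arff")` would produce one ARFF file per input file. Because the sketch stores each document as a single string attribute, WEKA would still need to turn that string into word features (for example with its StringToWordVector filter) before training Naive Bayes or SVM.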