A range of packages is needed to run this system. It requires the latest versions of:
- Natural Language Toolkit (NLTK), including the NLTK corpora
- Gensim
- NumPy
- scikit-learn
- Beautiful Soup 4
- SciPy
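One way to install these is via pip (the names below are the standard PyPI package names), followed by NLTK's downloader to fetch the corpora:

```sh
pip install nltk gensim numpy scikit-learn beautifulsoup4 scipy
python -m nltk.downloader all
```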
To run the system, use:

```sh
python load_data.py A B C D
```
A is the mode:
- 1 - Bag of words (unigram) feature testing
- 2 - Bigram feature testing
- 3 - Trigram feature testing
- 4 - Bag of words (unigram) topic model feature testing
- 5 - Bigram topic model feature testing
- 6 - Trigram topic model feature testing
- 7 - Bag of words (unigram) clustering
- 8 - Bag of words (unigram) test classification
Note that option 7 runs and evaluates all three clustering algorithms.
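Modes 1-6 test two feature families. As an illustration only (this is not the project's code), the sketch below shows how such features could be built: n-gram features via NLTK for modes 1-3, and topic-distribution features via a Gensim topic model for modes 4-6. The README does not name the topic model; LDA is an assumption here, suggested by the Gensim dependency.

```python
# Illustrative sketch only, not the project's actual feature extraction.
from nltk import word_tokenize
from nltk.util import ngrams
from gensim import corpora, models

def ngram_features(text, n):
    """n=1: unigrams (bag of words); n=2: bigrams; n=3: trigrams."""
    return list(ngrams(word_tokenize(text.lower()), n))

def topic_features(tokenised_docs, num_topics=50):
    """Map each tokenised document to a dense topic-probability vector."""
    dictionary = corpora.Dictionary(tokenised_docs)
    bows = [dictionary.doc2bow(doc) for doc in tokenised_docs]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    return [[p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
            for bow in bows]
```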
B is the classification mode, indicating the type of classifier to use for the chosen feature mode; it applies only to options 1-6:
- 1 - Naive Bayes classifier
- 2 - Decision Tree classifier
- 3 - Random Forests classifier
Each classifier is evaluated via k-fold cross-validation, with k=10.
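As a minimal sketch of that evaluation scheme (illustrative, not the project's actual code), assuming scikit-learn, a prebuilt feature matrix X, and labels y:

```python
# Minimal sketch: 10-fold cross-validation over the three classifier
# options. MultinomialNB stands in for the Naive Bayes option; the
# classifier implementations the system actually uses may differ.
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

CLASSIFIERS = {
    1: MultinomialNB(),            # option 1 - Naive Bayes
    2: DecisionTreeClassifier(),   # option 2 - Decision Tree
    3: RandomForestClassifier(),   # option 3 - Random Forests
}

def evaluate(option, X, y):
    """Mean accuracy of the chosen classifier over k=10 folds."""
    return cross_val_score(CLASSIFIERS[option], X, y, cv=10).mean()
```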
C indicates whether to limit data loading:
- 1 - Limit data loading
- 0 - Do not limit data loading
C is used in conjunction with D, the number of Reuters sub-files to load (1-21). If 0 is passed as C, D should also be 0.
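For example, to run bag-of-words feature testing with a Naive Bayes classifier on a limited load of two Reuters sub-files, and then over the full dataset:

```sh
# Mode 1 (unigrams), Naive Bayes, limited load of 2 sub-files
python load_data.py 1 1 1 2

# The same configuration over the full dataset (no loading limit)
python load_data.py 1 1 0 0
```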
- Options C and D are largely for debugging purposes.
- The system is likely to take a long time to complete its task, especially when run without loading limits. Expect the n-gram classifiers to take multiple hours; topic models are typically much faster.
- wrapper.py provides a wrapper that runs modes 1 to 6 with all three classifiers (a sketch of the equivalent loop appears after this list). This takes a long time to run!
- Runtime output is written to output.txt in the scripts directory.
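For illustration, wrapper.py's behaviour is equivalent to a loop like the following (a hypothetical sketch, not necessarily how wrapper.py is implemented), invoking load_data.py for every mode/classifier combination with no loading limits:

```python
# Hypothetical sketch of the wrapper's behaviour: run every feature
# mode (1-6) with every classifier (1-3), with no loading limits.
import subprocess

for mode in range(1, 7):        # feature modes 1-6
    for clf in range(1, 4):     # classifier options 1-3
        subprocess.run(["python", "load_data.py",
                        str(mode), str(clf), "0", "0"], check=True)
```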