A range of packages is needed to run this system. It requires the latest versions of:
- Natural Language Toolkit (NLTK), including the NLTK corpora
- Gensim
- NumPy
- scikit-learn
- Beautiful Soup 4
- SciPy
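One way to install these is via pip (the names below are the standard PyPI package names), followed by NLTK's downloader to fetch the corpora:

```sh
pip install nltk gensim numpy scikit-learn beautifulsoup4 scipy
python -m nltk.downloader all
```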
To run the system, use:

```sh
python load_data.py A B C D
```
A is the mode:
- 1 - Bag of words (unigram) feature testing
- 2 - Bigram feature testing
- 3 - Trigram feature testing
- 4 - Bag of words (unigram) topic model feature testing
- 5 - Bigram topic model feature testing
- 6 - Trigram topic model feature testing
- 7 - Bag of words (unigram) clustering
- 8 - Bag of words (unigram) test classification
Note that option 7 runs and evaluates all three clustering algorithms.
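Modes 1-6 test two feature families. As an illustration only (this is not the project's code), the sketch below shows how such features could be built: n-gram features via NLTK for modes 1-3, and topic-distribution features via a Gensim topic model for modes 4-6. The README does not name the topic model; LDA is an assumption here, suggested by the Gensim dependency.

```python
# Illustrative sketch only, not the project's actual feature extraction.
from nltk import word_tokenize
from nltk.util import ngrams
from gensim import corpora, models

def ngram_features(text, n):
    """n=1: unigrams (bag of words); n=2: bigrams; n=3: trigrams."""
    return list(ngrams(word_tokenize(text.lower()), n))

def topic_features(tokenised_docs, num_topics=50):
    """Map each tokenised document to a dense topic-probability vector."""
    dictionary = corpora.Dictionary(tokenised_docs)
    bows = [dictionary.doc2bow(doc) for doc in tokenised_docs]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    return [[p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
            for bow in bows]
```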
B is the classification mode, indicating the type of classifier to use for the chosen feature mode; it applies only to options 1-6:
- 1 - Naive Bayes classifier
- 2 - Decision Tree classifier
- 3 - Random Forests classifier
Each classifier is evaluated via k-fold cross-validation, with k=10.
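As a minimal sketch of that evaluation scheme (illustrative, not the project's actual code), assuming scikit-learn, a prebuilt feature matrix X, and labels y:

```python
# Minimal sketch: 10-fold cross-validation over the three classifier
# options. MultinomialNB stands in for the Naive Bayes option; the
# classifier implementations the system actually uses may differ.
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

CLASSIFIERS = {
    1: MultinomialNB(),            # option 1 - Naive Bayes
    2: DecisionTreeClassifier(),   # option 2 - Decision Tree
    3: RandomForestClassifier(),   # option 3 - Random Forests
}

def evaluate(option, X, y):
    """Mean accuracy of the chosen classifier over k=10 folds."""
    return cross_val_score(CLASSIFIERS[option], X, y, cv=10).mean()
```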
C indicates whether to limit data loading:
- 1 - Limit data loading
- 0 - Do not limit data loading
C is used in conjunction with D, the number of Reuters sub-files to load (1-21). If 0 is passed as C, D should also be 0.
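For example, to run bag-of-words feature testing with a Naive Bayes classifier on a limited load of two Reuters sub-files, and then over the full dataset:

```sh
# Mode 1 (unigrams), Naive Bayes, limited load of 2 sub-files
python load_data.py 1 1 1 2

# The same configuration over the full dataset (no loading limit)
python load_data.py 1 1 0 0
```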
- Options C and D are largely for debugging purposes.
- The system is likely to take a long time to complete its task, especially when run without loading limits. Expect the n-gram classifiers to take multiple hours; topic models are typically much faster.
- wrapper.py provides a wrapper that runs modes 1 to 6 with all three classifiers (a sketch of the equivalent loop appears after this list). This takes a long time to run!
- Runtime output is written to output.txt in the scripts directory.
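For illustration, wrapper.py's behaviour is equivalent to a loop like the following (a hypothetical sketch, not necessarily how wrapper.py is implemented), invoking load_data.py for every mode/classifier combination with no loading limits:

```python
# Hypothetical sketch of the wrapper's behaviour: run every feature
# mode (1-6) with every classifier (1-3), with no loading limits.
import subprocess

for mode in range(1, 7):        # feature modes 1-6
    for clf in range(1, 4):     # classifier options 1-3
        subprocess.run(["python", "load_data.py",
                        str(mode), str(clf), "0", "0"], check=True)
```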