
LT2212 V20 Assignment 2

Part 1

I split each news sample on whitespace and lowercase all of the tokens. I then use the isalpha() method to filter out tokens containing digits or punctuation. To keep the dataset smaller, I also filter out words that occur fewer than 5 times in the whole corpus. This leaves 18846 news samples and 29543 distinct words. A rough sketch of this preprocessing follows.
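The helper names (tokenize, build_vocab) and the min_count parameter below are my own illustration, not the assignment's exact code:

```python
from collections import Counter

def tokenize(text):
    # Split on whitespace, lowercase, and keep only purely alphabetic
    # tokens, which drops anything containing digits or punctuation.
    return [tok.lower() for tok in text.split() if tok.isalpha()]

def build_vocab(documents, min_count=5):
    # Count every token across the whole corpus, then keep only the
    # words that occur at least min_count times.
    counts = Counter(word for doc in documents for word in tokenize(doc))
    return {word for word, n in counts.items() if n >= min_count}
```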

Part 2

I use scikit-learn's TruncatedSVD class to do the dimensionality reduction.
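A minimal sketch of that step; the wrapper function and the example component count are my own illustration:

```python
from sklearn.decomposition import TruncatedSVD

def reduce_dims(X, n_components):
    # TruncatedSVD works directly on sparse count matrices, unlike PCA,
    # which would require densifying and mean-centering X first.
    svd = TruncatedSVD(n_components=n_components)
    return svd.fit_transform(X)

# e.g. keep 50% of the 29543 original features:
# X_50 = reduce_dims(X, 14772)
```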

Part 3

model_id #1: GaussianNB

model_id #2: DecisionTreeClassifier
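Roughly, the model selection looks like this; the MODELS mapping and the use of default hyperparameters are my assumptions:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

MODELS = {
    1: GaussianNB,              # model_id #1
    2: DecisionTreeClassifier,  # model_id #2
}

def train(model_id, X_train, y_train):
    # Instantiate the chosen classifier with default hyperparameters
    # and fit it on the training split (fit returns the model itself).
    model = MODELS[model_id]()
    return model.fit(X_train, y_train)
```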

Part 4

Number of features kept at each reduction level: 100% (29543), 50% (14772), 25% (7386), 10% (2954), 5% (1477)

| Algorithm | D-Reduction | Accuracy | Precision | Recall | F-measure |
|---|---|---|---|---|---|
| GaussianNB | 1.00 | 0.6334 | 0.6525 | 0.6334 | 0.6367 |
| GaussianNB | 0.50 | 0.1095 | 0.1706 | 0.1095 | 0.0853 |
| GaussianNB | 0.25 | 0.1308 | 0.2024 | 0.1308 | 0.1065 |
| GaussianNB | 0.10 | 0.1257 | 0.2460 | 0.1257 | 0.1051 |
| GaussianNB | 0.05 | 0.1263 | 0.2755 | 0.1263 | 0.1126 |
| DecisionTreeClassifier | 1.00 | 0.4966 | 0.5022 | 0.4965 | 0.4967 |
| DecisionTreeClassifier | 0.50 | 0.1271 | 0.1290 | 0.1271 | 0.1274 |
| DecisionTreeClassifier | 0.25 | 0.1456 | 0.1490 | 0.1456 | 0.1468 |
| DecisionTreeClassifier | 0.10 | 0.1568 | 0.1583 | 0.1568 | 0.1572 |
| DecisionTreeClassifier | 0.05 | 0.1642 | 0.1675 | 0.1642 | 0.1653 |
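For reference, a sketch of how such scores can be computed with sklearn.metrics; the weighted averaging is my assumption about how the per-class precision/recall/F-measure were aggregated (it is consistent with recall matching accuracy in the table):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate(model, X_test, y_test):
    pred = model.predict(X_test)
    # average="weighted" aggregates per-class scores weighted by class
    # support, which is why weighted recall equals overall accuracy.
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, average="weighted", zero_division=0),
        "recall": recall_score(y_test, pred, average="weighted"),
        "f-measure": f1_score(y_test, pred, average="weighted"),
    }
```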

My most time-consuming bug was in Part 1: after filtering out the words that occur fewer than 5 times in the corpus, I forgot to rebuild the index (value) of each unique word (key) in my vocabulary dictionary, so the remaining words pointed at stale column positions. As a result, in Part 3 both classifiers, GaussianNB and DecisionTreeClassifier, performed really badly, with all scores far below what I expected.
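The fix is simply to rebuild the word-to-column mapping after filtering, e.g.:

```python
def build_index(vocab):
    # After rare words are removed, the column indices must be rebuilt
    # so they are contiguous again (0 .. len(vocab)-1); reusing the
    # pre-filtering indices leaves gaps and misaligns the columns.
    return {word: i for i, word in enumerate(sorted(vocab))}
```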

In general, GaussianNB performs better than DecisionTreeClassifier. As soon as the features are reduced to 50%, all scores for both classifiers drop dramatically. However, the different reduction levels (50%, 25%, 10%, 5%) all produce similar scores to one another.