Text Mining Using 20 Mini Newsgroups
This project focuses on feature extraction, feature selection, classification, and clustering in the context of text mining.
The project consists of the following four parts:
- Part 1: Extracting features
- Part 2: Classification
- Part 3: Feature selection
- Part 4: Document clustering
From the given document collection, we generate a feature vector for each document using Term Frequency (TF), Inverse Document Frequency (IDF), and Term Frequency-Inverse Document Frequency (TF-IDF) weights. Using these vectors, we classify the documents with four different classifiers, apply two different feature selection methods, and finally cluster the documents with the k-means and hierarchical clustering algorithms.
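To make the feature-extraction step concrete, here is a minimal sketch of how TF, IDF, and TF-IDF vectors could be computed for a small collection. It is not the project's actual code: the function names `tokenize` and `tf_idf_vectors`, the whitespace tokenizer, the length-normalized TF, and the unsmoothed `log(N / df)` form of IDF are all assumptions chosen for illustration, and the three sample documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real preprocessing
    # (stop-word removal, stemming, header stripping, and so on).
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return (vocabulary, idf, tf_vectors, tfidf_vectors) for a list of strings."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized for term in set(doc))

    # Assumed IDF variant: log(N / df), without smoothing.
    idf = {term: math.log(n_docs / df[term]) for term in vocabulary}

    tf_vectors, tfidf_vectors = [], []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        # Assumed TF variant: raw count divided by document length.
        tf = [counts[term] / total for term in vocabulary]
        tf_vectors.append(tf)
        tfidf_vectors.append([w * idf[term] for w, term in zip(tf, vocabulary)])
    return vocabulary, idf, tf_vectors, tfidf_vectors


# Hypothetical toy collection, not taken from the newsgroup data.
docs = ["the cat sat", "the dog barked", "the cat purred"]
vocab, idf, tf, tfidf = tf_idf_vectors(docs)
```

In this sketch a term that occurs in every document (here, "the") gets an IDF of zero and therefore drops out of the TF-IDF vector, which is the behavior that makes TF-IDF useful as input to the later classification and clustering parts.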