The goal is to predict the category of a news article from its title alone.
The data is the UCI News Aggregator dataset (https://archive.ics.uci.edu/ml/datasets/News+Aggregator), which contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10 and August 10, 2014. The categories are business, science and technology, entertainment, and health. Different articles that refer to the same news item (e.g., several articles about recently released employment statistics) are also grouped together.
- The dataset is loaded from a CSV file into a Pandas DataFrame.
- Stop words are removed from each title and stemming reduces every remaining word to its stem, yielding a cleaned string that keeps the informative words of the title.
- Because the labels are characters, they are encoded as integers between 0 and n_classes-1.
- The features and labels are stored in a pickle file.
- After loading the pickle files, the features and labels are split into training and testing sets.
- The training and testing titles are then turned into TF-IDF vectors using TfidfVectorizer.
- Because the TF-IDF input is very high-dimensional, feature selection is applied to the training and testing features, keeping the 10% of features that are most informative.
- GridSearchCV is used to identify the best parameters for each classifier.
- To check for overfitting to the training set or a data leak, cross_val_score is used to estimate the model's accuracy: the data is split, a model is fitted, and the score is computed 10 consecutive times (with a different split each time).
- The final quantitative evaluation of the best model is done on the testing dataset.

Note: cross_val_score is applied both to the full training features and to the transformed features (the features kept after feature selection). Minimal Python sketches of the pipeline steps above follow.
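A minimal sketch of the loading and cleaning steps. The tab-separated layout and column names follow the dataset's newsCorpora.csv file; the Porter stemmer and the NLTK stop-word list are assumptions here, since the exact stemmer is not specified above.

```python
import re

import pandas as pd
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once
from nltk.stem import PorterStemmer

# Column layout of the UCI newsCorpora.csv file (tab-separated, no header row).
cols = ["ID", "TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
df = pd.read_csv("newsCorpora.csv", sep="\t", names=cols)

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_title(title):
    # Keep letters only, lowercase, drop stop words, and stem what remains.
    words = re.sub(r"[^a-zA-Z]", " ", title).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in stop_words)

df["clean_title"] = df["TITLE"].apply(clean_title)
```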
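Continuing the sketch, label encoding and pickling might look as follows; the pickle file names are illustrative.

```python
import pickle

from sklearn.preprocessing import LabelEncoder

# CATEGORY holds one-character class labels (b, t, e, m);
# LabelEncoder maps them to integers 0..n_classes-1.
encoder = LabelEncoder()
labels = encoder.fit_transform(df["CATEGORY"])

# Cache the cleaned titles and encoded labels so preprocessing runs only once.
with open("features.pkl", "wb") as f:
    pickle.dump(df["clean_title"].tolist(), f)
with open("labels.pkl", "wb") as f:
    pickle.dump(labels, f)
```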
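Loading the pickles back, splitting, vectorizing, and selecting features. The 80/20 split ratio, the random seed, and the chi2 scoring function are assumptions; only the 10% selection rate comes from the description above.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import train_test_split

with open("features.pkl", "rb") as f:
    features = pickle.load(f)
with open("labels.pkl", "rb") as f:
    labels = pickle.load(f)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Fit TF-IDF on the training titles only; reuse the fitted vocabulary on the test set.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Keep only the strongest 10% of features; chi2 is a common score for sparse,
# non-negative text features such as TF-IDF.
selector = SelectPercentile(chi2, percentile=10)
X_train_sel = selector.fit_transform(X_train_tfidf, y_train)
X_test_sel = selector.transform(X_test_tfidf)
```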
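A hedged example of the parameter search for one of the four classifiers (Multinomial Naive Bayes); the grid values are illustrative, and the tables below report alpha = 0.1 as the winner.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Illustrative grid over the smoothing parameter alpha.
param_grid = {"alpha": [0.01, 0.1, 0.5, 1.0]}
grid = GridSearchCV(MultinomialNB(), param_grid, cv=5, n_jobs=-1)
grid.fit(X_train_sel, y_train)
print(grid.best_params_)
```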
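The validation step as described above: cross_val_score fits and scores the model on 10 different splits (scikit-learn uses StratifiedKFold by default for classifiers).

```python
from sklearn.model_selection import cross_val_score

# Ten fits on ten different splits. Large variance across folds, or fold scores
# far above the test score, would point to overfitting or a data leak.
scores = cross_val_score(grid.best_estimator_, X_train_sel, y_train, cv=10)
print(scores.mean(), scores.std())
```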
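Finally, the best model is evaluated once on the held-out test set.

```python
from sklearn.metrics import accuracy_score

# Final quantitative check on data never touched during tuning.
y_pred = grid.best_estimator_.predict(X_test_sel)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```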
Results on the full training features:

Classifier | Best Parameters | Training time (s) | Accuracy (%) |
---|---|---|---|
Multinomial Naive Bayes | alpha = 0.1 | 0.116 | 92.7 |
Logistic regression | C = 1, multi_class = multinomial, solver = newton-cg | 23.688 | 94.29 |
AdaBoost | n_estimators = 80, learning_rate = 0.5 | 21534.235 | 88.62 |
Random Forest | criterion = gini, n_estimators = 100, max_features = sqrt | 37419.736 | 92.13 |
Results on the transformed features:

Classifier | Best Parameters | Training time (s) | Accuracy (%) |
---|---|---|---|
Multinomial Naive Bayes | alpha = 0.1 | 0.108 | 91.31 |
Logistic regression | C = 1, multi_class = multinomial, solver = newton-cg | 14.739 | 92.79 |
AdaBoost | n_estimators = 50, learning_rate = 1 | 34.28 | 59.75 |
Random Forest | criterion = gini, n_estimators = 10, max_features = sqrt | 119.375 | 92.01 |
- Random Forest reaches roughly the same accuracy on both feature sets, but training on the full features takes over 10 hours, whereas the transformed features take about 2 minutes.
- Multinomial Naive Bayes and Logistic regression end up with the same best parameters on both feature sets.
- AdaBoost's accuracy improves substantially with the full feature set, but at the cost of an enormous training time (roughly 6 hours).
- Logistic regression gains about 1.5 percentage points of accuracy on the full feature set, and since the training-time difference between the two feature sets is only about 9 seconds, the full features are clearly worth using.
Logistic regression gives the highest cross-validated accuracy, 94.29%, using stratified 10-fold splits (StratifiedKFold) on the training dataset, and reaches 94.15% accuracy on the held-out testing dataset.