AG-News-Sentence-Classification

AG News Sentence Classification - I first loaded the dataset and looked at its basic attributes, such as unique values, empty values and class distribution. I then added the column names, which were absent from the dataset. There are 4 classes in total, and they are equally balanced.

Next I created some word features for each title and description: average word length, character length and word count. I then analysed these features by class. The mean of the average word length of the descriptions differed clearly between the classes; there is a plot of this in the notebook. I compared the same features for the titles as well, but the results were not as promising.

I then applied cleaning techniques to the title and description. The cleaning involved removing punctuation marks and numbers, lowercasing the letters, and lemmatization (converting the various forms of a word into its root word).

The algorithm I chose was a Naive Bayes classifier, as I had too little time to experiment with the hyperparameters of more powerful algorithms such as SVM. I used a term-frequency inverse-document-frequency (TF-IDF) vectorization scheme and discarded words that were present in more than 30 percent of the samples. This seems logical: each class makes up 25% of the samples, so a word that appears in more than 30% of them must occur in more than one class, and words present in too many samples carry little significance for telling the classes apart.

I increased the weight of the titles by multiplying them by 5, since the task is to classify the genre of an article and the headline alone is enough to do this some of the time. This scheme had shown promising results in a similar problem I worked on earlier, the only difference being that the earlier problem involved clustering.

I chose the number of words after a bit of experimentation. The final result I was able to obtain was 85.68%, with a maximum of 400 words for the title and 5000 words for the description in the vectorization scheme.
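
The cleaning, word features, 30 percent cut-off and title weighting described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the notebook's actual code. All function names are hypothetical, lemmatization is left out because it needs a third-party library, "average word" is assumed to mean average word length, and raw word counts stand in for TF-IDF values so that the example stays dependency-free.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase the text and drop punctuation and digits.

    Lemmatization is omitted here (assumption: the notebook uses an
    external library for it).
    """
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def word_features(text):
    """Character length, word count and average word length of one text."""
    words = text.split()
    total = sum(len(w) for w in words)
    return {
        "char_length": len(text),
        "word_count": len(words),
        "avg_word_length": total / len(words) if words else 0.0,
    }


def build_vocabulary(docs, max_df=0.3, max_words=5000):
    """Keep the most frequent words that occur in at most max_df of the docs."""
    term_freq = Counter()  # total occurrences of each word
    doc_freq = Counter()   # number of documents containing each word
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [w for w in term_freq if doc_freq[w] / len(docs) <= max_df]
    # Most frequent first; ties are broken alphabetically.
    kept.sort(key=lambda w: (-term_freq[w], w))
    return kept[:max_words]


def weighted_features(title, description, title_vocab, desc_vocab, title_weight=5):
    """Concatenate title and description counts, scaling the title part.

    Raw counts are a stand-in for TF-IDF values (assumption).
    """
    t_counts = Counter(title.split())
    d_counts = Counter(description.split())
    title_part = [title_weight * t_counts[w] for w in title_vocab]
    desc_part = [d_counts[w] for w in desc_vocab]
    return title_part + desc_part


# Tiny made-up example: "the" occurs in every document, so it is dropped.
example_docs = [clean_text(d) for d in (
    "The market fell.", "The team won!", "The probe landed.", "The chip shipped.",
)]
example_vocab = build_vocabulary(example_docs, max_df=0.3, max_words=3)
```

In the run described above, `max_words` would be 400 for the titles and 5000 for the descriptions. If the notebook uses scikit-learn, the same two settings correspond to the `max_df` and `max_features` parameters of `TfidfVectorizer`.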