This analysis weighs the trade-offs of different estimators and vectorization strategies for a text classification problem. To compare the combinations of vectorizers and estimators, it uses the `Pipeline` and `GridSearchCV` objects in scikit-learn. For each combination, the `.cv_results_` attribute of the fitted grid search is also used to examine how long the estimator took to fit the data.
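A minimal sketch of this setup, using a toy corpus and illustrative grid values rather than the notebook's actual data or parameter grid:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny stand-in corpus; in the real analysis this is the "text" column
# of the ColBERT dataset with the binary humor label.
texts = [
    "why did the chicken cross the road to get to the other side",
    "the stock market closed lower on tuesday amid inflation fears",
    "i told my wife she was drawing her eyebrows too high she looked surprised",
    "the city council approved the new zoning regulations yesterday",
] * 5
labels = [1, 0, 1, 0] * 5

pipe = Pipeline([
    ("vect", CountVectorizer()),                 # replaced via the grid below
    ("clf", LogisticRegression(max_iter=1000)),
])

# Try both vectorizers, each with stop-word and vocabulary-size options
param_grid = {
    "vect": [CountVectorizer(), TfidfVectorizer()],
    "vect__stop_words": [None, "english"],
    "vect__max_features": [50, 500],
}

grid = GridSearchCV(pipe, param_grid, cv=2, n_jobs=-1)
grid.fit(texts, labels)

# cv_results_ records the mean fit time of every combination
results = pd.DataFrame(grid.cv_results_)
print(results[["param_vect", "mean_fit_time", "mean_test_score"]])
```

Because the vectorizer itself is a grid parameter, a single pipeline covers every vectorizer/option combination, and `mean_fit_time` makes the speed comparison directly available alongside the scores.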
The data come from Kaggle's "ColBERT Dataset", which was created for this paper. The `text` column is used to classify whether or not each text is humorous.
Note: The original dataset contains 200K rows of data. It is best to use the full dataset; if it is too large for your computer, please use `dataset-minimal.csv`, which has been reduced to 100K rows.
As a pre-processing step, the text is normalized with both stemming and lemmatizing before classification. For each technique, both `CountVectorizer` and `TfidfVectorizer` are used, with options for stop words and maximum features, to prepare the text data for the estimator.
Once the text data is prepared with the stemming and lemmatizing techniques, `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` are used as classification algorithms, and their performance is compared in terms of accuracy and speed.
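A minimal sketch of that comparison, with a tiny vectorized toy corpus standing in for the real dataset:

```python
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

texts = [
    "why did the chicken cross the road",
    "the senate passed the budget bill today",
    "i used to be a banker but i lost interest",
    "heavy rain is expected across the region tomorrow",
] * 10
labels = [1, 0, 1, 0] * 10

# MultinomialNB expects non-negative counts, so vectorize first
X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, random_state=0, stratify=labels
)

scores = {}
for name, clf in [
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
    ("DecisionTreeClassifier", DecisionTreeClassifier(random_state=0)),
    ("MultinomialNB", MultinomialNB()),
]:
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    fit_time = time.perf_counter() - start
    scores[name] = clf.score(X_test, y_test)
    print(f"{name:22s} accuracy={scores[name]:.3f} fit_time={fit_time:.4f}s")
```

Timing the `fit` call directly mirrors what `cv_results_` reports as `mean_fit_time` inside the grid search.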
The logistic regression models consistently outperform the others, especially when using stemming with count vectorization. These models offer the best balance of precision, recall, F1, and AUC, making them well suited for text classification tasks that require accurate class identification and discrimination. Although stemming slightly improves the metrics, it is more computationally expensive than lemmatizing.
The `MultinomialNB` models exhibit strong recall and F1 scores but fall short of logistic regression across all metrics.
The `DecisionTreeClassifier` models delivered competitive performance, but with higher computational costs and lower metrics than logistic regression.
Overall, the `logistic_stem_count` model stands out due to its robust performance across all key metrics. However, if computational efficiency is a concern, the `logistic_lemmatize_count` model is a viable alternative.
The best hyperparameters are listed below:
- `text_data/dataset.csv`: the dataset used in the analysis.
- `images/`: metrics comparison charts.
- `notebooks/NLP-Humor-Classifier.ipynb`: Jupyter notebook with the code for the data analysis.
- `README.md`: summary of findings and link to the notebook.
The detailed analysis and code can be found in the Jupyter notebook here.