This project demonstrates how to implement a Naive Bayes classifier for text classification using Python and scikit-learn. The classifier assigns social media posts, news articles, or NGO reports to categories such as human rights or sustainability.
The project uses the "Categorized News Articles" dataset, which is available for download on Kaggle.
Install the following Python libraries:
- numpy
- pandas
- scikit-learn
- matplotlib
- seaborn
You can install them using pip:
pip install numpy pandas scikit-learn matplotlib seaborn
- Load the dataset using pandas and explore its structure
- Preprocess the dataset by combining the title and short_description columns, filtering relevant categories, and encoding categories into numerical labels
- Split the dataset into training and testing sets
- Vectorize the text data using the Bag of Words model with TF-IDF weighting
- Train the Naive Bayes classifier using the training data
- Evaluate the classifier using accuracy, precision, recall, and F1-score
- Visualize the results using a confusion matrix
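The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the small in-memory DataFrame stands in for the Kaggle file, and the column name `category`, the category labels, the 80/20 split, and the choice of `MultinomialNB` are assumptions made for the example. Only `title` and `short_description` are taken from the description above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Kaggle dataset (assumption: the real file would be
# loaded with pandas and has a column holding the category name).
rights = [("Court upholds free speech", "Judges protect press freedom and civil rights"),
          ("Activists demand justice", "Protesters call for equal rights for refugees"),
          ("Report on prison abuse", "Rights group documents torture of detainees"),
          ("New law protects minorities", "Parliament bans discrimination against minorities"),
          ("Journalist released", "Rights advocates welcome release of jailed reporter")]
green = [("Solar farm opens", "Renewable energy project cuts carbon emissions"),
         ("City bans plastic bags", "Council promotes recycling to reduce waste"),
         ("Wind power grows", "Clean energy investment lowers emissions"),
         ("Forest restoration plan", "Tree planting helps climate and biodiversity"),
         ("Green building rules", "New standards reduce energy use and waste")]
other = [("Team wins final", "Late goal decides the championship match"),
         ("Star signs contract", "Striker joins the club for a record fee")]
rows = ([(t, d, "HUMAN RIGHTS") for t, d in rights]
        + [(t, d, "SUSTAINABILITY") for t, d in green]
        + [(t, d, "SPORTS") for t, d in other])
df = pd.DataFrame(rows, columns=["title", "short_description", "category"])

# Preprocess: combine the text columns, keep the relevant categories, encode labels.
df["text"] = df["title"] + " " + df["short_description"]
relevant = ["HUMAN RIGHTS", "SUSTAINABILITY"]  # hypothetical category names
df = df[df["category"].isin(relevant)].copy()
encoder = LabelEncoder()
df["label"] = encoder.fit_transform(df["category"])

# Split into training and testing sets, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

# Vectorize: fit TF-IDF on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train the Naive Bayes classifier and predict on the held-out data.
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)
y_pred = clf.predict(X_test_vec)

# Evaluate with accuracy, per-class precision/recall/F1, and a confusion matrix.
label_ids = list(range(len(encoder.classes_)))
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(
    y_test, y_pred, labels=label_ids,
    target_names=list(encoder.classes_), zero_division=0,
)
cm = confusion_matrix(y_test, y_pred, labels=label_ids)
print(f"Accuracy: {accuracy:.2f}")
print(report)
print(cm)
```

To visualize the result, the `cm` array can be passed to `seaborn.heatmap(cm, annot=True, fmt="d")`. Plotting is left out of the sketch so that it runs without a display.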
- Download the dataset from Kaggle and place it in the project directory
- Run the Python script containing the implementation to train and evaluate the classifier
To improve the classifier's performance, consider the following:
- Use a more complex model, such as Logistic Regression, Support Vector Machines, or a deep learning model such as BERT
- Resample the dataset to balance the categories, either by oversampling the minority classes or undersampling the majority classes
- Perform more advanced preprocessing, like stemming or lemmatization, to reduce the dimensionality of the text features
- Use other feature extraction techniques, such as word embeddings (e.g., Word2Vec, GloVe) or contextual embeddings (e.g., BERT, RoBERTa)
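As one illustration of the first two suggestions, the sketch below oversamples the smaller classes and then trains a Logistic Regression model on TF-IDF features. The helper `oversample_to_balance`, the column names `text` and `category`, and the toy training rows are hypothetical and only serve the example. Resampling should be applied to the training split only, so that duplicated rows do not leak into the test set.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def oversample_to_balance(frame, label_col, random_state=42):
    """Keep every original row and add random duplicates (drawn with
    replacement) until each class is as large as the largest class."""
    target = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col):
        parts.append(group)
        missing = target - len(group)
        if missing > 0:
            parts.append(group.sample(n=missing, replace=True, random_state=random_state))
    # Shuffle so that the duplicated rows are not grouped together.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


# Hypothetical, imbalanced training data: four rows of one class, two of the other.
train = pd.DataFrame({
    "text": [
        "court protects press freedom",
        "activists demand equal rights",
        "group documents abuse of detainees",
        "law bans discrimination",
        "solar farm cuts carbon emissions",
        "city promotes recycling to reduce waste",
    ],
    "category": ["HUMAN RIGHTS"] * 4 + ["SUSTAINABILITY"] * 2,
})

balanced = oversample_to_balance(train, "category")

# TF-IDF features feeding a Logistic Regression classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(balanced["text"], balanced["category"])

prediction = model.predict(["new wind farm lowers emissions"])
print(balanced["category"].value_counts())
print(prediction)
```

Wrapping the vectorizer and classifier in a single pipeline keeps the TF-IDF vocabulary tied to the training data, which makes it easier to swap in another estimator for comparison.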