This project aims to classify questions into different categories as part of a chatbot application. The code is written in Python and uses various libraries and machine learning techniques.
The project begins with data preparation, where the labeled questions are loaded from a CSV file using Pandas.
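As a rough illustration of the loading step, the sketch below reads a small in-memory CSV with Pandas. The `query` and `label` column names come from the usage notes in this README; the sample rows and the `dropna` cleanup are assumptions added for the example, not part of the project's code.

```python
import io

import pandas as pd

# Stand-in for the real dataset; in practice pass the file path to pd.read_csv.
csv_text = """query,label
How do I reset my password?,account
What are your opening hours?,general
Where is my order?,shipping
"""

df = pd.read_csv(io.StringIO(csv_text))

# Assumption: rows with a missing question or label are not useful for training.
df = df.dropna(subset=["query", "label"])

queries = df["query"].tolist()
labels = df["label"].tolist()
```

Keeping the questions and labels as two parallel lists makes them easy to hand to the later preprocessing and training steps.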
The data preprocessing step is crucial for text data. The Hazard Motevasel library is used for text preprocessing. The following steps are performed:
- Special character removal
- Tokenization
- Normalization
- Removing stopwords
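The four steps above can be sketched in plain Python as follows. This is a minimal stand-in and does not use the project's preprocessing library: the `preprocess` function name, the regular expression, the lowercasing used for normalization, and the tiny stop word set are all assumptions made for illustration.

```python
import re

# Illustrative stop word list; a real one would be language-specific and much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "do", "i", "my", "to", "of"}


def preprocess(text):
    """Return a list of cleaned tokens for one question."""
    cleaned = re.sub(r"[^\w\s]", " ", text)   # 1. special character removal
    tokens = cleaned.split()                   # 2. tokenization on whitespace
    tokens = [t.lower() for t in tokens]       # 3. normalization (lowercasing only)
    return [t for t in tokens if t not in STOPWORDS]  # 4. stop word removal


tokens = preprocess("How do I reset my password?")
print(tokens)
```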
After preprocessing, the code builds an occurrence table for the words in the dataset. The table records how many times each word occurs in each class (label).
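One possible shape for such a table is a mapping from label to per-word counts. The sketch below assumes the questions are already tokenized; the `build_occurrence_table` name and the nested `Counter` layout are hypothetical choices, and the project's own table may be organized differently.

```python
from collections import Counter, defaultdict


def build_occurrence_table(token_lists, labels):
    """Count how often each word appears in the questions of each label."""
    table = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        table[label].update(tokens)
    return table


token_lists = [["reset", "password"], ["password", "forgot"], ["order", "late"]]
labels = ["account", "account", "shipping"]

table = build_occurrence_table(token_lists, labels)
print(table["account"]["password"])
```

A `Counter` returns 0 for words it has never seen, so looking up a word that does not occur in a class needs no special handling.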
The Naive Bayes Classifier is responsible for predicting the label (category) of a given question. It uses Bayes' Theorem to calculate the probability of a question belonging to each label.
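To make the idea concrete, here is a minimal multinomial Naive Bayes sketch. It scores each label by the log of its prior plus the log-likelihood of every known word, and picks the highest score. The `train` and `predict` names, the add-one (Laplace) smoothing, and the choice to skip words outside the vocabulary are assumptions for this example; the project's classifier may differ in those details.

```python
import math
from collections import Counter, defaultdict


def train(token_lists, labels):
    """Collect per-label word counts, per-label question counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(tokens)
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(tokens, word_counts, label_counts, vocab):
    """Return the label with the highest log posterior for the given tokens."""
    total_questions = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_questions in label_counts.items():
        score = math.log(n_questions / total_questions)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


token_lists = [["reset", "password"], ["forgot", "password"],
               ["order", "late"], ["track", "order"]]
labels = ["account", "account", "shipping", "shipping"]
model = train(token_lists, labels)

print(predict(["password", "reset"], *model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.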
In the validation step, k-fold cross-validation is used to assess the classifier's performance. Precision, recall, and F1-score are calculated for each fold, and the average scores are reported.
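The validation loop could look roughly like the sketch below. It is not the project's `k_fold_validation` function: the helper names (`k_fold_indices`, `macro_scores`, `cross_validate`), the contiguous unshuffled folds, and the macro-averaging of precision, recall, and F1 across labels are all assumptions chosen to keep the example short.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all labels present."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def cross_validate(samples, labels, k, train_fn, predict_fn):
    """Train and score on each fold, then average the three metrics."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        predictions = [predict_fn(model, samples[i]) for i in test_idx]
        truth = [labels[i] for i in test_idx]
        fold_scores.append(macro_scores(truth, predictions))
    return tuple(sum(s[j] for s in fold_scores) / len(fold_scores)
                 for j in range(3))


precision, recall, f1 = macro_scores(["a", "a", "b"], ["a", "b", "b"])
print(precision, recall, f1)
```

In practice the data should usually be shuffled before splitting, since a dataset sorted by label would otherwise produce folds that are missing whole classes.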
The prediction component is used to classify a set of new questions. The code reads new questions, preprocesses them, and uses the trained Naive Bayes Classifier to predict the labels for each question. The results are saved to a CSV file.
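The flow of this step can be sketched as below. To keep the example dependency-free it uses the standard `csv` module and writes to an in-memory buffer; the `write_predictions` helper, the `predicted_label` column name, and the keyword-based `toy_predict` function are hypothetical stand-ins for the project's own prediction code and trained classifier.

```python
import csv
import io


def write_predictions(questions, predict_fn, out_file):
    """Write one row per question with its predicted label."""
    writer = csv.writer(out_file)
    writer.writerow(["query", "predicted_label"])
    for question in questions:
        writer.writerow([question, predict_fn(question)])


def toy_predict(question):
    """Placeholder for preprocessing plus the trained Naive Bayes classifier."""
    return "shipping" if "order" in question.lower() else "general"


new_questions = ["Where is my order?", "What are your hours?"]

# A real run would open a file instead, e.g. open(path, "w", newline="").
buffer = io.StringIO()
write_predictions(new_questions, toy_predict, buffer)
output = buffer.getvalue()
print(output)
```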
To use this project for your own question classification task, follow these steps:
- Prepare Your Data: Create a CSV file containing labeled questions. Ensure it contains a `query` column for the text of the questions and a `label` column for the corresponding labels.
- Modify Data Loading: Update the code to load your dataset by replacing the file path in the `pd.read_csv` function.
- Run Preprocessing: Review the preprocessing steps in the `Preprocess` class. You can modify the special character removal, stop words, or any other text processing steps according to your dataset's requirements.
- Train the Model: You can adjust the k-fold validation settings in the `k_fold_validation` function to fit your dataset's needs. Once the model is trained, it will provide average precision, recall, and F1-score.
- Make Predictions: Modify the `prediction_result_to_csv` function to read and preprocess your new set of questions from a CSV file. Ensure that `condition_prob` and `words` are calculated based on your training data. After making predictions, the results will be saved to a CSV file.
The project provides metrics to assess the model's performance, including precision, recall, and F1-score. You can use these metrics to evaluate how well the model classifies questions into different categories.
This project is a starting point for building a question classification chatbot. It demonstrates the implementation of a Naive Bayes classifier, data preprocessing, and k-fold validation. You can further improve the model by refining the preprocessing steps, tuning hyperparameters, and increasing the size and diversity of the training dataset.