You'll need a Python environment with a few libraries commonly used in machine learning.
This file does not need the full stack; the following command installs everything it uses:
pip install numpy pandas scikit-learn
- Import necessary libraries for data manipulation, machine learning, and evaluation.
- Load the dataset from a CSV file into a pandas DataFrame.
- Replace null values in the DataFrame with empty strings.
- Encode the 'Category' column labels: 'spam' as 0 and 'ham' as 1.
- Separate the feature (messages) and the target (labels).
- Split the data into training and test sets.
- Transform the text data into TF-IDF feature vectors.
- Train a logistic regression model using the training data.
- Use the trained model to make predictions on the test data.
- Evaluate the model's performance using a confusion matrix and accuracy score.
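
The steps above can be sketched roughly as follows. This is a minimal sketch, not the original script: the text column name 'Message', the function name run_pipeline, the default test_size and random_state values, and the default TfidfVectorizer and LogisticRegression settings are all assumptions made for illustration. Only the 'Category' column and the spam/ham encoding come from the description above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


def run_pipeline(raw_df, test_size=0.2, random_state=3):
    """Train and evaluate a spam/ham classifier on a DataFrame.

    Expects a 'Category' column ('spam' or 'ham') and a text column
    named 'Message' (the name 'Message' is an assumption).
    """
    # Replace null values with empty strings.
    df = raw_df.fillna("")

    # Encode labels: 'spam' -> 0, 'ham' -> 1.
    y = df["Category"].map({"spam": 0, "ham": 1}).astype(int)

    # Separate the feature (messages) from the target (labels).
    X = df["Message"].astype(str)

    # Split into training and test sets (split settings are assumed).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Fit TF-IDF on the training text only, then reuse it on the test text.
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Train logistic regression and predict on the test set.
    model = LogisticRegression()
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)

    # Rows are actual classes, columns are predicted classes.
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    acc = accuracy_score(y_test, y_pred)
    return cm, acc
```

To try it on a CSV file, something like `cm, acc = run_pipeline(pd.read_csv("mail_data.csv"))` should work, where "mail_data.csv" is a placeholder file name. Fitting the vectorizer on the training split only keeps test-set vocabulary out of the learned features.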
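Confusion matrix (rows: actual class, columns: predicted class; 0 = spam, 1 = ham):
            Predicted 0   Predicted 1
Actual 0        147            49
Actual 1          0          1197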
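With 'ham' (label 1) as the positive class:
True Negatives (TN): 147 (spam correctly predicted as spam)
False Positives (FP): 49 (spam incorrectly predicted as ham)
False Negatives (FN): 0 (ham incorrectly predicted as spam)
True Positives (TP): 1197 (ham correctly predicted as ham)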
Accuracy score: 0.964824120603015
An accuracy score of 96.48% means that the model predicted the correct category for about 96.48% of the messages in the test set (1,344 of 1,393).
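
The accuracy figure can be checked by hand from the four confusion-matrix counts reported above, since accuracy is the number of correct predictions divided by the total number of predictions. A short sketch of that arithmetic (the variable names are illustrative):

```python
# Counts copied from the confusion matrix reported above.
tn, fp, fn, tp = 147, 49, 0, 1197

total = tn + fp + fn + tp   # every test message falls in exactly one cell
correct = tn + tp           # predictions that match the true label
accuracy = correct / total

print(f"{correct} of {total} correct, accuracy = {accuracy:.4f}")
# prints: 1344 of 1393 correct, accuracy = 0.9648
```

All 49 errors are false positives, meaning every mistake in this run is a spam message that was labeled as ham.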