Spam Detector Using NLP

This project demonstrates how to build an intelligent system to classify emails as Spam or Not Spam using Machine Learning and Natural Language Processing (NLP). It serves as a step-by-step guide for students to understand and implement such a system.

What Will You Learn?

Through this project, you’ll gain hands-on experience in:

Preprocessing textual data using NLP techniques.
Extracting meaningful numerical features from text with TF-IDF Vectorization.
Training a Machine Learning model (Naive Bayes) for classification tasks.
Building an interactive web-based application using Streamlit.

How Does It Work?

The system classifies an email message into two categories:

Spam: Unwanted or promotional emails.
Not Spam: Important, relevant emails.

This process involves:

Text Preprocessing: Cleaning and preparing text for analysis.
Feature Extraction: Converting text into numerical data using TF-IDF.
Model Training: Using a Naive Bayes classifier to detect patterns.
Interactive Interface: Analyzing new emails via a web app.

Project Workflow

Dataset

What’s the dataset?
- The SMS Spam Collection Dataset (labeled SMS messages).
How to get it?
- Download from Kaggle.
File location:
- Save the file as spam.csv in the data/ directory.
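
As a rough sketch of reading the dataset, the snippet below parses labeled rows with Python's built-in `csv` module. The column names `v1` (label) and `v2` (message text) are an assumption about the Kaggle export, and `load_messages` is a hypothetical helper, not part of the project; check the header of your own `spam.csv` before relying on them.

```python
import csv
import io

def load_messages(fileobj, label_col="v1", text_col="v2"):
    """Read (text, is_spam) pairs from an open CSV file.

    The default column names are assumptions about the Kaggle file layout.
    """
    reader = csv.DictReader(fileobj)
    return [(row[text_col], row[label_col] == "spam") for row in reader]

# Small in-memory sample standing in for data/spam.csv
sample = io.StringIO("v1,v2\nham,See you at lunch\nspam,WIN a free prize now\n")
rows = load_messages(sample)
print(rows)
```

For the real file you would pass an open file handle instead, for example `open("data/spam.csv", encoding="latin-1", newline="")`. The `latin-1` encoding is also an assumption; adjust it if the file fails to decode.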

Text Preprocessing

Before training the model, we clean the text:

Lowercasing: Converts characters to lowercase.
Removing Special Characters: Strips symbols, numbers, and extra spaces.
Stopword Removal: Removes common words like “is”, “the”, “and” using NLTK.
Lemmatization: Reduces words to their base form (e.g., “running” → “run”).

Why preprocessing?

It reduces noise in the text.
It keeps the model focused on meaningful words.
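
The cleaning steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: `clean_text` is a hypothetical name, and the small `STOPWORDS` set is a stand-in for the NLTK English stopword list. Lemmatization is left out to keep the sketch dependency-free; in the project, a lemmatizer (for example NLTK's `WordNetLemmatizer`, an assumption here) would be applied to each remaining token.

```python
import re

# Tiny stand-in stopword set (assumption); the project uses NLTK's list.
STOPWORDS = {"a", "an", "and", "is", "the", "to", "now", "of", "in"}

def clean_text(text: str) -> str:
    """Lowercase, strip non-letters, and drop stopwords."""
    text = text.lower()
    # Replace anything that is not a letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of extra spaces
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_text("Win a FREE iPhone now!!! Call 12345")
print(cleaned)
```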

Feature Extraction with TF-IDF

What is TF-IDF?
- A technique that transforms text into numerical values based on:
  - TF: Term Frequency, how often a word appears in a single message.
  - IDF: Inverse Document Frequency, how rare a word is across all messages in the dataset.
Why use it?
- It gives distinctive words more weight than words that appear in almost every message.
Example:
- In the phrase "Win a free iPhone now!", distinctive words like "Win" and "free" typically get higher weights than common words like "a" or "now". The exact weights depend on the rest of the dataset.
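
To make the weighting concrete, here is a small worked example using the textbook formula `tf * log(N / df)` on a made-up three-message corpus. The corpus and the `tf_idf` helper are illustrative assumptions. scikit-learn's `TfidfVectorizer` uses a smoothed variant of IDF and normalizes each vector by default, so its numbers will differ, but the ranking idea is the same.

```python
import math
from collections import Counter

# Made-up corpus for illustration only
docs = [
    "win a free iphone now",
    "can we meet now",
    "a meeting is on the calendar now",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many messages does each word appear?
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(doc):
    """Return {word: tf * idf} for one tokenized message."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }

weights = tf_idf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {weight:.3f}")
```

In this toy corpus "now" appears in every message, so its IDF is `log(3/3) = 0` and it contributes nothing, while "win" and "free" appear in only one message and receive the largest weights.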

Model Training

Which model?
- Naive Bayes Classifier — fast, simple, and effective for text classification.
Why Naive Bayes?
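- Works well with sparse, high-dimensional word features such as TF-IDF vectors.
- Calculates a probability for each class, which can be reported as a confidence score.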
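
To show where the per-class probabilities come from, the sketch below implements a multinomial Naive Bayes classifier with Laplace smoothing in plain Python. It is a teaching aid under stated assumptions, not the project's training code: it works on raw word counts rather than TF-IDF values, the four training messages are invented, and `train_nb` and `predict_proba` are hypothetical names. In the project itself, scikit-learn's Naive Bayes implementation would play this role.

```python
import math
from collections import Counter, defaultdict

def train_nb(messages, labels, alpha=1.0):
    """Count words per class; alpha is the Laplace smoothing constant."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}

def predict_proba(model, text):
    """Return {label: probability} for one message."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    log_scores = {}
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.split():
            if word not in model["vocab"]:
                continue  # words never seen in training are ignored
            score += math.log((counts[word] + alpha)
                              / (total_words + alpha * vocab_size))
        log_scores[label] = score
    # Convert log scores to probabilities that sum to 1
    top = max(log_scores.values())
    exp_scores = {k: math.exp(s - top) for k, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}

# Invented toy training set
messages = ["win free prize now", "free cash claim prize",
            "meeting tomorrow at noon", "lunch tomorrow with john"]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(messages, labels)
probs = predict_proba(model, "claim free prize")
print(probs)
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log probabilities can simply be added.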

Interactive Interface with Streamlit

What is Streamlit?
- A Python library for creating interactive web apps.
What does it do?
- Lets users input an email and see if it’s Spam or Not Spam, with a confidence score.
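
A minimal sketch of how such a page could be laid out is shown below. It is not the project's `GUI/main.py`: `format_prediction` and `main` are hypothetical names, and `classify` is an assumed callable that maps email text to a dictionary of class probabilities. Only the standard Streamlit calls `st.title`, `st.text_area`, `st.button`, and `st.write` are used.

```python
def format_prediction(probs: dict) -> str:
    """Turn class probabilities into the label and confidence shown to the user."""
    label = max(probs, key=probs.get)
    return f"Prediction: {label} (confidence: {probs[label]:.0%})"

def main(classify):
    """Render the page; `classify` is a hypothetical callable: text -> {label: prob}."""
    # Imported here so format_prediction can be used without Streamlit installed
    import streamlit as st

    st.title("Spam Detector")
    email_text = st.text_area("Email text")
    # Only classify when the button is clicked and the box is not empty
    if st.button("Analyze Email") and email_text.strip():
        st.write(format_prediction(classify(email_text)))

print(format_prediction({"Spam": 0.95, "Not Spam": 0.05}))
```

To turn this into a running app, the script would call `main(...)` at module level with a function that wraps the saved model and vectorizer, and then be started with `streamlit run`.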

Directory Structure

spam_detector/
├── data/
│   └── spam.csv          # Dataset
├── model/
│   ├── spam_model.joblib # Trained model
│   └── vectorizer.joblib # TF-IDF vectorizer
├── src/
│   ├── train_model.py    # Script to train the model
│   └── predict.py        # Script to make predictions
├── GUI/
│   └── main.py           # Streamlit app
├── requirements.txt      # Required libraries
└── README.md             # Project documentation

Step-by-Step Setup

Clone the Repository

git clone https://github.com/your_username/spam-detector-nlp.git
cd spam-detector-nlp

Create a Virtual Environment

python -m venv .venv

Activate the Virtual Environment

Windows:
```
.venv\Scripts\activate
```
macOS/Linux:
```
source .venv/bin/activate
```

Install Dependencies

pip install -r requirements.txt

Usage Instructions

Train the Model

Run the training script:

python src/train_model.py

This generates:

spam_model.joblib (trained model).
vectorizer.joblib (TF-IDF vectorizer).
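
For orientation, here is a hedged sketch of what a training script along these lines might do, using scikit-learn and joblib from the project's listed libraries. It is not the contents of `src/train_model.py`: the four training messages are invented, `MultinomialNB` is assumed as the Naive Bayes variant, and a temporary directory stands in for `model/`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data; the real script would read data/spam.csv
texts = ["win free prize now", "free cash claim prize",
         "meeting tomorrow at noon", "lunch tomorrow with john"]
labels = ["spam", "spam", "ham", "ham"]

# Fit the vectorizer and the classifier on the training messages
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
model = MultinomialNB()
model.fit(features, labels)

# Save both artifacts; a temp directory stands in for model/
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "spam_model.joblib")
vectorizer_path = os.path.join(out_dir, "vectorizer.joblib")
joblib.dump(model, model_path)
joblib.dump(vectorizer, vectorizer_path)

# Reload and classify a new message, as a prediction script would
loaded_model = joblib.load(model_path)
loaded_vectorizer = joblib.load(vectorizer_path)
new_features = loaded_vectorizer.transform(["claim your free prize"])
prediction = loaded_model.predict(new_features)[0]
confidence = loaded_model.predict_proba(new_features).max()
print(prediction, round(confidence, 2))
```

Saving the vectorizer alongside the model matters because new messages must be transformed with the same vocabulary and IDF values that were learned during training.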

Run the Streamlit App

Start the web app:

streamlit run GUI/main.py

Analyze Emails

Open the link printed in the terminal (e.g., http://localhost:8501) in your browser.
Paste the email text and click Analyze Email.

Example Emails

Spam Email:

Congratulations! You’ve won $1,000,000! Click here to claim now!

Prediction: Spam
Confidence Score: 95%

Not Spam Email:

Hi John, can we reschedule our meeting to tomorrow at 2 PM?

Prediction: Not Spam
Confidence Score: 99%

Key Technologies

Python: The programming language.
Libraries:
- Streamlit: For the web interface.
- Scikit-learn: For the ML model.
- NLTK: For text preprocessing.
- Joblib: For saving/loading models.

How the System Works

Preprocessing: Clean and standardize the input text.
Feature Extraction: Convert text into numerical data.
Training: Train the Naive Bayes model.
Prediction: Classify new emails.

Future Enhancements

Support for Additional Languages.
Advanced Models: Experiment with deep learning models.
Batch Classification: Process multiple emails simultaneously.

Learning Outcomes

End-to-end understanding of text classification.
Experience with data preprocessing and feature extraction.
Deployment skills with Streamlit.

Credits

Made with ❤️ by Amr Alkhouli

License

This project is licensed under the MIT License.

amrpyt/Spam-Detector-using-NLP