URL Classifier using NLP and Machine Learning

Overview
Project Deployment
Usage
Local Device Installation
Database for Feedback
Training Data Collection
Data Modeling and Model Architecture
Model Performance
Contributing
Acknowledgments

Overview

The URL Classifier is a machine learning application that classifies URLs as safe or malicious. Identifying potentially harmful URLs helps mitigate cyber threats and phishing attacks. This project combines Natural Language Processing (NLP) with lexical features to build the classifier. It is hosted on Hugging Face Spaces and uses a Gradio interface for user interaction.

Project Deployment

The project is deployed on Hugging Face Spaces with a Gradio interface, which lets users enter a URL and get a prediction along with an explanation of why the prediction was made. Users can also flag incorrect results, which stores the URL and its correct type (malicious or safe) in a MySQL database. This feedback can then be used to retrain and improve the model over time. Access the live project here.

Usage

To utilize this project, you have two options:

Visit the hosted project here and use it through the web interface or API endpoints.
Run the project locally using the installation steps provided below.

Once you have access, follow these steps:

Input the URL you want to classify.
The classifier will predict whether the URL is safe or malicious.
If the prediction is incorrect, you can flag it for further analysis, and the URL will be stored in the database.

Local Device Installation

Requirements

To use this project effectively, ensure you have the following prerequisites:

Python 3.8 or higher
mysql-connector-python 8.1.0
numpy 1.23.5
pandas 1.5.3
scikit_learn 1.2.2
tensorflow 2.12.0
nltk 3.8.1
gradio 3.40.1

Installation

To set up this project, follow these steps:

Clone this repository to your local machine using the following command:
```
git clone https://github.com/Munna0912/URL_CLASSIFIER.git
```
Create a virtual environment:
```
python -m venv env
```
Activate the virtual environment:
- On Windows:
```
env\Scripts\activate
```
- On Linux/MacOS:
```
source env/bin/activate
```
Install the required packages:
```
pip install -r requirements.txt
```
Create a MySQL database (url_classifier) with the following credentials and update these details in app.py:
- Host: localhost
- User: root
- Password: root
- Database: url_classifier
Run the create_table.sql script on your MySQL server to create the table for storing URLs and their classifications.
Launch the web app using the following command:
```
python app.py
```
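
The database settings from the steps above could be grouped in app.py roughly as follows. This is a minimal sketch: the names `DB_CONFIG` and `get_connection` are hypothetical and not taken from the repository, and the values are the local defaults listed above.

```python
# Hypothetical sketch of the connection settings described above; the actual
# variable names used in app.py may differ.
DB_CONFIG = {
    "host": "localhost",
    "user": "root",
    "password": "root",
    "database": "url_classifier",
}


def get_connection(config=DB_CONFIG):
    """Open a MySQL connection using mysql-connector-python."""
    # Imported lazily so the settings can be inspected without the driver installed.
    import mysql.connector

    return mysql.connector.connect(**config)
```

If your MySQL server uses different credentials, change the values in the dictionary rather than the connection call.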

Database for Feedback

Feedback data is stored and retrieved from a MySQL database provided by www.freesqldatabase.com. This free service offers a 5MB MySQL database for data management.

Training Data Collection

The data for this project was collected from various sources:

Used for Training:

Used for Testing:

Phishing_Site_URLs_Kaggle Dataset

For comprehensive insights into data processing, please refer to the "Data Processing Notebook."

Data Modeling and Model Architecture

Our URL Classifier project employs a two-pronged approach to URL classification:

NLP-based Model: This model uses character n-grams to identify patterns in URLs. Specifically, it applies character-level 3-gram vectorization. The n-gram model can pick up patterns often associated with malicious intent, such as direct IP addresses or keywords like "pay," "offer," and "OTP."
Lexical Features Model: This model is based on a set of 18 lexical features associated with URLs. These features include whether the URL has an IP address, the presence of "http" or "https," URL length, the count of dots (.), the count of "www," and more. These features contribute to the model's ability to differentiate between safe and malicious URLs.
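
To illustrate what character 3-gram vectorization means for the NLP-based model, here is a minimal standard-library sketch. The function names are illustrative assumptions and this is not the project's actual vectorizer; in scikit-learn, a comparable result would likely come from `CountVectorizer(analyzer="char", ngram_range=(3, 3))`.

```python
from collections import Counter
from typing import List


def char_ngrams(url: str, n: int = 3) -> List[str]:
    """Return all overlapping character n-grams of a URL, lowercased."""
    text = url.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(url: str, n: int = 3) -> Counter:
    """Count how often each character n-gram occurs in the URL."""
    return Counter(char_ngrams(url, n))


print(char_ngrams("pay.io"))  # ['pay', 'ay.', 'y.i', '.io']
```

Each distinct 3-gram becomes one column of the feature vector, so a keyword such as "pay" contributes a signal wherever it appears in the URL.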

The features used for the lexical features method are:

having_ip_address: Whether the URL includes an IP address or not
abnormal_url: Whether the URL is in proper formatting or not
count_dot: The number of dots (.) in the URL
count_www: The number of "www" in the URL
count_atrate: The number of "@" in the URL
no_of_dir: The number of directories in the URL
no_of_embed: The number of "/" in the URL
count_https: The number of "https" in the URL
count_http: The number of "http" in the URL
count_percent: The number of "%" in the URL
count_ques: The number of "?" in the URL
count_hyphen: The number of "-" in the URL
count_equal: The number of "=" in the URL
Length of URL
Hostname Length
Count of Digits
Count of Alpha-Numerical Characters
Length of First Directory
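
Several of the features listed above can be computed with the Python standard library alone. The following is a sketch under stated assumptions, not the project's implementation: the function name, the IPv4 pattern, and the dictionary keys for the unnamed features (`url_length`, `hostname_length`, `count_digits`, `count_alnum`, `fd_length`) are hypothetical, and `abnormal_url` and `no_of_embed` are omitted because their exact definitions are not given here.

```python
import re
from urllib.parse import urlparse

# Assumed pattern for a dotted IPv4 address; the project may use a broader check.
IPV4_PATTERN = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def lexical_features(url: str) -> dict:
    """Compute a subset of the lexical features for one URL."""
    # urlparse only fills in netloc when a scheme is present, so add one for parsing.
    parsed = urlparse(url if "://" in url else "http://" + url)
    path = parsed.path
    # First path segment after the leading slash, or "" when there is none.
    first_dir = path.split("/")[1] if path.count("/") >= 1 else ""
    return {
        "having_ip_address": int(bool(IPV4_PATTERN.search(parsed.netloc))),
        "count_dot": url.count("."),
        "count_www": url.count("www"),
        "count_atrate": url.count("@"),
        "no_of_dir": path.count("/"),
        "count_https": url.count("https"),
        # Note: every "https" also contains "http", so this count includes both.
        "count_http": url.count("http"),
        "count_percent": url.count("%"),
        "count_ques": url.count("?"),
        "count_hyphen": url.count("-"),
        "count_equal": url.count("="),
        "url_length": len(url),
        "hostname_length": len(parsed.netloc),
        "count_digits": sum(c.isdigit() for c in url),
        "count_alnum": sum(c.isalnum() for c in url),
        "fd_length": len(first_dir),
    }


features = lexical_features("http://192.168.0.1/login/verify?id=1")
```

The resulting dictionary can be turned into a fixed-order numeric vector before it is passed to a model.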

The two models are merged as a TensorFlow model, which takes both inputs and outputs a final prediction based on a weighted average of the two scores.
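
The weighted average of the two scores can be sketched as below. The weight of 0.5, the 0.5 decision threshold, and the function names are assumptions for illustration only; in the actual TensorFlow model the weighting may be learned or set differently.

```python
def combine_scores(nlp_score: float, lexical_score: float, nlp_weight: float = 0.5) -> float:
    """Weighted average of the two model scores, each assumed to lie in [0, 1]."""
    if not 0.0 <= nlp_weight <= 1.0:
        raise ValueError("nlp_weight must be between 0 and 1")
    return nlp_weight * nlp_score + (1.0 - nlp_weight) * lexical_score


def label(score: float, threshold: float = 0.5) -> str:
    """Map a combined score to a class label (higher score = more suspicious)."""
    return "malicious" if score >= threshold else "safe"


combined = combine_scores(0.9, 0.3)  # equal weights: (0.9 + 0.3) / 2
print(label(combined))
```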

The model performance is evaluated using accuracy, precision, recall, and F1-score metrics.
For more information, see the modules in Utilities and the URL Classification Paper.
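
For reference, the four evaluation metrics can be computed from true and predicted labels as in the sketch below. It assumes labels are encoded as 1 for malicious and 0 for safe, which is an illustrative choice rather than the project's documented encoding; scikit-learn's `accuracy_score` and `precision_recall_fscore_support` provide the same quantities.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    # Guard each ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```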

Model Performance

The model was evaluated on two datasets:

Testing Data Accuracy: 98.4%
Unseen Dataset Accuracy: 71.1%

The model classifies URLs from its test split with high accuracy, while the lower score on the unseen dataset suggests it generalizes less well to URLs from other sources. For details on how the model is trained and validated, refer to the "Model Training Notebook."

Contributing

I welcome contributions to enhance the accuracy and functionality of the URL Classifier project. Here are some ways you can contribute:

Data: If you have additional datasets or sources of URL data that can enhance the model's training, please share them with me.
Model Improvements: If you have ideas or techniques to improve the model's performance, feel free to contribute code or suggestions.
Feedback: Use the project interface to provide feedback on false URL classifications to help me refine the model.

To contribute, please refer to the project's GitHub repository here.
If you have any questions or feedback about this project, you can contact me at munna0912@gmail.com or connect with me on LinkedIn.

Munna Ram - Project Lead and Developer

Acknowledgments

I would like to extend my gratitude to the following entities and communities:

Kaggle for providing valuable datasets.
URLhaus, Moz, Tranco, Cisco Umbrella, DomCop, and Majestic Million for their data sources.
Gradio for their framework that powers the web interface.
Hugging Face for providing free resources to deploy the project.
Free SQL Database for providing the MySQL database server.

Your contributions and support are integral to the success of this project. Thank you for being part of my effort to enhance cybersecurity through URL classification.

Feel free to explore the project and provide feedback to help make the internet a safer place!
