Malware and Phishing Link Detection

This project uses machine learning to classify URLs as malware or phishing, based on the Malware URL dataset from Kaggle. It relies on Python libraries including scikit-learn, XGBoost, LightGBM, and Seaborn for data preprocessing, token analysis, and model training, and reports each model's accuracy at the end.

Data Cleaning

The first step is to clean and preprocess the dataset. This involves handling missing values, removing duplicates, and converting the data into a format suitable for machine learning.
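
The cleaning steps described above could look roughly like the following pandas sketch. The column names `url` and `type` and the helper `clean_urls` are assumptions made for illustration; adjust them to the actual columns in the Kaggle CSV.

```python
import pandas as pd

def clean_urls(df: pd.DataFrame) -> pd.DataFrame:
    """Drop missing rows and duplicate URLs, then add an integer label column."""
    # Assumed column names: "url" for the link, "type" for the class label
    df = df.dropna(subset=["url", "type"]).copy()
    df["url"] = df["url"].str.strip()
    df = df[df["url"] != ""]
    df = df.drop_duplicates(subset="url").reset_index(drop=True)
    # Encode the string labels as integer codes for the classifiers
    df["label"] = df["type"].astype("category").cat.codes
    return df

# Tiny made-up sample standing in for the real dataset
raw = pd.DataFrame({
    "url": [
        "http://a.example/login",
        "http://a.example/login",
        None,
        " http://b.example/x.exe ",
    ],
    "type": ["phishing", "phishing", "malware", "malware"],
})
cleaned = clean_urls(raw)
print(cleaned)
```

Integer codes are used here because some gradient-boosting libraries expect numeric class labels; a separate mapping back to the label names can be kept if needed.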

Common Tokens Analysis

We analyze the dataset to identify common tokens in the URLs. This analysis can provide insights into the characteristics of malware and phishing URLs.
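
As a minimal sketch of this kind of token analysis, the snippet below splits each URL on non-alphanumeric characters and counts tokens per class. The stop-token list, the helper names, and the sample URLs are all illustrative assumptions, not the project's actual code.

```python
import re
from collections import Counter

# Tokens that appear in nearly every URL (illustrative list, an assumption)
STOP_TOKENS = {"http", "https", "www", "com"}

def tokenize_url(url: str) -> list[str]:
    """Split a URL into lowercase alphanumeric tokens, dropping stop tokens."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return [t for t in tokens if t and t not in STOP_TOKENS]

def common_tokens(urls, labels, top_n=5):
    """Return the top_n most frequent tokens for each class label."""
    counters = {}
    for url, label in zip(urls, labels):
        counters.setdefault(label, Counter()).update(tokenize_url(url))
    return {label: c.most_common(top_n) for label, c in counters.items()}

# Made-up sample URLs
urls = [
    "http://secure-login.example.com/verify/account",
    "http://login.example.net/account/update",
    "http://files.example.org/download/setup.exe",
]
labels = ["phishing", "phishing", "malware"]
top = common_tokens(urls, labels, top_n=3)
print(top)
```

Comparing the per-class lists side by side is one way to spot tokens that are frequent in one class and rare in the other.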

Visualization with Word Cloud

We visualize the common tokens as a word cloud, which sizes each token by how frequently it appears in the dataset. This makes the dominant tokens easy to spot at a glance.
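
One way to produce such a plot is with the third-party `wordcloud` package, sketched below. Whether the project uses this package is an assumption, and the helper names, output path, and sample URLs are hypothetical.

```python
import re
from collections import Counter

def token_frequencies(urls):
    """Count alphanumeric tokens across URLs, ignoring the scheme and 'www'."""
    skip = {"http", "https", "www"}
    counts = Counter()
    for url in urls:
        tokens = re.split(r"[^a-z0-9]+", url.lower())
        counts.update(t for t in tokens if t and t not in skip)
    return counts

def render_word_cloud(frequencies, out_path="tokens_wordcloud.png"):
    """Render token frequencies to a PNG file; needs `pip install wordcloud`."""
    # Imported here so the counting helper works without the package installed
    from wordcloud import WordCloud
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
    return out_path

# Made-up sample URLs
freqs = token_frequencies([
    "http://secure-login.example.com/verify/account",
    "http://login.example.net/account/update",
])
print(freqs.most_common(3))
# render_word_cloud(freqs)  # uncomment once the wordcloud package is installed
```

Generating from a frequency mapping rather than raw text keeps the tokenization under your control instead of relying on the package's default word splitting.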

Model Training

We train several machine learning models to classify URLs, including:

XGBoost
LightGBM
Random Forest Classifier

Each model's accuracy is displayed during the run, so you can compare their performance.
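
A training loop over these models might be sketched as follows. The TF-IDF token features, the hyperparameters, the synthetic URLs, and the helper names are assumptions for illustration; the project's actual feature extraction may differ. XGBoost and LightGBM are imported lazily, so the sketch still runs with only scikit-learn installed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def build_models(seed=42):
    """Return the classifiers to compare; optional libraries are skipped if absent."""
    models = {"Random Forest": RandomForestClassifier(n_estimators=50, random_state=seed)}
    try:
        from xgboost import XGBClassifier
        models["XGBoost"] = XGBClassifier(n_estimators=50, random_state=seed)
    except ImportError:
        pass
    try:
        from lightgbm import LGBMClassifier
        models["LightGBM"] = LGBMClassifier(n_estimators=50, random_state=seed, verbose=-1)
    except ImportError:
        pass
    return models

def evaluate_models(urls, labels, seed=42):
    """Train each model on TF-IDF token features and return test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        urls, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    results = {}
    for name, model in build_models(seed).items():
        # Tokenize URLs into lowercase alphanumeric runs before weighting
        pipeline = make_pipeline(TfidfVectorizer(token_pattern=r"[a-z0-9]+"), model)
        pipeline.fit(X_train, y_train)
        results[name] = accuracy_score(y_test, pipeline.predict(X_test))
        print(f"{name}: accuracy = {results[name]:.3f}")
    return results

# Synthetic stand-in data: 0 = phishing, 1 = malware (made up for this sketch)
phishing = [f"http://secure-login{i}.example.com/verify/account" for i in range(100)]
malware = [f"http://files{i}.example.org/download/setup.exe" for i in range(100)]
results = evaluate_models(phishing + malware, [0] * 100 + [1] * 100)
```

Accuracy on synthetic data like this says nothing about real-world performance; it only shows the shape of the comparison.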

Running the Models

To train and evaluate the models, follow these steps:

Clone this repository:

git clone https://github.com/your-username/malware-phishing-detection.git
cd malware-phishing-detection

Install the required dependencies using pip:
```
pip install -r requirements.txt
```
Run the Jupyter Notebook or Python script to train and evaluate the models:
```
jupyter notebook malware_detection.ipynb
```
or
```
python malware_detection.py
```
View each model's accuracy and performance on the dataset.

Testing the Model

After training, you can test the models with your own URLs to determine if they are classified as malware or phishing.
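
Scoring your own URLs could look like the sketch below. The helper `classify_urls` is hypothetical, and a small logistic regression pipeline trained on made-up URLs stands in for a trained model here; it is not one of the project's models. Any fitted scikit-learn style pipeline that accepts raw URL strings and exposes `predict_proba` could be passed instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def classify_urls(model, urls):
    """Return (url, predicted_label, confidence) for each URL."""
    probabilities = model.predict_proba(urls)
    results = []
    for url, row in zip(urls, probabilities):
        best = row.argmax()  # index of the most probable class
        results.append((url, str(model.classes_[best]), float(row[best])))
    return results

# Stand-in model trained on made-up URLs (an assumption for this sketch)
stand_in = make_pipeline(
    TfidfVectorizer(token_pattern=r"[a-z0-9]+"),
    LogisticRegression(),
)
stand_in.fit(
    [
        "http://secure-login.example.com/verify/account",
        "http://login.example.net/account/update",
        "http://files.example.org/download/setup.exe",
        "http://cdn.example.org/download/patch.exe",
    ],
    ["phishing", "phishing", "malware", "malware"],
)

predictions = classify_urls(stand_in, ["http://login.example.net/verify"])
for url, label, confidence in predictions:
    print(f"{url} -> {label} ({confidence:.2f})")
```

Reporting the class probability alongside the label helps flag borderline URLs that deserve a manual look.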

Contributing

Contributions are welcome. If you'd like to contribute to this project, please open an issue or create a pull request.

ishanaudichya/malware-link-ml