This project uses machine learning to classify URLs as malware or phishing, using the Malware URL dataset from Kaggle. We use several Python libraries, including scikit-learn, XGBoost, LightGBM, and Seaborn, for data preprocessing, token analysis, and model training. Finally, we evaluate each model and report its accuracy.
The first step is to clean and preprocess the dataset. This involves handling missing values, removing duplicates, and converting the data into a format suitable for machine learning.
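The cleaning steps described above could look roughly like the following sketch. It assumes pandas is available and that the data has a `url` column and a `label` column; both column names and the sample rows are illustrative assumptions, not taken from the Kaggle dataset.

```python
import pandas as pd


def clean_urls(df: pd.DataFrame) -> pd.DataFrame:
    """Drop missing and duplicate rows, then encode labels as integers.

    The column names `url` and `label` are assumptions for this sketch.
    """
    out = df.dropna(subset=["url", "label"]).copy()
    # Strip whitespace so trivially different copies of a URL collapse.
    out["url"] = out["url"].str.strip()
    out = out[out["url"] != ""]
    out = out.drop_duplicates(subset="url").reset_index(drop=True)
    # Map each class name to an integer code (classes sorted alphabetically).
    classes = sorted(out["label"].unique())
    out["target"] = out["label"].map({name: i for i, name in enumerate(classes)})
    return out


# Made-up rows for illustration only.
raw = pd.DataFrame(
    {
        "url": [
            "http://a.example/login",
            "http://a.example/login ",
            None,
            "http://b.example/x.exe",
        ],
        "label": ["phishing", "phishing", "malware", "malware"],
    }
)
cleaned = clean_urls(raw)
print(cleaned)
```

Keeping the original `label` column next to the numeric `target` makes it easy to map predictions back to class names later.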
We analyze the dataset to identify common tokens in the URLs. This analysis can provide insights into the characteristics of malware and phishing URLs.
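As a minimal sketch of this kind of token analysis, the snippet below splits URLs on non-alphanumeric characters and counts the pieces using only the standard library. The splitting rule, the `stop_tokens` set, and the sample URLs are assumptions made for illustration; the project's actual tokenization may differ.

```python
import re
from collections import Counter

# Anything that is not a letter or digit is treated as a separator.
TOKEN_SPLIT = re.compile(r"[^A-Za-z0-9]+")


def tokenize_url(url: str) -> list[str]:
    """Split a URL on non-alphanumeric characters and lowercase the pieces."""
    return [tok.lower() for tok in TOKEN_SPLIT.split(url) if tok]


def common_tokens(urls, top_n=5, stop_tokens=frozenset({"http", "https", "www", "com"})):
    """Count tokens across all URLs, skipping uninformative ones."""
    counts = Counter()
    for url in urls:
        counts.update(t for t in tokenize_url(url) if t not in stop_tokens)
    return counts.most_common(top_n)


# Made-up URLs for illustration only.
sample_urls = [
    "http://secure-login.example.com/account/verify",
    "https://example.org/login/update.php",
    "http://files.example.net/download/setup.exe",
]
print(common_tokens(sample_urls, top_n=2))
```

Running the same count separately per class is one way to see which tokens are more typical of malware URLs than of phishing URLs.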
We visualize the common tokens using a word cloud to gain a better understanding of the data. Word clouds provide a visual representation of the most frequent words or tokens in the dataset.
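One possible way to produce such a word cloud is sketched below. It assumes the third-party `wordcloud` package, which is not named in this README and is therefore an assumption; the helper names, image size, and output path are hypothetical as well.

```python
from collections import Counter


def token_frequencies(tokens, max_tokens=100):
    """Return a {token: count} dict limited to the most frequent tokens."""
    return dict(Counter(tokens).most_common(max_tokens))


def render_word_cloud(frequencies, out_path="url_tokens.png"):
    """Draw the frequencies as a word cloud image and save it to out_path."""
    # Imported lazily so the frequency helper works without the optional package.
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
    return out_path


# Made-up tokens for illustration only.
freqs = token_frequencies(["login", "verify", "login", "account", "login", "verify"])
print(freqs)
# render_word_cloud(freqs)  # uncomment once the wordcloud package is installed
```

Passing precomputed frequencies keeps the tokenization under your control, instead of relying on the library's own text splitting.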
We train several machine learning models to classify URLs, including:
- XGBoost
- LightGBM
- Random Forest Classifier
Each model's accuracy is printed during the run, so you can compare their performance.
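The following is a minimal sketch of training one of these models and printing its accuracy, shown here with the Random Forest Classifier from scikit-learn. The TF-IDF token features, the train/test split, and the tiny synthetic sample are assumptions for illustration and are not the project's actual pipeline or data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny synthetic sample for illustration only; not taken from the Kaggle dataset.
urls = [
    "http://files.example.net/download/setup.exe",
    "http://cdn.example.org/update/flash_player.exe",
    "http://mirror.example.com/free/crack.zip",
    "http://dl.example.net/install/codec.exe",
    "http://static.example.org/download/patch.exe",
    "http://get.example.com/tools/keygen.zip",
    "http://secure-login.example.com/account/verify",
    "http://example.org/bank/login/update.php",
    "http://signin.example.net/account/confirm.html",
    "http://verify.example.com/paypal/login.php",
    "http://account.example.org/secure/signin.html",
    "http://update.example.net/bank/verify.php",
]
labels = ["malware"] * 6 + ["phishing"] * 6

# Hold out a quarter of the data, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    urls, labels, test_size=0.25, stratify=labels, random_state=42
)

# Assumed feature choice: TF-IDF over alphanumeric URL tokens.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"[A-Za-z0-9]+"),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Random Forest accuracy: {accuracy:.3f}")
```

`XGBClassifier` and `LGBMClassifier` expose the same `fit`/`predict` interface, so they could be swapped into the pipeline in place of the Random Forest; note that `XGBClassifier` generally expects integer-encoded labels rather than strings.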
To train and evaluate the models, follow these steps:
- Clone this repository and change into its directory: `git clone https://github.com/your-username/malware-phishing-detection.git`, then `cd malware-phishing-detection`
- Install the required dependencies using pip: `pip install -r requirements.txt`
- Run the Jupyter Notebook or the Python script to train and evaluate the models: `jupyter notebook malware_detection.ipynb` or `python malware_detection.py`
- View the model's accuracy and performance on the dataset.
After training, you can test the models with your own URLs to determine if they are classified as malware or phishing.
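Checking your own URLs against a trained model might look like the sketch below. The `classify_urls` helper is hypothetical, and the small pipeline is fitted on made-up data only so the example runs on its own; in practice you would pass in the model trained by the notebook or script.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def classify_urls(model, urls):
    """Return (url, predicted_label, confidence) for each URL."""
    probabilities = model.predict_proba(urls)
    results = []
    for url, row in zip(urls, probabilities):
        best = row.argmax()  # index of the most probable class
        results.append((url, model.classes_[best], float(row[best])))
    return results


# Stand-in model fitted on made-up URLs; replace with your trained model.
train_urls = [
    "http://files.example.net/download/setup.exe",
    "http://mirror.example.com/free/crack.zip",
    "http://dl.example.net/install/codec.exe",
    "http://secure-login.example.com/account/verify",
    "http://example.org/bank/login/update.php",
    "http://signin.example.net/account/confirm.html",
]
train_labels = ["malware"] * 3 + ["phishing"] * 3

model = make_pipeline(
    TfidfVectorizer(token_pattern=r"[A-Za-z0-9]+"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(train_urls, train_labels)

my_urls = [
    "http://example.com/download/free_setup.exe",
    "http://example.net/account/login/verify.php",
]
results = classify_urls(model, my_urls)
for url, label, confidence in results:
    print(f"{label:<9} {confidence:.2f}  {url}")
```

The reported confidence is the model's class probability, which is only as reliable as the training data, so treat low values as a sign that the URL looks unlike anything the model has seen.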
Contributions are welcome. If you'd like to contribute to this project, please open an issue or create a pull request.