Implementation of a Machine Learning Pipeline for Exploitability Prediction

This repository contains the datasets and scripts used to build the machine learning pipeline for my Bachelor's thesis project.

Requirements

  • Python 3 (the project was developed and tested with 3.11.2)

  • BeautifulSoup 4

  • NumPy

  • Pandas

  • Pickle (part of the Python standard library; no separate installation needed)

  • Requests

  • scikit-learn

  • Selenium

  • Tabulate

Repository structure

  • data/ contains all the datasets structured in the following subfolders:

    • exploitdb/ contains the final outputs of the scripts responsible for data mining from Exploit Database

    • nvd/ contains the raw JSON dump obtainable from NVD and the final outputs of the scripts responsible for data mining from this JSON and circl.lu / NVD APIs

    • merged/ contains the output of the merging of the two datasets

    • final/ contains the dataset the ML pipeline is going to use

  • scripts/ contains all the scripts, structured in the following subfolders:

    • exploitdb/ contains all the scripts interfacing with Exploit Database.

      • scraper_multithreaded.py - web scraping from Exploit Database, with multithreading support for faster collection.

      • scraper.py - first implementation of the web scraper, without multithreading support

      • dataframe.py - cleans and restructures the dataset obtained from the scraper, returning the final Exploit Database dataset.
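The parsing step of the scraper can be sketched as follows with BeautifulSoup. The HTML fragment and field names below are purely illustrative (the real Exploit Database pages use a different structure); a real scraper would fetch the pages with Requests, optionally through a `concurrent.futures.ThreadPoolExecutor` for the multithreaded variant, before parsing them.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for an Exploit Database entry
# page; the real markup differs, this only illustrates the approach.
SAMPLE_HTML = """
<div class="card">
  <h1>Example App 1.0 - Remote Code Execution</h1>
  <a href="https://nvd.nist.gov/vuln/detail/CVE-2021-0001">CVE-2021-0001</a>
</div>
"""

def parse_exploit(html: str) -> dict:
    """Extract the exploit title and linked CVE id from one entry page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    cve_link = soup.find("a")
    cve = cve_link.get_text(strip=True) if cve_link else None
    return {"title": title, "cve": cve}

print(parse_exploit(SAMPLE_HTML))
```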

    • nvd/ contains all the scripts interfacing with NVD and circl.lu.

      • parser_circl.py - collects data from circl.lu APIs for every single CVE available in the raw dump and returns an output dataset

      • parser_nvd.py - collects data from NVD APIs for every single CVE available in the raw dump and returns an output dataset

      • converter_circl.py - converts the output JSON to a CSV

      • converter_nvd.py - converts the output JSON to a CSV
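The JSON-to-CSV conversion performed by the converter scripts can be sketched with the standard library alone. The record shape below is a hypothetical simplification; real NVD and circl.lu responses carry many more fields.

```python
import csv
import json

# Hypothetical minimal records; real API responses are much richer.
raw = json.loads(
    '[{"id": "CVE-2021-0001", "cvss": 7.5},'
    ' {"id": "CVE-2021-0002", "cvss": 4.3}]'
)

# Write one CSV row per CVE record, with a header row.
with open("cves.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "cvss"])
    writer.writeheader()
    writer.writerows(raw)
```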

    • merge/ contains all the scripts related to the merging of the datasets.

      • positives_count.py - returns the number of rows where exploitable (our target variable) is true

      • merger.py - merges the datasets obtained from Exploit Database and NVD/circl.lu to obtain the final dataset the ML pipeline is going to run on.

      • metrics_are_na.py - used for data cleaning, returns the number of rows where the CVSS metrics are NA.
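The merge step and the two diagnostic counts can be sketched with Pandas as below. The column names (`cve`, `exploitable`, `cvss`) are illustrative assumptions, not the thesis schema; the key idea is a left join on the CVE identifier, treating CVEs absent from Exploit Database as non-exploitable.

```python
import pandas as pd

# Toy frames standing in for the Exploit Database and NVD/circl.lu
# datasets; column names here are assumptions for illustration only.
edb = pd.DataFrame({"cve": ["CVE-2021-0001"], "exploitable": [True]})
nvd = pd.DataFrame({"cve": ["CVE-2021-0001", "CVE-2021-0002"],
                    "cvss": [7.5, None]})

# Left join on the CVE id; CVEs with no exploit entry become False.
merged = nvd.merge(edb, on="cve", how="left")
merged["exploitable"] = merged["exploitable"].fillna(False)

positives = int(merged["exploitable"].sum())   # rows labelled exploitable
metrics_na = int(merged["cvss"].isna().sum())  # rows with missing CVSS

print(positives, metrics_na)
```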

    • ml_pipeline/ contains all the scripts related to the actual machine learning pipeline and its configuration steps and metrics collection.

      • ml_pipeline.py - script that runs the ML pipeline.

      • models.py - includes functions for hyperparameter tuning, baseline scoring, and sampler scoring.
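A minimal sketch of such a pipeline with scikit-learn is shown below, combining preprocessing, a classifier, and grid-search hyperparameter tuning. The synthetic data, the model choice, and the tiny parameter grid are all assumptions for illustration; the actual pipeline trains on the dataset in data/final/.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the final dataset: numeric features and a
# binary "exploitable" target. The real features come from data/final/.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + 0.2 * rng.standard_normal(200) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features, then fit a classifier; the estimator here is an
# illustrative choice, not necessarily the one used in the thesis.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", RandomForestClassifier(random_state=0))])

# Small illustrative grid for hyperparameter tuning via cross-validation.
search = GridSearchCV(pipe, {"clf__n_estimators": [50, 100]}, cv=3)
search.fit(X_train, y_train)
print(round(search.score(X_test, y_test), 2))
```

Class imbalance is typical in exploitability data (few exploited CVEs among many), which is presumably why models.py also scores samplers; libraries such as imbalanced-learn provide resampling steps that plug into the same pipeline interface.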