Implementation of a Machine Learning Pipeline for Exploitability Prediction

This repository contains the datasets and scripts used to build the machine learning pipeline for my Bachelor's thesis project.

Requirements

  • Python 3 (the project was developed and tested with 3.11.2)

  • BeautifulSoup 4

  • NumPy

  • Pandas

  • Pickle (part of the Python standard library; no separate installation needed)

  • Requests

  • scikit-learn

  • Selenium

  • Tabulate

Repository structure

  • data/ contains all the datasets structured in the following subfolders:

    • exploitdb/ contains the final outputs of the scripts responsible for data mining from Exploit Database

    • nvd/ contains the raw JSON dump obtainable from NVD and the final outputs of the scripts responsible for data mining from this JSON and circl.lu / NVD APIs

    • merged/ contains the output of the merging of the two datasets

    • final/ contains the dataset the ML pipeline is going to use

  • scripts/ contains all the scripts, structured in the following subfolders:

    • exploitdb/ contains all the scripts interfacing with Exploit Database.

      • scraper_multithreaded.py - web scraping from Exploit Database, with multithreading support for faster collection.

      • scraper.py - first implementation of the web scraper, without multithreading support

      • dataframe.py - cleans and restructures the dataset obtained from the scraper, returning the final Exploit Database dataset.
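The parsing step of the scraper can be sketched as follows with BeautifulSoup. The HTML fragment and field names below are purely illustrative (the real Exploit Database pages use a different structure); a real scraper would fetch the pages with Requests, optionally through a `concurrent.futures.ThreadPoolExecutor` for the multithreaded variant, before parsing them.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for an Exploit Database entry
# page; the real markup differs, this only illustrates the approach.
SAMPLE_HTML = """
<div class="card">
  <h1>Example App 1.0 - Remote Code Execution</h1>
  <a href="https://nvd.nist.gov/vuln/detail/CVE-2021-0001">CVE-2021-0001</a>
</div>
"""

def parse_exploit(html: str) -> dict:
    """Extract the exploit title and linked CVE id from one entry page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    cve_link = soup.find("a")
    cve = cve_link.get_text(strip=True) if cve_link else None
    return {"title": title, "cve": cve}

print(parse_exploit(SAMPLE_HTML))
```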

    • nvd/ contains all the scripts interfacing with NVD and circl.lu.

      • parser_circl.py - collects data from circl.lu APIs for every single CVE available in the raw dump and returns an output dataset

      • parser_nvd.py - collects data from NVD APIs for every single CVE available in the raw dump and returns an output dataset

      • converter_circl.py - converts the output JSON to a CSV

      • converter_nvd.py - converts the output JSON to a CSV
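The JSON-to-CSV conversion performed by the converter scripts can be sketched with the standard library alone. The record shape below is a hypothetical simplification; real NVD and circl.lu responses carry many more fields.

```python
import csv
import json

# Hypothetical minimal records; real API responses are much richer.
raw = json.loads(
    '[{"id": "CVE-2021-0001", "cvss": 7.5},'
    ' {"id": "CVE-2021-0002", "cvss": 4.3}]'
)

# Write one CSV row per CVE record, with a header row.
with open("cves.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "cvss"])
    writer.writeheader()
    writer.writerows(raw)
```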

    • merge/ contains all the scripts related to the merging of the datasets.

      • positives_count.py - returns the number of rows where exploitable (our target variable) is true

      • merger.py - merges the datasets obtained from Exploit Database and NVD/circl.lu to obtain the final dataset the ML pipeline is going to run on.

      • metrics_are_na.py - used for data cleaning, returns the number of rows where the CVSS metrics are NA.
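The merge step and the two diagnostic counts can be sketched with Pandas as below. The column names (`cve`, `exploitable`, `cvss`) are illustrative assumptions, not the thesis schema; the key idea is a left join on the CVE identifier, treating CVEs absent from Exploit Database as non-exploitable.

```python
import pandas as pd

# Toy frames standing in for the Exploit Database and NVD/circl.lu
# datasets; column names here are assumptions for illustration only.
edb = pd.DataFrame({"cve": ["CVE-2021-0001"], "exploitable": [True]})
nvd = pd.DataFrame({"cve": ["CVE-2021-0001", "CVE-2021-0002"],
                    "cvss": [7.5, None]})

# Left join on the CVE id; CVEs with no exploit entry become False.
merged = nvd.merge(edb, on="cve", how="left")
merged["exploitable"] = merged["exploitable"].fillna(False)

positives = int(merged["exploitable"].sum())   # rows labelled exploitable
metrics_na = int(merged["cvss"].isna().sum())  # rows with missing CVSS

print(positives, metrics_na)
```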

    • ml_pipeline/ contains all the scripts related to the actual machine learning pipeline and its configuration steps and metrics collection.

      • ml_pipeline.py - script that runs the ML pipeline.

      • models.py - includes functions for hyperparameter tuning, baseline scoring, and sampler scoring.
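A minimal sketch of such a pipeline with scikit-learn is shown below, combining preprocessing, a classifier, and grid-search hyperparameter tuning. The synthetic data, the model choice, and the tiny parameter grid are all assumptions for illustration; the actual pipeline trains on the dataset in data/final/.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the final dataset: numeric features and a
# binary "exploitable" target. The real features come from data/final/.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + 0.2 * rng.standard_normal(200) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features, then fit a classifier; the estimator here is an
# illustrative choice, not necessarily the one used in the thesis.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", RandomForestClassifier(random_state=0))])

# Small illustrative grid for hyperparameter tuning via cross-validation.
search = GridSearchCV(pipe, {"clf__n_estimators": [50, 100]}, cv=3)
search.fit(X_train, y_train)
print(round(search.score(X_test, y_test), 2))
```

Class imbalance is typical in exploitability data (few exploited CVEs among many), which is presumably why models.py also scores samplers; libraries such as imbalanced-learn provide resampling steps that plug into the same pipeline interface.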