CVE-to-LLAMA

Scrape CVEs into a trainable dataset for LLMs.

This repository contains a Jupyter Notebook that automates the process of converting Common Vulnerabilities and Exposures (CVE) data from the National Vulnerability Database (NVD) into a format suitable for training language models like LLAMA.

Google Colab link: CVE_to_LLAMA.ipynb

Description

The notebook performs the following tasks:

  1. Downloads CVE data from the NVD JSON feeds for multiple years.
  2. Processes the downloaded JSON files and extracts relevant fields such as the CVE ID, description, references, and exploit code when one is available (see the sketch after this list).
  3. Maps CVE IDs to OSVDB IDs using a mapping file obtained from the MITRE Corporation.
  4. Creates a Parquet file (data.parquet) containing the extracted data.
  5. Converts the Parquet file into a JSONL file (train.jsonl) for training language models.
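
For orientation, a minimal sketch of steps 1 and 2, assuming the notebook pulls the year-based NVD 1.1 JSON feeds; the URL pattern and schema keys below follow that legacy feed format, which NVD has since retired in favor of its REST API, so they may not match the notebook exactly:

import gzip
import json

import requests

def fetch_year_feed(year):
    # Download and decompress one year's NVD 1.1 JSON feed.
    url = f"https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-{year}.json.gz"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    return json.loads(gzip.decompress(resp.content))

def extract_records(feed):
    # Yield the CVE ID, English description, and reference URLs for each entry.
    for item in feed.get("CVE_Items", []):
        cve = item["cve"]
        descriptions = cve["description"]["description_data"]
        yield {
            "cve_id": cve["CVE_data_meta"]["ID"],
            "description": next((d["value"] for d in descriptions if d.get("lang") == "en"), ""),
            "references": [r["url"] for r in cve["references"]["reference_data"]],
        }

records = [rec for year in range(2020, 2024) for rec in extract_records(fetch_year_feed(year))]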

Output Files

The following output files are generated:

  1. data.parquet (Parquet file): This file contains the extracted data in a columnar format. The columns include CVE ID, Description, References, ExploitDB URL, OSVDB ID, and Prompt.

  2. train.jsonl (JSON Lines file): This file is suitable for training language models like LLAMA. Each line is a JSON object with the following structure (a conversion sketch appears after this list):

{
  "instruction": "Explain <CVE_ID> and provide an exploit if one is available",
  "input": "Explain <CVE_ID>",
  "output": "<Description>\nReferences: <References>\n[Exploit code if available]"
}

  3. output_llama.jsonl (JSON Lines file): An alternative training format for models like LLAMA. It contains the same information as train.jsonl but in a slightly different structure.
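
A minimal sketch of the conversion from data.parquet to train.jsonl, assuming the Parquet columns are named exactly as listed above ("CVE ID", "Description", "References"); adjust the keys if the notebook writes different headers, and note that the optional exploit-code suffix in the output field is omitted here for brevity:

import json

import pandas as pd

df = pd.read_parquet("data.parquet")

with open("train.jsonl", "w", encoding="utf-8") as fh:
    for row in df.to_dict(orient="records"):
        cve_id = row["CVE ID"]  # assumed column name, per the list above
        # Mirror the instruction/input/output structure shown for train.jsonl.
        record = {
            "instruction": f"Explain {cve_id} and provide an exploit if one is available",
            "input": f"Explain {cve_id}",
            "output": f'{row["Description"]}\nReferences: {row["References"]}',
        }
        fh.write(json.dumps(record) + "\n")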

Usage

  1. Clone the repository or download the CVE_to_LLAMA.ipynb file.
  2. Open the notebook in Google Colab or a local Jupyter environment.
  3. Run the notebook cells sequentially to execute the data processing and conversion steps.
  4. The output files (data.parquet, train.jsonl, and output_llama.jsonl) are generated in the same directory as the notebook.

Note: The notebook assumes that the exploitdb folder and osvdb_cve_mapping.csv file are present in the same directory. These files are used to map CVE IDs to exploit code and OSVDB IDs, respectively.
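
For orientation, a minimal sketch of how those local files might be used; the column names in osvdb_cve_mapping.csv and the layout of the exploitdb folder are assumptions, since their exact structure is not documented here:

import csv
from pathlib import Path

def load_osvdb_mapping(path="osvdb_cve_mapping.csv"):
    # Assumed columns: "cve_id" and "osvdb_id"; adjust to match the actual file.
    mapping = {}
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            mapping.setdefault(row["cve_id"], []).append(row["osvdb_id"])
    return mapping

def read_exploit(relative_path, root="exploitdb"):
    # Return the exploit source text if the file exists in the local exploitdb
    # checkout, otherwise None.
    candidate = Path(root) / relative_path
    return candidate.read_text(errors="replace") if candidate.is_file() else None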