This repository contains a Jupyter Notebook that converts Common Vulnerabilities and Exposures (CVE) data from the National Vulnerability Database (NVD) into a format suitable for training language models like LLAMA.
Google Colab Link: CVE_to_LLAMA.ipynb (Open in Google Colab)
The notebook performs the following tasks:
- Downloads CVE data from the NVD JSON feeds for multiple years.
- Processes the downloaded JSON files and extracts relevant information such as CVE ID, description, references, and exploit code (if available).
- Maps CVE IDs to OSVDB IDs using a mapping file obtained from the MITRE Corporation.
- Creates a Parquet file (data.parquet) containing the extracted data.
- Converts the Parquet file into a JSONL file (train.jsonl) for training language models.
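The extraction step can be sketched as follows. This is a minimal illustration, not the notebook's own code: it assumes the downloaded files follow the NVD JSON 1.1 feed layout (a top-level `CVE_Items` list, with the ID under `cve.CVE_data_meta.ID`). The function names `extract_cve_fields` and `extract_feed` are hypothetical, and the sample entry uses placeholder values rather than a real CVE.

```python
def extract_cve_fields(item):
    """Pull the ID, English description, and reference URLs from one feed entry.

    Assumes the NVD JSON 1.1 feed layout; adjust the keys if the feed differs.
    """
    cve = item.get("cve", {})
    cve_id = cve.get("CVE_data_meta", {}).get("ID", "")

    # Take the first English description, or an empty string if none exists.
    descriptions = cve.get("description", {}).get("description_data", [])
    description = next(
        (d.get("value", "") for d in descriptions if d.get("lang") == "en"), ""
    )

    # Collect every reference URL that is present.
    reference_data = cve.get("references", {}).get("reference_data", [])
    references = [r["url"] for r in reference_data if "url" in r]

    return {"CVE ID": cve_id, "Description": description, "References": references}


def extract_feed(feed):
    """Apply extract_cve_fields to every entry of one parsed feed file."""
    return [extract_cve_fields(item) for item in feed.get("CVE_Items", [])]


# In-memory stand-in for one parsed feed file (placeholder data, not a real CVE).
sample_feed = {
    "CVE_Items": [
        {
            "cve": {
                "CVE_data_meta": {"ID": "CVE-0000-0000"},
                "description": {
                    "description_data": [
                        {"lang": "en", "value": "Placeholder description."}
                    ]
                },
                "references": {
                    "reference_data": [{"url": "https://example.com/advisory"}]
                },
            }
        }
    ]
}

rows = extract_feed(sample_feed)
print(rows)
```

In practice each year's feed would be loaded with `json.load` and passed to `extract_feed`, and the resulting rows collected into the table that becomes data.parquet.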
The following output files are generated:
- data.parquet (Parquet file): This file contains the extracted data in a columnar format. The columns include CVE ID, Description, References, ExploitDB URL, OSVDB ID, and Prompt.
- train.jsonl (JSON Lines file): This file is suitable for training language models like LLAMA. Each line in the file represents a JSON object with the following structure:
{
"instruction": "Explain <CVE_ID> and provide an exploit if one is available",
"input": "Explain <CVE_ID>",
"output": "<Description>\nReferences: <References>\n[Exploit code if available]"
}
- output_llama.jsonl (JSON Lines file): This file is an alternative format for training language models like LLAMA. It contains the same information as train.jsonl but in a slightly different structure.
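The train.jsonl record structure shown above can be sketched in a few lines of Python. This is an illustration under assumptions, not the notebook's implementation: the helper names `build_record` and `to_jsonl` are hypothetical, and joining the references with a comma is a guess at how the `<References>` placeholder is filled.

```python
import json


def build_record(cve_id, description, references, exploit_code=None):
    """Build one training record in the instruction/input/output layout."""
    # Assumption: references are joined with ", "; the notebook may format them differently.
    output = f"{description}\nReferences: {', '.join(references)}"
    if exploit_code:
        # Append exploit code only when it is available for this CVE.
        output += f"\n{exploit_code}"
    return {
        "instruction": f"Explain {cve_id} and provide an exploit if one is available",
        "input": f"Explain {cve_id}",
        "output": output,
    }


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


# Placeholder values, not a real CVE.
record = build_record(
    "CVE-0000-0000",
    "Placeholder description.",
    ["https://example.com/advisory"],
)
jsonl_text = to_jsonl([record])
print(jsonl_text, end="")
```

Writing `jsonl_text` to a file named train.jsonl would give a file in which every line parses independently with `json.loads`, which is the property most fine-tuning loaders rely on.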
- Clone the repository or download the CVE_to_LLAMA.ipynb file.
- Open the notebook in Google Colab or a local Jupyter environment.
- Run the notebook cells sequentially to execute the data processing and conversion steps.

The output files (data.parquet, train.jsonl, and output_llama.jsonl) will be generated in the same directory as the notebook.
Note: The notebook assumes that the exploitdb folder and osvdb_cve_mapping.csv file are present in the same directory. These files are used to map CVE IDs to exploit code and OSVDB IDs, respectively.
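Before running the notebook, it may help to confirm that both inputs are in place. The check below is a small sketch and not part of the notebook; the function name `missing_inputs` is hypothetical, and the demonstration runs against a temporary directory so it does not touch real files.

```python
import tempfile
from pathlib import Path


def missing_inputs(base_dir="."):
    """Return the names of required inputs that are absent from base_dir."""
    base = Path(base_dir)
    missing = []
    if not (base / "exploitdb").is_dir():
        missing.append("exploitdb")
    if not (base / "osvdb_cve_mapping.csv").is_file():
        missing.append("osvdb_cve_mapping.csv")
    return missing


# Demonstration in a scratch directory: first both inputs are missing,
# then both are created and the check comes back empty.
with tempfile.TemporaryDirectory() as scratch:
    before = missing_inputs(scratch)
    (Path(scratch) / "exploitdb").mkdir()
    (Path(scratch) / "osvdb_cve_mapping.csv").write_text("")
    after = missing_inputs(scratch)

print("missing before:", before)
print("missing after:", after)
```

Calling `missing_inputs()` with no argument checks the current working directory, which is the notebook's directory in a typical Jupyter session.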