FormAI Dataset

Description

FormAI Dataset is a large collection of AI-generated compilable and independent C programs with vulnerability classification. Currently, there are two versions available: FormAI-v1, consisting of 112,000 compilable C programs generated by GPT-3.5-turbo, and FormAI-v2, recently released, which includes 265,000 compilable C programs generated utilizing various LLMs such as Google's GEMINI-pro, OpenAI's GPT-4, TII's 180 billion-parameter Falcon, CodeLLama2, and other compact models. These programs generated using dynamic zero-shot prompting technique and comprises programs with varying levels of complexity. Some programs handle complicated tasks such as network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Each program is labelled based on vulnerabilities present in the code using a formal verification method based on the Efficient SMT-based Bounded Model Checker (ESBMC). This strategy conclusively identifies vulnerabilities without reporting false positives (due to the presence of counter examples), or false negatives (up to a certain bound). The labeled samples can be utilized to train Large Language Models (LLMs) since they contain the exact program location of the software vulnerability.

Cite

FormAI-v1:

The original FromAI-v1 paper with 112,000 samples was presented at the 19th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2023) on December 8, 2023, in San Francisco, CA, USA.

Cite paper (ACM digital library): https://dl.acm.org/doi/10.1145/3617555.3617874

@inproceedings{10.1145/3617555.3617874,
author = {Tihanyi, Norbert and Bisztray, Tamas and Jain, Ridhi and Ferrag, Mohamed Amine and Cordeiro, Lucas C. and Mavroeidis, Vasileios},
title = {The FormAI Dataset: Generative AI in Software Security through the Lens of Formal Verification},
year = {2023},
isbn = {9798400703751},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3617555.3617874},
doi = {10.1145/3617555.3617874},
booktitle = {Proceedings of the 19th International Conference on Predictive Models and Data Analytics in Software Engineering},
pages = {33–43},
numpages = {11},
keywords = {Artificial Intelligence, Dataset, Formal Verification, Large Language Models, Software Security, Vulnerability Classification},
location = {<conf-loc>, <city>San Francisco</city>, <state>CA</state>, <country>USA</country>, </conf-loc>},
series = {PROMISE 2023}
}

FormAI-v2:

The FormAI-v2 with 265,000 C samples just released.

Cite paper (arXiv): https://arxiv.org/abs/2404.18353

Architecture

The design and methodology behind the creation of the FormAI-v2 dataset are depicted in the subsequent figure.

Guidelines for Getting Started with FormAI-v2

The FormAI-v2 dataset is larger and includes more vulnerabilities compared to FormAI-v1. Moreover, it includes AI-generated C code not only from GPT-3.5 as in FormAI-v1 but from eight different models, such as CodeLLAM2, GEMINI-pro, GPT4, and Falcon-180B. Therefore, we encourage the research community to switch from FormAI-v1 to FormAI-v2. Furthermore, FormAI-v2 includes additional metrics that can be utilized in machine learning classification, such as cyclomatic complexity, which helps in understanding the code complexities of each generated program. The new dataset uses the more comprehensive JSON format, unlike FormAI-v1 which used CSV. Here is an example of how a JSON entry looks in the dataset:

{
   "category": "VULNERABLE",
   "file_name": "gpt35-58832.c",
   "verification_finished": "yes",
   "vulnerable_line": 42,
   "column": 9,
   "function": "sanitize_url",
   "violated_property": "file gpt35-58832.c line 42 column 9 function sanitize_url",
   "stack_trace": "c:@F@sanitize_url at file gpt35-58832.c line 15 .. array bounds violated ..."
   "error_type": "array bounds violated: lower bound",
   "code_snippet": "<... C code snippet ...>",
   "source_code": "<... Full C code  ...>",
   "num_lines": 53,
   "cyclomatic_complexity": 3.0
}

Downloading and using FormAI-v2 dataset

The dataset includes three files:

a. FormAI-v2-DATASET.7z - This file contains all the compilable 265,000 C files.

b. FormAI-v2-classification.7z.001 and FormAI-v2-classification.7z.002, which contain FormAI-v2.json, including the vulnerability classification of the files along with the C code.

The FormAI-v2 dataset is suitable for machine learning training, testing vulnerability detection softwares, and various tasks that require a substantial collection of C files. The validation of dataset compilation was conducted on a system running Ubuntu LTS version 22.04.03 with a Linux kernel version 5.19.0 and gcc 11.4.0

To begin using the FormAI-v2 dataset, follow the instructions below:

Start by obtaining the entire FormAI dataset from Github with the given command:
```
git clone https://github.com/FormAI-Dataset/FormAI-dataset/
```
The following sequence of commands includes an operating system upgrade and the installation of the build-essential package, a crucial requirement for the proper compilation of the dataset.
```
sudo apt update && sudo apt upgrade -y && sudo apt install build-essential -y
```

The subsequent step involves installing all the necessary dependencies to compile all the 265,000 C files present in the dataset.

sudo apt install -y build-essential csvtool unzip 7zip libsqlite3-dev libfftw3-dev libssl-dev libmysqlclient-dev libpq-dev libportaudio2 portaudio19-dev libpcap-dev libqrencode-dev libsdl2-dev freeglut3-dev libcurl4-openssl-dev

Unzip the .C samples from the compressed file (this action will generate a DATASET-v2 directory containing all the C files and a JSON file called FormAI-v2.json):
```
7z x FormAI-dataset/FormAI-v2-DATASET.7z  && 7z x FormAI-dataset/FormAI-v2-classification.7z.001
```
One can verify the successful installation of the dependencies by testing 1000 files from the dataset. This command will verify that the first 1000 files from the dataset compile successfully (this process usually takes around 2-3 minutes).
```
find DATASET-v2 -name "*.c" | head -n 1000 | xargs -I{} bash -c 'gcc {} -w -lcrypto -lfftw3 -pthread -lsqlite3 -lmysqlclient -lpq -lssl -lportaudio -lpcap -lqrencode -lSDL2 -lglut -lGLU -lGL -lcurl -lm &>/dev/null &&  echo {}' | wc -l
```
NOTE: If the output is not equal to 1000, it indicates that there may be missing dependencies or that certain programs fail to compile, particularly in earlier versions of GCC. This situation can occur in Ubuntu 20.04 LTS, where the default GCC version is 9.4.0, which is older than 11.4.0. However, it's important to note that this should not impact the usability of the DATASET.
If the result is 1000 from the previous run, it indicates that all the tested files were compiled without issues. Our system is ready to use the provided files. To retrieve and compile a C code from the JSON file, use the command below (taking the 12345th JSON entry as a example):
```
gcc -x c <(jq -r '.[12345].source_code' FormAI-v2.json) -lcrypto -pthread -lsqlite3 -lmysqlclient -lpq -lssl -lportaudio -lpcap -lqrencode -lSDL2 -lglut -lGLU -lGL -lcurl -lm -o output
```
NOTE: FormAI-v2.json is more than 2.5GB, so extracting C code from JSON this way is not the most efficient method. However, it is suitable for demonstration purposes. For more efficient processing, using Python is recommended.
Each program is categorized according to the vulnerabilities found in its code, using a formal verification technique that leverages the Efficient SMT-based Bounded Model Checker (ESBMC).

ESBMC is a mature, permissively licensed open-source context-bounded model checker for verifying single- and multithreaded C/C++, Kotlin, and Solidity programs. It can automatically verify predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions.

The Github page for ESBMC can be found at the following link: https://github.com/esbmc/esbmc. ESBMC depends on numerous external tools and has a variety of intricate dependencies, from SMT solvers to more complex ones. Nevertheless, there's an option to utilize a pre-compiled binary version, which can be obtained using the command provided below:
```
wget https://github.com/esbmc/esbmc/releases/download/v7.5/ESBMC-Linux.zip && unzip ESBMC-Linux.zip && rm ESBMC-Linux.zip && chmod 777 bin/esbmc
```

If you wish to test ESBMC 7.5 on an individual file from the FormAI-v2 dataset, use the command below (taking falcon180b-1656.c as a reference):

bin/esbmc DATASET-v2/falcon180b-1656.c  --overflow --memory-leak-check --timeout 30 --unwind 1 --multi-property --no-unwinding-assertions

The file appears to be vulnerable, revealing a buffer overflow in the scanf() function. Below is the output from ESBMC:

  Enter a URL: 
  
  State 2 file falcon180b-1656.c line 67 column 5 function main thread 0
  ----------------------------------------------------
  Violated property:
    file falcon180b-1656.c line 67 column 5 function main
    buffer overflow on scanf
    0
  
  
  VERIFICATION FAILED

As a concluding action, you can confirm that this vulnerability is indeed present in the FormAI-v2 dataset by executing the command below:

jq -r '.[] | select(.file_name == "falcon180b-1656.c")' FormAI-v2.json

The outpus is the same as ESBMC found : buffer overflow on line 67

{
  "category": "VULNERABLE",
  "file_name": "falcon180b-1656.c",
  "verification_finished": "yes",
  "vulnerable_line": 67,
  "column": 5,
  "function": "main",
  "violated_property": "\n  file falcon180b-1656.c line 67 column 5 function main\n",
  "stack_trace": "\n  c:@F@main\n  buffer overflow on scanf\n  0",
  "error_type": "buffer overflow on scanf",
  "code_snippet": "}\n\nint main() {\n    char url[MAX_URL_LENGTH];\n    printf(\"Enter a URL: \");\n    scanf(\"%s\", url);\n    sanitize_url(url);\n    printf(\"Sanitized URL: %s\\n\", url);\n    return 0;\n}",
  "source_code": "//Falcon-180B DATASET v1.0 Category: URL Sanitizer ; Style: puzzling\n#include <stdio.h>\n#include <stdlib.h>\n#include <string.h>\n#include <ctype.h>\n\n#define MAX_URL_LENGTH 2048\n\nint is_valid_url_char(char c) {\n    if (isalnum(c) || c == '-') {\n        return 1;\n    }\n    switch(c) {\n        case '.':\n        case '_':\n        case '~':\n        case '+':\n        case '=':\n        case '%':\n        case '?':\n        case '#':\n        case '&':\n        case '/':\n        case ':':\n        case ';':\n        case ',':\n        case '@':\n        case '!':\n        case '$':\n        case '*':\n        case '(':\n        case ')':\n        case '\"':\n            return 1;\n        default:\n            return 0;\n    }\n}\n\nint is_valid_url(char* url) {\n    int i = 0;\n    while (url[i]!= '\\0' && i < MAX_URL_LENGTH) {\n        if (!is_valid_url_char(url[i])) {\n            return 0;\n        }\n        i++;\n    }\n    return 1;\n}\n\nvoid sanitize_url(char* url) {\n    int i = 0;\n    while (url[i]!= '\\0' && i < MAX_URL_LENGTH) {\n        if (!is_valid_url_char(url[i])) {\n            url[i] = '%';\n            i++;\n        } else {\n            i++;\n        }\n    }\n    url[MAX_URL_LENGTH - 1] = '\\0';\n}\n\nint main() {\n    char url[MAX_URL_LENGTH];\n    printf(\"Enter a URL: \");\n    scanf(\"%s\", url);\n    sanitize_url(url);\n    printf(\"Sanitized URL: %s\\n\", url);\n    return 0;\n}",
  "num_lines": 71,
  "cyclomatic_complexity": 8.0
}

Guidelines for Getting Started with FormAI-v1

All C programs in this dataset were generated by GPT-3.5-turbo. From the entire dataset 109,757 sample files can be compiled using only the command 'gcc -lm'. However, the remaining 3% of samples are more complex, utilizing external libraries such as OpenSSL, sqlite3, and others. After installing the required dependencies, all files should compile without any issues.

The FormAI dataset is suitable for machine learning training, testing vulnerability detection softwares, and various tasks that require a substantial collection of C files.

In this guide, we illustrate the process of classifying the samples in the FormAI dataset and how to extract the C source code from the categorized csv files. This serves as an ideal foundation for those looking to build upon these findings in future research.

The validation of dataset compilation was conducted on a system running Ubuntu LTS version 22.04.03 with a Linux kernel version 5.19.0 and gcc 11.4.0

Start by obtaining the entire FormAI dataset from Github with the given command:
```
git clone https://github.com/FormAI-Dataset/FormAI-dataset/
```
The following sequence of commands includes an operating system upgrade and the installation of the build-essential package, a crucial requirement for the proper compilation of the dataset.
```
sudo apt update && sudo apt upgrade -y && sudo apt install build-essential -y
```

The subsequent step involves installing all the necessary dependencies to compile the 112,000 C files present in the dataset.

sudo apt install -y build-essential csvtool unzip libsqlite3-dev libssl-dev libmysqlclient-dev libpq-dev libportaudio2 portaudio19-dev libpcap-dev libqrencode-dev libsdl2-dev freeglut3-dev libcurl4-openssl-dev

Unzip the .C samples from the compressed file (this action will generate a DATASET directory containing all the files):
```
unzip FormAI-dataset/FormAI_dataset_C_samples-V1.zip && unzip FormAI-dataset/FormAI_dataset_classification-V1.zip
```
One can verify the successful installation of the dependencies by testing 1000 files from the dataset. This command will verify that the first 1000 files from the dataset compile successfully (this process usually takes around 2-3 minutes).
```
find DATASET -name "*.c" | head -n 1000 | xargs -I{} bash -c 'gcc {} -w -lcrypto -pthread -lsqlite3 -lmysqlclient -lpq -lssl -lportaudio -lpcap -lqrencode -lSDL2 -lglut -lGLU -lGL -lcurl -lm &>/dev/null &&  echo {}' | wc -l
```
NOTE: If the output is not equal to 1000, it indicates that there may be missing dependencies or that certain programs fail to compile, particularly in earlier versions of GCC. This situation can occur in Ubuntu 20.04 LTS, where the default GCC version is 9.4.0, which is older than 11.4.0. However, it's important to note that this should not impact the usability of the DATASET.
If the result is 1000 from the previous run, it indicates that all the tested files were compiled without issues. Our system is ready to use the provided files. To retrieve and compile the C code from a specific row in the FormAI_dataset.csv, use the command below (taking row number 3679 as a example):
```
gcc -x c <(csvtool drop 3679 FormAI_dataset.csv | csvtool -t ',' head 1 - | csvtool -t ',' col 3 - | sed 's/^"//; s/"$//' | sed 's/""/"/g') -lcrypto -pthread -lsqlite3 -lmysqlclient -lpq -lssl -lportaudio -lpcap -lqrencode -lSDL2 -lglut -lGLU -lGL -lcurl -lm -o output
```
Each program is categorized according to the vulnerabilities found in its code, using a formal verification technique that leverages the Efficient SMT-based Bounded Model Checker (ESBMC).

ESBMC is a mature, permissively licensed open-source context-bounded model checker for verifying single- and multithreaded C/C++, Kotlin, and Solidity programs. It can automatically verify predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions.

The Github page for ESBMC can be found at the following link: https://github.com/esbmc/esbmc ESBMC depends on numerous external tools and has a variety of intricate dependencies, from SMT solvers to more complex ones. Nevertheless, there's an option to utilize a pre-compiled binary version, which can be obtained using the command provided below:
```
wget https://github.com/esbmc/esbmc/releases/download/v7.3/ESBMC-Linux.zip && unzip ESBMC-Linux.zip && rm ESBMC-Linux.zip && chmod 777 bin/esbmc
```

If you wish to test ESBMC on an individual file from the FormAI dataset, use the command below (taking FormAI_14569.c as a reference):

bin/esbmc DATASET/FormAI_14569.c  --overflow --memory-leak-check --timeout 30 --unwind 1 --multi-property --no-unwinding-assertions

The file appears to be vulnerable, revealing a buffer overflow in the scanf() function. Below is the output from ESBMC:

  State 6 file FormAI_14569.c line 24 column 9 function main thread 0
  ----------------------------------------------------
  Violated property:
    file FormAI_14569.c line 24 column 9 function main
    buffer overflow on scanf
    0

  VERIFICATION FAILED

As a concluding action, you can confirm that this vulnerability is indeed present in the FormAI dataset by executing the command below:
```
cat FormAI-dataset/FormAI_dataset_human_readable-V1.csv | grep 'FormAI_14569.c'
```

The outpus is the same as ESBMC found : FormAI_14569.c,VULNERABLE,main,24.0,buffer overflow on scanf

Disclaimer

SHA256 checksum of the files:

FormAI-v1 dataset files:

FormAI_dataset_C_samples-V1.zip : fc458020ad0e9b0999882d4bfdd27edfdee2f0ff5de22848e11352d64230ca47 FormAI_dataset_classification-V1.zip : 82b00611d71ff992d85b7bba20ebece580bfd055939b0142861326dff17ede74
FormAI_dataset_human_readable-V1.csv : 316d7145006f48ec42c9a89b1dccb2c5e2c05012c926af65c0269aba45d47830

FormAI-v2 dataset files:

FormAI-v2-DATASET.7z : 3efbe466d65f96a91ebbf388d7e3a6c1e15941fffd5a5f10f10de2eeec9ebef3
FormAI-v2-classification.7z.001 : 1bdca8c4942ee9dd6f5a772640b9c9b9732ddc67afd996babde92e8c20c10b91
FormAI-v2-classification.7z.002 : 83b862cb732c7298a5c3e8798ae873cdf80ab7670defc9a2a0d9d3eaa8bbd127

WARNING: BE CAREFUL WHEN RUNNING THE COMPILED CODES, SOME CAN CONNECT TO THE WEB, SCAN YOUR LOCAL NETWORK, OR DELETE A RANDOM FILE FROM YOUR FILE SYSTEM. ALWAYS CHECK THE SOURCE CODE AND THE COMMENTS IN THE FILE BEFORE RUNNING IT!!!

FormAI-Dataset/FormAI-dataset