Deep learning based vulnerability detection model
CONCOCTION is an automated, machine-learning-based vulnerability detection framework. It is the first DL system to learn program representations by combining static source code information with dynamic program execution traces.
Check our paper for detailed information.
Concoction builds upon:
- Python v3.6
- KLEE v3.0
- LLVM v13.0
The system was tested on the following operating systems:
- Ubuntu 18.04
See INSTALL.md for further details.
See usage.md for a step-by-step demo of Concoction.
This section introduces the training and validation dataset used in the paper. You can download the dataset from here.
(We manually verified the original dataset and removed low-quality samples, which speeds up training without compromising the model's performance.)
This folder contains the datasets used in our paper.
github:
This folder contains C functions from C-language open-source projects.
sard:
This folder contains C functions from the SARD standard vulnerability dataset.
All the data is stored in `.zip` files. After decompression, you will find `.txt` files, each of which represents a C function feature file.
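
The decompression step can be scripted with Python's standard library. The following is a minimal sketch, not part of Concoction itself: the function name `extract_feature_files` and any archive or directory names you pass to it are hypothetical and only serve as illustration.

```python
import zipfile
from pathlib import Path


def extract_feature_files(archive_path, out_dir):
    """Unpack one dataset archive and return the .txt feature files found in out_dir.

    Hypothetical helper for illustration; not part of the Concoction code base.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(out_dir)
    # Each .txt file is the feature file of one C function.
    return sorted(out_dir.rglob("*.txt"))
```

Calling `extract_feature_files("sard.zip", "data/sard")` (both names are placeholders) would return the sorted list of feature-file paths under `data/sard`.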
Each feature file (e.g. `2ok_jpg.c-ok_jpg_convert_data_unit_grayscale.c.txt`) includes static features (AST, CFG, DFG, and seven other edge types) and dynamic features (input variable values and execution traces).
Items | Labels | Values |
---|---|---|
Vulnerability or not | -----label----- | 0/1 |
Source code | -----code----- | static void ok_jpg_convert_d... |
Code relationship flow edges | -----children----- | 1,2 1,3 ... 1,4 |
Code relationship flow edges | -----nextToken----- | 2,4,7,9,10,13,15, |
Code relationship flow edges | -----computeFrom----- | 42,43 42,44 69,70 ... |
Code relationship flow edges | -----guardedBy----- | 90,92 101,102 101,103 ... |
Code relationship flow edges | -----guardedByNegation----- | 124,125 125,126 125,127 ... |
Code relationship flow edges | -----lastLexicalUse----- | 42,44 43,44 47,48 ... |
Code relationship flow edges | -----jump----- | 21,22 21,23 23,24 ... |
Node tokens | -----ast_node----- | const uint8_t *y const uint8_t uint8_t ... |
... | ... | ... |
Input variable values | =======testcase======== | y_inc:0x00000000 x_inc:0x00000000 ... |
Execution traces | =========trace========= | for(int x = 0;x < max_width;x++) out[0] = y[x]; out[1] = y[x]; ... |
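
To make the layout above concrete, here is a minimal parsing sketch. It assumes that each label in the table appears on its own line in the feature file and that the section content follows on the lines after it, up to the next label. That layout is an assumption inferred from the table, so check it against a real file first. The helper names (`parse_feature_text`, `parse_edges`, `parse_index_list`) are hypothetical and not part of Concoction.

```python
import re

# Assumed marker shape: a section name wrapped in runs of '-' or '=',
# e.g. "-----label-----" or "=======testcase========".
MARKER = re.compile(r"(?:-{3,}|={3,})(\w+)(?:-{3,}|={3,})")


def parse_feature_text(text):
    """Split feature-file text into {section name: content string}."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = MARKER.fullmatch(line.strip())
        if match:
            # A new section starts; later lines belong to it.
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


def parse_edges(content):
    """Turn pair-style content such as "1,2 1,3" into [(1, 2), (1, 3)]."""
    edges = []
    for pair in content.split():
        src, dst = pair.split(",")
        edges.append((int(src), int(dst)))
    return edges


def parse_index_list(content):
    """Turn list-style content such as "2,4,7," into [2, 4, 7]."""
    return [int(tok) for tok in content.split(",") if tok.strip()]
```

Under these assumptions, `parse_feature_text(open(path).read())["label"]` yields the `0`/`1` vulnerability label as a string, `parse_edges` handles pair-style sections such as `children` or `computeFrom`, and `parse_index_list` handles the comma-separated `nextToken` section.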
A full list of code vulnerabilities discovered by Concoction can be found here.
We welcome contributions to Concoction. If you are interested in contributing, please see this document.
If you use CONCOCTION in any of your work, please cite our paper:
@inproceedings{Concoction,
title={Combining Structured Static Code Information and Dynamic Symbolic Traces for Software Vulnerability Prediction},
author={Huanting Wang and Zhanyong Tang and Shin Hwei Tan and Jie Wang and Yuzhe Liu and Hejun Fang and Chunwei Xia and Zheng Wang},
booktitle={The IEEE/ACM 46th International Conference on Software Engineering (ICSE)},
year={2024},
}