CONCOCTION

CONCOCTION is an automated machine learning-based vulnerability detection framework that combines static source code information and dynamic program execution traces.

License: CC-BY-4.0

Deep learning based vulnerability detection model

Introduction

CONCOCTION is an automated machine learning-based vulnerability detection framework. It is the first deep learning (DL) system to learn program representations by combining static source code information with dynamic program execution traces.

Check our paper for detailed information.

Installation

Concoction builds upon:

  • Python v3.6
  • KLEE v3.0
  • LLVM v13.0

The system was tested on the following operating systems:

  • Ubuntu 18.04

See INSTALL.md for further details.

Usage

See usage.md for a step-by-step demo of Concoction.

Data

This section describes the training and validation datasets used in the paper. You can download the dataset from here.

We manually verified the original dataset and removed low-quality samples, which speeds up training without compromising the model's performance.

Open datasets used in training and evaluation

This folder contains the datasets used in our paper.

github: This folder contains C functions from open-source C projects.

sard: This folder contains C functions from the SARD standard vulnerability dataset.

Data structure

All the data is stored in .zip files. After decompression, you will find .txt files, each of which is the feature file for one C function. Each feature file (e.g. 2ok_jpg.c-ok_jpg_convert_data_unit_grayscale.c.txt) includes static features (AST, CFG, DFG, and seven other edge types) and dynamic features (input variable values and execution traces).
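
The feature files can also be read straight from an archive without unpacking it first. The following is a minimal sketch using only the Python standard library; the helper name iter_feature_files and the UTF-8 decoding are assumptions made for illustration, not part of Concoction.

```python
import zipfile


def iter_feature_files(zip_path):
    """Yield (member name, text) for every .txt entry in a dataset archive.

    zip_path may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".txt"):
                # Assumption: files are UTF-8; undecodable bytes are replaced
                # rather than raising, since the encoding is not documented.
                yield name, archive.read(name).decode("utf-8", errors="replace")
```

Typical use would be a loop such as `for name, text in iter_feature_files("path/to/dataset.zip"): ...`, where the path is a placeholder for whichever archive you downloaded.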

Description of text example

Items                         Labels                        Values
Vulnerability or not          -----label-----               0/1
Source code                   -----code-----                static void ok_jpg_convert_d...
Code relationship flow edges  -----children-----            1,2
                                                            1,3
                                                            ...
                                                            1,4
Code relationship flow edges  -----nextToken-----           2,4,7,9,10,13,15,
Code relationship flow edges  -----computeFrom-----         42,43
                                                            42,44
                                                            69,70
                                                            ...
Code relationship flow edges  -----guardedBy-----           90,92
                                                            101,102
                                                            101,103
                                                            ...
Code relationship flow edges  -----guardedByNegation-----   124,125
                                                            125,126
                                                            125,127
                                                            ...
Code relationship flow edges  -----lastLexicalUse-----      42,44
                                                            43,44
                                                            47,48
                                                            ...
Code relationship flow edges  -----jump-----                21,22
                                                            21,23
                                                            23,24
                                                            ...
Node tokens                   -----ast_node-----            const uint8_t *y
                                                            const uint8_t
                                                            uint8_t
                                                            ...
...                           ...                           ...
Input variable values         =======testcase========       y_inc:0x00000000
                                                            x_inc:0x00000000
                                                            ...
Execution traces              =========trace=========       for(int x = 0;x < max_width;x++)
                                                            out[0] = y[x];
                                                            out[1] = y[x];
                                                            ...
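
The layout in the table above can be split into named sections with a few lines of Python. This is a minimal sketch, not Concoction's own loader. It assumes that each label from the table sits on its own line in the file and that its values follow on the lines below it. parse_feature_text and parse_edges are hypothetical helper names, and the sample string is made up for illustration.

```python
import re

# Assumed marker shape: a name wrapped in runs of '-' or '=' on its own line,
# e.g. "-----label-----" or "=======testcase========".
MARKER = re.compile(r"^\s*(?:-{3,}|={3,})\s*(\w+)\s*(?:-{3,}|={3,})\s*$")


def parse_feature_text(text):
    """Split feature-file text into {section name: list of non-empty lines}."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = MARKER.match(line)
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.rstrip())
    return sections


def parse_edges(lines):
    """Convert lines such as "1,2" into (1, 2) pairs; other lines are skipped."""
    edges = []
    for line in lines:
        parts = [p.strip() for p in line.split(",") if p.strip()]
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            edges.append((int(parts[0]), int(parts[1])))
    return edges


# Made-up sample that follows the assumed layout.
sample = """-----label-----
1
-----code-----
static void f(void) {}
-----children-----
1,2
1,3
=======testcase========
y_inc:0x00000000
=========trace=========
out[0] = y[x];
"""

sections = parse_feature_text(sample)
children = parse_edges(sections["children"])
```

Note that parse_edges only handles the pairwise sections (children, computeFrom, guardedBy, and so on). The nextToken section is shown as a single comma-separated sequence, so it would need its own handling.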

Main Results

A full list of code vulnerabilities discovered by Concoction can be found here.

Contributing

We welcome contributions to Concoction. If you are interested in contributing, please see this document.

Citation

If you use CONCOCTION in any of your work, please cite our paper:

@inproceedings{Concoction,
      title={Combining Structured Static Code Information and Dynamic Symbolic Traces for Software Vulnerability Prediction},
      author={Huanting Wang and Zhanyong Tang and Shin Hwei Tan and Jie Wang and Yuzhe Liu and Hejun Fang and Chunwei Xia and Zheng Wang},
      booktitle={The IEEE/ACM 46th International Conference on Software Engineering (ICSE)},
      year={2024},
}