RepresentationsLearningFromMulti_domain

To learn function representations from multi-domain knowledge bases for software vulnerability detection.



Hi there, welcome!

This page contains the code and data used in the paper [Software Vulnerability Discovery via Learning Multi-domain Knowledge Bases].

Idea:

Machine Learning (ML) based techniques for automated software vulnerability detection are a promising research direction. However, the performance of ML-based vulnerability detection approaches is far from satisfactory, so manual detection still holds a dominant position. One of the main reasons is that there is insufficient high-quality labeled vulnerability data available for training a statistically robust classifier (in the supervised detection scenario). In this paper, we propose to utilize multiple vulnerability-relevant data sources to compensate for the shortage of labeled data and to facilitate the learning of high-level representations that are indicative of vulnerable function patterns.

The vulnerability-relevant data sources refer to the labeled data made available by other studies and the artificially constructed vulnerability test cases from the Software Assurance Reference Dataset (SARD) project. We propose a framework that bridges the differences between the data from the two sources and learns function-level representations as effective features, without the need for a tedious manual feature-engineering process. Moreover, the framework can be extended when more data sources become available. Our experiments demonstrate the effectiveness of the proposed framework.

About this repository:

The "Code" folder contains the Python code for:

  1. Loading the source code functions -- LoadCFilesAsText.py
  2. Training the Bi-LSTM model -- BiLSTM_model.py and Train_BiLSTM.py
  3. Obtaining representations -- ObtainRepresentations.py
  4. Using the representations as features for training and testing -- testRandomForest.py
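As an illustration of step 1, the loading stage can be sketched as follows. This is a minimal stand-in for LoadCFilesAsText.py; the function name and folder layout here are assumptions for illustration, not the repository's actual interface:

```python
import os

def load_c_files_as_text(root_dir):
    """Recursively collect .c files under root_dir and return a dict
    mapping each file path to its source text."""
    sources = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith(".c"):
                path = os.path.join(dirpath, name)
                # The samples are plain text; skip undecodable bytes defensively.
                with open(path, encoding="utf-8", errors="ignore") as f:
                    sources[path] = f.read()
    return sources
```

The returned dictionary can then be fed to the AST extraction and serialization steps described under "Data preprocessing" below.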

The "Data" folder contains the data samples used in the paper to replicate and validate the findings:

For a small test, you can use the following:

  1. The serialized Abstract Syntax Tree (AST) sequence samples extracted from the FFmpeg project -- stored in the "The_CVE_samples" folder.
  2. The source code of the SARD project test cases and their obfuscated versions -- stored in the "The_SARD_samples" folder.
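Conceptually, a serialized AST sequence is a depth-first flattening of a function's parse tree into a token sequence. The following toy sketch illustrates the idea only; the tuple-based tree encoding is an assumption made for this example, and the real sequences come from CodeSensor output, not from this function:

```python
def serialize_ast(node):
    """Depth-first flatten of a toy AST.

    A node is a (type, [children]) tuple; the result is a token
    sequence of node-type strings, in traversal order.
    """
    node_type, children = node
    tokens = [node_type]
    for child in children:
        tokens.extend(serialize_ast(child))
    return tokens

# A toy AST for something like: int add(int a, int b) { return a + b; }
toy_ast = ("func", [
    ("params", [("param", []), ("param", [])]),
    ("body", [("return", [("add_expr", [])])]),
])
```

Sequences of this form are what the Bi-LSTM consumes as input during training.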

We also provide labeled vulnerable and non-vulnerable functions collected from six open-source projects, which can be found in the Six_projects.zip file in the "Data" folder.

Reproduction:

  1. Software Environment

The dependencies can be installed using Anaconda. For example:

$ bash Anaconda3-5.3.1-Linux-x86_64.sh
  2. Hardware
  • An NVIDIA GPU with at least 4 GB of video memory is required. Because we use CuDNNLSTM, training on a mainstream GPU achieved a 12x ~ 50x speedup on our dataset compared with training on a server equipped with two high-end Intel Xeon CPUs with 48 logical cores in total.
  3. Data preprocessing
  • The AST parser -- The AST of a source code function can be extracted using a tool called CodeSensor. In our paper, we used an older version, codeSensor-0.2.jar. The parser does not require a build environment or supporting libraries to parse source code functions into ASTs.
  • The obfuscation tool we used to obfuscate the SARD project samples is called Snob.
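To make the representation step concrete, here is a minimal numpy sketch of the idea behind ObtainRepresentations.py: embed a serialized AST token sequence, run a recurrent pass in each direction, and concatenate the two final hidden states into one fixed-length function vector. For clarity this uses a plain tanh RNN cell with random weights rather than the trained CuDNNLSTM; the function name, dimensions, and weights are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

def bidirectional_representation(token_ids, vocab_size=100,
                                 embed_dim=8, hidden_dim=16, seed=0):
    """Return a fixed-length vector (2 * hidden_dim) for one function,
    given its serialized AST as a list of integer token ids."""
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # embedding table
    Wx = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))  # input weights
    Wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) # recurrent weights

    def run(ids):
        h = np.zeros(hidden_dim)
        for t in ids:
            h = np.tanh(E[t] @ Wx + h @ Wh)  # simple tanh RNN update
        return h

    fwd = run(token_ids)               # left-to-right pass
    bwd = run(token_ids[::-1])         # right-to-left pass
    return np.concatenate([fwd, bwd])  # the function representation
```

Stacking one such vector per function row-wise yields the feature matrix that a downstream classifier (the random forest in testRandomForest.py) is trained and tested on.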

Thank you!