/GraphExtract

Extract inter-procedural control flow graph from C/C++ source code for graph neural networks

Primary LanguagePython

Extract inter-procedural control flow graph from C/C++ source code for graph neural networks

This script is based on the open-source code analysis tool Joern (joern.io).

After parsing the source code, Joern can generate a control flow graph (CFG) for every function in the program. However, it cannot generate a single CFG for the whole program. In other words, Joern cannot find the link between a function call and the called function (until Nov 2022). For example, if there is a Reset() call in the program, we cannot directly find where the definition of Reset() is.

To solve this problem, this script can generate a graph representation of the whole program (like inter-precedual CFG) based on the output from Joern. Such representation can be processed by graph neural networks.

In the representation, there are three types of nodes:

  • Method "start" nodes, i.e., beginnings of methods.
  • Method "end" nodes, i.e., ends of methods.
  • CodeCount nodes which are at the beginnings of every branch. This type of nodes are inserted by pq9cov (link), a simple code coverage collection tool.

Each node has a feature vector which contains:

  • Type of the node (0 for method start, 0.5 for method return, 1 for CodeCount).
  • Method name vector generated by Word2Vec.
  • File name vector generated by Word2Vec.

These nodes are connected by directed edges without weights.

How to use it

First use Joern to generate CFGs for functions in the programs with the following command:

cpg.method.isExternal(false).nameNot(".*<.*>.*").map(node => (node.id, node.methodReturn.id, node.name, node.filename, node.lineNumber.l, node.dotCfg.l)).toJsonPretty |> "methods.txt"

And then you set the line 90 of GraphExtract.py:

wv = MyWord2Vec(path='.\xxx', vectorSize=63)

Note that .\xxx is the path of the folder which contains source code files (after instrumented by pq9cov). vectorSize is the length of the method name vector and file name vector.

Now run GraphExtract.py, you will get a json file with feature vectors and edges of the graph.

More information

Please refer to section 3.4 in the following work:

Li, Z. (2022). Use Reinforcement Learning to Generate Testing Commands for Onboard Software of Small Satellites.

If you use this script, please also cite the work above.