Software defect prediction with semantic and structural information of codes based on Graph Neural Networks
Supplementary code and data for the paper "Software defect prediction with semantic and structural information of codes based on Graph Neural Networks".
@article{zhou2022software, title={Software defect prediction with semantic and structural information of codes based on Graph Neural Networks}, author={Zhou, Chunying and He, Peng and Zeng, Cheng and Ma, Ju}, journal={Information and Software Technology}, volume={152}, pages={107057}, year={2022}, publisher={Elsevier} }
This work extends GCN2defect: Graph Convolutional Networks for SMOTETomek-based Software Defect Prediction.
@INPROCEEDINGS{9700305, author={Zeng, Cheng and Zhou, Chun Ying and Lv, Sheng Kai and He, Peng and Huang, Jie}, booktitle={2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE)}, title={GCN2defect : Graph Convolutional Networks for SMOTETomek-based Software Defect Prediction}, year={2021}, volume={}, number={}, pages={69-79}, doi={10.1109/ISSRE52982.2021.00020}}
Each subdirectory already contains the corresponding Class Dependency Network (CDN) files (node.txt and edges.txt). If you want to generate your own CDN, you can use the Dependencyfinder API.
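As a rough illustration of how the CDN files can be consumed, the sketch below reads a tab-separated edge list into an adjacency map using only the standard library. The two-column, tab-separated layout (target first, then source) is inferred from how edges.txt is read by the dataset loader later in this README; the function name `load_cdn_edges` and the sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_cdn_edges(handle):
    """Read tab-separated (target, source) pairs into a source -> targets map.

    Assumption: each line holds two node ids separated by a tab, in the
    column order used by the loader in this README (target first).
    """
    successors = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        target, source = row[0].strip(), row[1].strip()
        successors[source].add(target)
    return {node: sorted(targets) for node, targets in successors.items()}


# Hypothetical sample standing in for an edges.txt file.
sample = io.StringIO("1\t0\n2\t0\n2\t1\n")
cdn = load_cdn_edges(sample)
print(cdn)
```

For a real project, pass an open file handle for that project's edges.txt instead of the in-memory sample.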
We have placed the processed AST and extracted token sequences (tokens.txt) in each subdirectory.
Before training CGCN, the attributes of the CDN nodes must be provided. Three types of node metrics are introduced as node attributes:
1) Traditional Static Code Metric: 20 manually designed metrics (Process-Binary.csv).
2) Network Embedding Metric: use the ProNE implementation to generate the network embedding file.
Run CGCN.py to generate embeddings.
The GCN part of our model is adapted from [stellargraph](https://github.com/stellargraph/stellargraph). A GCN demo is available at https://stellargraph.readthedocs.io/en/stable/demos/node-classification/gcn-node-classification.html.
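For readers unfamiliar with GCNs, the sketch below shows the propagation rule of a single graph convolutional layer, H' = ReLU(Â H W) with Â = D^{-1/2}(A + I)D^{-1/2}, in plain NumPy. It is a generic illustration of the technique, not the stellargraph implementation or the CGCN model; the toy graph, feature values, and weights are made up.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric adjacency matrix."""
    a_hat = adj + np.eye(adj.shape[0])   # add self-loops
    deg = a_hat.sum(axis=1)              # degrees including self-loops (>= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(adj, features, weights):
    """One GCN layer: ReLU(A_hat @ X @ W)."""
    return np.maximum(normalized_adjacency(adj) @ features @ weights, 0.0)


# Toy 3-node path graph 0 - 1 - 2 (illustrative values only).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
features = np.eye(3)          # one-hot node features
weights = np.ones((3, 2))     # hypothetical weight matrix
hidden = gcn_layer(adj, features, weights)
print(hidden.shape)
```

Each row of `hidden` mixes a node's own features with those of its CDN neighbours, which is how structural information enters the learned representation.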
To use your own dataset, follow these steps:
1) Replace the name in the red box in the following figure with the name of your dataset.
2) Hover over the dataset name, hold Ctrl, and click to open `__init__.py`.
Add the name of your dataset in `__init__.py`.
3) Hover over any other dataset name (not the one you just added), hold Ctrl, and click to open datasets.py.
4) Create your own class in datasets.py. For example, the following code creates the Ant dataset:

    # Goes into stellargraph's datasets.py, which is expected to already import
    # pandas as pd, DatasetLoader, StellarGraph and StellarDiGraph.
    # The feature scaler additionally needs:
    from sklearn import preprocessing


    class Ant(
        DatasetLoader,
        name="Ant",
        directory_name="Ant",
        url="",
        url_archive_format="",
        expected_files=[],
        description="",
        source="",
    ):
        _NUM_FEATURES = 20

        def load(
            self,
            directed=False,
            largest_connected_component_only=False,
            subject_as_feature=False,
            edge_weights=None,
            str_node_ids=False,
        ):
            nodes_dtype = str if str_node_ids else int
            return _load_defect_data(
                self,
                directed,
                largest_connected_component_only,
                subject_as_feature,
                edge_weights,
                nodes_dtype,
            )


    def _load_defect_data(
        dataset,
        directed,
        largest_connected_component_only,
        subject_as_feature,
        edge_weights,
        nodes_dtype,
    ):
        assert isinstance(dataset, (Ant))
        if nodes_dtype is None:
            nodes_dtype = dataset._NODES_DTYPE

        # Hard-coded data location: adjust these paths to your own checkout.
        node_data = pd.read_csv("E:\\gcn2defect\\data\\" + dataset.name + "\\Process-Binary.csv")
        edgelist = pd.read_csv(
            "E:\\gcn2defect\\data\\" + dataset.name + "\\edges.txt",
            sep="\t",
            header=None,
            names=["target", "source"],
            dtype=nodes_dtype,
        )
        # apply() returns a new frame, so the result has to be assigned back.
        node_data = node_data.apply(pd.to_numeric, errors="ignore")

        # bug == 1 -> "buggy", anything else -> "clean"
        labels = ["buggy" if value == 1 else "clean" for value in node_data["bug"]]
        subjects = pd.Series(labels, dtype="str")

        cls = StellarDiGraph if directed else StellarGraph

        # Columns 3..second-to-last hold the metrics; scale each one to [0, 1].
        features = node_data.iloc[:, 3:-1]
        minMax = preprocessing.MinMaxScaler()
        features_std = minMax.fit_transform(features)

        graph = cls({"class": features_std}, {"to": edgelist})

        if edge_weights is not None:
            # A weighted graph means computing a second StellarGraph after using the unweighted one to
            # compute the weights. The same scaled features are reused for the weighted graph.
            edgelist["weight"] = edge_weights(graph, subjects, edgelist)
            graph = cls({"class": features_std}, {"to": edgelist})

        if largest_connected_component_only:
            cc_ids = next(graph.connected_components())
            return graph.subgraph(cc_ids), subjects[cc_ids]

        return graph, subjects

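
The loader above hard-codes a Windows data path (`E:\gcn2defect\data\`). One possible way to make it portable is to resolve files relative to a configurable data root, as sketched below. The environment variable `GCN2DEFECT_DATA` and the helper `dataset_file` are hypothetical names introduced for illustration and are not part of the repository.

```python
import os
from pathlib import Path


def dataset_file(dataset_name, filename, root=None):
    """Build <root>/<dataset_name>/<filename> in an OS-independent way.

    The root falls back to the (hypothetical) GCN2DEFECT_DATA environment
    variable, and then to a local "data" directory.
    """
    base = Path(root or os.environ.get("GCN2DEFECT_DATA", "data"))
    return base / dataset_name / filename


metrics_path = dataset_file("Ant", "Process-Binary.csv", root="data")
edges_path = dataset_file("Ant", "edges.txt", root="data")
print(metrics_path.as_posix())
```

Inside `_load_defect_data`, the two `pd.read_csv` calls could then take `dataset_file(dataset.name, ...)` instead of concatenated strings.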
After generating the CGCN embeddings, we can run the downstream task by executing pipeline.py.
python==3.7
stellargraph==1.2.1
tensorflow-gpu==2.0.1
scikit-learn==1.0.2
networkx==2.6.3
projects | version | nodes | defective rate |
---|---|---|---|
ant | 1.4 | 175 | 22.86% |
ant | 1.6 | 343 | 26.82% |
ant | 1.7 | 732 | 22.40% |
camel | 1.2 | 578 | 36.68% |
camel | 1.4 | 805 | 18.01% |
camel | 1.6 | 886 | 21.22% |
jedit | 3.2 | 260 | 34.62% |
jedit | 4.0 | 293 | 25.60% |
jedit | 4.1 | 299 | 26.42% |
lucene | 2.0 | 181 | 50.28% |
lucene | 2.2 | 229 | 62.45% |
lucene | 2.4 | 324 | 62.35% |
poi | 1.5 | 228 | 60.53% |
poi | 2.5 | 371 | 65.77% |
poi | 3.0 | 427 | 65.34% |
velocity | 1.4 | 192 | 76.04% |
velocity | 1.5 | 212 | 66.51% |
velocity | 1.6 | 227 | 34.36% |
xalan | 2.4 | 676 | 16.12% |
xalan | 2.5 | 725 | 50.76% |
xalan | 2.6 | 810 | 46.17% |
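
The defective rates in the table can be turned back into approximate class counts, which helps when judging how imbalanced each version is. The sketch below does this for a few rows copied from the table; the helper name `defective_count` is hypothetical, and the counts are rounded estimates because the table reports rates to two decimals only.

```python
def defective_count(nodes, defective_rate_percent):
    """Estimate the number of defective classes from a node count and a rate in %."""
    return round(nodes * defective_rate_percent / 100)


# (project, version, nodes, defective rate in %) copied from the table above.
rows = [
    ("ant", "1.4", 175, 22.86),
    ("camel", "1.4", 805, 18.01),
    ("velocity", "1.4", 192, 76.04),
]

for project, version, nodes, rate in rows:
    buggy = defective_count(nodes, rate)
    print(f"{project} {version}: ~{buggy} defective, ~{nodes - buggy} clean")
```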