Experiment 4 - Transformer for Software Vulnerability Detection

This repo is for the application of Transformer based Neural Network for software vulnerability detection

Transformer

We tokkenize the source codes based on the syntactic structure.
Punctuations are kept because they are crucial in source codes. Therefore, if we remove them, we will love the syntactic structure of the source codes, thus alternate the semantic meaning of that functions.

We implemented the Transformer encoder layer only to train embeddings of the source codes.
The output of the encoder is feed into a 2 layer bi-directional LSTM.
The LSTM will output 2 tensors.
Softmax activation with cross entropy loss is used.

The ratio of the dataset for training : validation : testing = 80% : 10% : 10%
The ratio follows the actual ratio used in Russell et. al (2018)
The training session:
- 8 x Nvidia Tesla K80 (11GB memory) / 8 x Nvidia Tesla K80
- Pytorch 1.2 (CuDNN v9.2)
- Training time ~90 minutes per epoch / 180+ minutes per epoch

Highlighted the best value for each metric.

Name	Tokenizer	Ep	Weights	Vocab	TP	FP	TN	FN	Acc	Pr	Rec	PR-AUC	AUC	MCC	F1
Run_1	Custom	11	1:5	10K	5834	6974	112192	2419	0.9263	0.4555	0.7069	0.4926	0.9042	0.5307	0.5540
Run_5	Custom	14	1:3	10K	5394	5770	113396	2859	0.9323	0.4832	0.6536	0.4887	0.9030	0.5268	0.5556
Run_7	Clang	16	1:5.5	8K	5999	7216	111950	2254	0.9257	0.4540	0.7267	0.4967	0.9115	0.5379	0.5589
Run_8	Clang	14	1:5.5	7K	6034	7379	111787	2219	0.9247	0.4499	0.7311	0.4966	0.9107	0.5367	0.5570

This is a table of results for all the implementation done for Draper VDISC dataset.

Reference	TP	FP	TN	FN	Acc	Precision	Recall	PR-AUC	AUC	MCC	F1
Russel et.al (2018) - CNN+RF	NA	NA	NA	NA	NA	NA	NA	0.518	0.904	0.536	0.566
Russel et.al (2018) - CNN	NA	NA	NA	NA	NA	NA	NA	0.467	0.897	0.509	0.540
Replication of Russell (2018) - CNN 4th model	5093	8666	110500	3160	0.9071	0.3701	0.6171	0.3665	0.8830	0.4317	0.4627
ASTNN (CodeSensor-Complete)	390	3551	113984	7806	0.9097	0.099	0.0476	0.0535	0.4278	0.0246	0.0643
ASTNN (CodeSensor-Minimal)	1091	14145	93641	5267	0.8299	0.0717	0.1716	0.0520	0.4760	0.0272	0.1010

This Transformer based approach manages to beat some of the metrics reported by Russell et. al. (2018)
However, this model is in early stages. There a still room for improvements since a lot of hyper-parameter tuning can be done.
It is also safe to note that, this could be potential path to explore as Transformer (2017) architecture has evolved greatly. For example, BERT (2018) and ALBERT (2019). These advancements in Transformer network allows them to achieve SOTA performance on benchmark datasets in NLP.
What about source codes?