This repository contains the code for TreeCaps, introduced in the following paper: TreeCaps: Tree-Structured Capsule Networks for Program Source Code Processing (NeurIPS Workshops 2019)
Vinoj Jayasundara, Nghi Duy Quoc Bui, Lingxiao Jiang, David Lo
Program comprehension is a fundamental task in software development and maintenance processes. Software developers often need to understand a large amount of existing code before they can develop new features or fix bugs in existing programs. Being able to process programming language code automatically and provide summaries of code functionality accurately can significantly help developers to reduce time spent in code navigation and understanding, and thus increase productivity. Different from natural language articles, source code in programming languages often follows rigid syntactical structures and there can exist dependencies among code elements that are located far away from each other through complex control flows and data flows. Existing studies on tree-based convolutional neural networks (TBCNN) and gated graph neural networks (GGNN) are not able to capture essential semantic dependencies among code elements accurately. In this paper, we propose novel tree-based capsule networks (TreeCaps) and relevant techniques for processing program code in an automated way that encodes code syntactical structures and captures code dependencies more accurately. Based on evaluation on programs written in different programming languages, we show that our TreeCaps-based approach can outperform other approaches in classifying the functionalities of many programs.
Our system comprises three main steps, (b) to (d), with (a) giving the overview:
(a) TreeCaps approach overview. The source code is parsed, vectorized, and fed into the proposed TreeCaps network for the program classification task
(b) Tree Vectorization, which generates the AST from the source code and vectorizes it using an embedding generation technique
(c) Variable-to-Static Routing, which routes a variable set of capsules to generate a static set of capsules
(d) Dynamic Routing between the Primary Static Capsules and the Code Capsules
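To make step (d) more concrete, here is a minimal, framework-free sketch of routing-by-agreement between two capsule layers. It is not taken from this repository: the function names (`squash`, `softmax`, `dynamic_routing`), the plain-list data layout, and the toy input are all assumptions chosen for illustration, and the actual TreeCaps code may differ in its details.

```python
import math

def squash(s, eps=1e-9):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, num_iters=3):
    """Route input capsules to output capsules by agreement.

    u_hat[i][j] is the prediction vector that input capsule i makes for
    output capsule j. Returns the output capsule vectors and the coupling
    coefficients used in the last iteration.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    for _ in range(num_iters):
        # Each input capsule distributes its output over the output capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            # Weighted sum of the predictions for output capsule j.
            s_j = [sum(c[i][j] * u_hat[i][j][k] for i in range(n_in))
                   for k in range(dim)]
            v.append(squash(s_j))
        # Raise the logit wherever a prediction agrees with the output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v, c

# Toy input (hypothetical): 3 input capsules, 2 output capsules, dimension 2.
# All inputs agree on output capsule 0; they disagree on output capsule 1.
u_hat = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, -1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
]
v, c = dynamic_routing(u_hat)
```

In this toy run the coupling coefficients of input capsule 0 shift toward output capsule 0, because that is where the predictions of all three input capsules agree.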
- Install the required dependencies from `requirements.txt`: `pip install -r requirements.txt`.
- Clone this repo: `git clone https://github.com/vinojjayasundara/treecaps.git`.
- Download and extract the dataset and the pre-trained embedding.
- Simply run `python job.py`.
- Note the following in `job.py`:
  * Set `training = 1` for training the model and `training = 0` for testing.
  * Uncomment lines 18-20 in `job.py` to continue training with a reduced learning rate.
We used three datasets in three programming languages to ensure cross-language robustness:
- Dataset A: 6 classes of sorting algorithms, with 346 training programs on average per class, written in Python.
- Dataset B: 10 classes of sorting algorithms, with 64 training programs on average per class, written in Java.
- Dataset C: 104 classes of C programs, with 375 training programs on average per class.
Comparison of TreeCaps with other approaches. The means and the standard deviations from 3 trials are shown.
Model | Dataset A | Dataset B | Dataset C |
---|---|---|---|
GGNN (Allamanis et al.) | - | 85.00% | 86.52% |
TBCNN (Mou et al.) | 99.30% | 75.00% | 79.40% |
TreeCaps | 100.00 ± 0.00% | 92.11 ± 0.90% | 87.95 ± 0.23% |
TreeCaps (3-ensembles) | 100.00% | 94.08% | 89.41% |
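
The sketch below shows one way a "mean ± standard deviation over 3 trials" figure and a 3-model ensemble prediction could be computed. This README does not state which standard deviation (sample or population) was used or how the ensemble combines its models, so both choices here (sample standard deviation, averaging class probabilities) are assumptions. The helper names and all numbers in the example are made-up placeholders, not results from the paper.

```python
import statistics

def mean_and_std(accuracies):
    """Mean and sample standard deviation of per-trial accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def ensemble_predict(model_probs):
    """Average class probabilities across models and return the top class.

    model_probs[m][k] is the probability model m assigns to class k.
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    avg = [sum(p[k] for p in model_probs) / n_models for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: avg[k])

# Placeholder accuracies from three hypothetical trials (not real results).
mean_acc, std_acc = mean_and_std([0.91, 0.92, 0.93])
print(f"{100 * mean_acc:.2f} ± {100 * std_acc:.2f}%")

# Placeholder class probabilities from three hypothetical models, one sample.
predicted_class = ensemble_predict([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
```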
This code builds on an existing CapsNet implementation and an existing tree-based convolution implementation. We thank and credit the contributors of these repositories.
vinojjayasundara@gmail.com
Discussions, suggestions and questions are welcome!