The aim of this project is to build a tool that automatically produces a natural-language summary of source code written in Python.
Directory | Description |
---|---|
V2 | `V2/parallel` contains the initially extracted function declarations, descriptions, and bodies. `V2/repo_split` contains the splits used in the original paper. |
data | The processed dataset of function declaration + body, description, and generated ASTs for Python and Java. |
notebooks for java data | Notebook artifacts used to process the data, train, and translate. The notebooks act as logs of training, showing the moving train accuracy, periodic validation accuracy, and runtimes for each experiment. |
models for python data | Models generated for the Python dataset. |
predictions for java | Prediction output from the tests conducted to evaluate the importance of different pre-processing steps on the Java dataset. |
evals | Evaluation scores. |
utils | Helper methods for preprocessing tasks. |
parallel-corpus | The raw corpus used in the original paper. |
The Python 3 source code, obtained from GitHub repositories, was preprocessed by removing comments and semantically irrelevant spaces and newlines. An example of an extracted function:
```python
def _interceptdot(w, X, y):
    """
    Computes y * np.dot(X, w).
    It takes into consideration whether the intercept should be fit.

    Parameters
    ----------
    w : ndarray, ndarray, shape (n_features,) or (n_features + 1,)
        Coefficient vector.
    [...]
    """
    c = 0
    if w.size == X.shape[1] + 1:
        c = w[-1]
        w = w[:-1]
    z = safe_sparse_dot(X, w) + c
    yz = y * z
    return w, c, yz
```
After preprocessing, the same function is flattened to a single line per unit, with newlines encoded as `DCNL` and indentation levels as `DCSP`:

```
def _interceptdot(w, X, y):
' Computes y*np.dot(X, w) . DCNL It takes into consideration if the intercept should be fit or not. DCNL Parameters DCNL w : ndarray , shape ( n_features , ) or (n_features + 1 , ) DCNL Coefficient vector . DCNL [ . . . ] '
DCSP c = 0.0 DCNL DCSP if (w.size == (X.shape [1] + 1) ): DCNL DCSP DCSP c = w[(-1)] DCNL DCSP DCSP w = w[:(-1)] DCNL DCSP z = (safe_sparse_dot(X, w) + c) DCNL DCSP yz = (y*z) DCNL DCSP return w, c, yz
```
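The `DCNL`/`DCSP` flattening can be sketched as a simple string transform. This is a minimal illustration, not the repository's actual preprocessing code: the helper name is ours, and it assumes one `DCSP` token per four-space indentation level.

```python
def encode_dcnl_dcsp(source: str, indent_width: int = 4) -> str:
    """Flatten a code snippet to one line, marking newlines with DCNL
    and each leading indentation unit with DCSP (illustrative sketch)."""
    out_tokens = []
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        # number of indentation levels consumed by leading spaces
        levels = (len(line) - len(stripped)) // indent_width
        out_tokens.append(("DCSP " * levels) + stripped)
    return " DCNL ".join(out_tokens)

code = "def f(x):\n    y = x + 1\n    return y"
print(encode_dcnl_dcsp(code))
# → def f(x): DCNL DCSP y = x + 1 DCNL DCSP return y
```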
The AST generated for the same function:

```
Module(body=[FunctionDef(name='_interceptdot', args=arguments(args=[arg(arg='w', annotation=None), arg(arg='X', annotation=None), arg(arg='y', annotation=None)], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]), body=[Expr(value=Str(s='\n Computes y*np . dot (X, w).\n It takes into consideration whether the intercept should be fit.\n Parameters\n ----------\n w : ndarray , ndarray , shape ( n_features , ) or ( n_features + 1 , )\n Coefficient vector .\n [ . . . ]\n\n ')), Assign(targets=[Name(id='c', ctx=Store())], value=Num(n=0)), If(test=Compare(left=Attribute(value=Name(id='w', ctx=Load()), attr='size', ctx=Load()), ops=[Eq()], comparators=[BinOp(left=Subscript(value=Attribute(value=Name(id='X', ctx=Load()), attr='shape', ctx=Load()), slice=Index(value=Num(n=1)), ctx=Load()), op=Add(), right=Num(n=1))]), body=[Assign(targets=[Name(id='c', ctx=Store())], value=Subscript(value=Name(id='w', ctx=Load()), slice=Index(value=UnaryOp(op=USub(), operand=Num(n=1))), ctx=Load())), Assign(targets=[Name(id='w', ctx=Store())], value=Subscript(value=Name(id='w', ctx=Load()), slice=Slice(lower=None, upper=UnaryOp(op=USub(), operand=Num(n=1)), step=None), ctx=Load())), Assign(targets=[Name(id='z', ctx=Store())], value=BinOp(left=Call(func=Name(id='safe_sparse_dot', ctx=Load()), args=[Name(id='X', ctx=Load()), Name(id='w', ctx=Load())], keywords=[]), op=Add(), right=Name(id='c', ctx=Load()))), Assign(targets=[Name(id='yz', ctx=Store())], value=BinOp(left=Name(id='y', ctx=Load()), op=Mult(), right=Name(id='z', ctx=Load()))), Return(value=Tuple(elts=[Name(id='w', ctx=Load()), Name(id='c', ctx=Load()), Name(id='yz', ctx=Load())], ctx=Load()))], orelse=[])], decorator_list=[], returns=None)])
```
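A dump like the one above can be reproduced with Python's standard `ast` module (exact node names vary by Python version, e.g. `Num` versus `Constant`):

```python
import ast

# parse a small function and dump its abstract syntax tree as a string
source = "def add(a, b):\n    return a + b"
tree = ast.parse(source)
print(ast.dump(tree))
```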
Details on the dataset can be found at http://leclair.tech/data/funcom/#procdata
We use the filtered / tokenized datasets for the purpose of this project.
Due to GitHub's file-size restrictions, all files can be downloaded from the following links:
Each link contains a compressed file with the following folders:
- train
- functions
- comments
- valid
- functions
- comments
- test
- functions
- comments
NOTE: The AST files were generated by reading the source code from the JSON files and parsing it with the `javalang` module in Python.
NOTE: *Cleaned* means the removal of parentheses, brackets, and other special characters.
NOTE: *Preprocessed* here means:
- The methods are split into method name and method body, and both have been tokenized and subtokenized
- The method name has been stripped of parameters
- The method body has been converted to an AST, with several paths sampled between terminal nodes
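The path-sampling step can be illustrated with Python's standard `ast` module. This is a simplified sketch of the code2seq idea, not the actual extractor used in the project; the function name and sampling scheme are ours.

```python
import ast
import random

def leaf_paths(source, max_paths=3, seed=0):
    """Sample AST paths between pairs of terminal (leaf) nodes.

    A path runs from one leaf up through node-type names to the lowest
    common ancestor, then down to the other leaf (simplified code2seq)."""
    tree = ast.parse(source)
    chains = []  # for every leaf: node-type names from root down to that leaf

    def walk(node, trail):
        trail = trail + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            chains.append(trail)
        for child in children:
            walk(child, trail)

    walk(tree, [])
    pairs = [(a, b) for i, a in enumerate(chains) for b in chains[i + 1:]]
    rng = random.Random(seed)
    paths = []
    for a, b in rng.sample(pairs, min(max_paths, len(pairs))):
        # depth of the lowest common ancestor = length of the common prefix
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        # up from leaf a to the LCA, then down to leaf b
        paths.append(list(reversed(a[k:])) + [a[k - 1]] + b[k:])
    return paths

print(leaf_paths("x = 1"))
# → [['Store', 'Name', 'Assign', 'Constant']]
```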
- The following are the inputs and targets when AST encodings are used:
- unprocessed_ast_list_train.zip
- unprocessed_ast_list_dev.zip
- unprocessed_ast_list_test.zip
- unprocessed_summary_train.zip
- unprocessed_summary_dev.zip
- unprocessed_summary_test.zip
- The following are the inputs and targets when AST encodings are not used; instead, function bodies are used to train the model:
- input_train.txt
- input_dev.txt
- input_test.txt
- summary_train.txt
- summary_dev.txt
- summary_test.txt
Source: code2seq
We use the Transformer model as the main architecture for our work. The standard encoder-decoder architecture has performed well on many machine translation tasks, and the Transformer builds on it. The success of Transformers across most NLP tasks has shown that a simple architecture composed of attention mechanisms alone can outperform more complex CNN and RNN architectures.
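The attention mechanism at the heart of the Transformer can be sketched in a few lines of pure Python. This is an illustration of scaled dot-product attention only, not the project's model code; the function name and toy inputs are ours.

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors (pure Python).
    Each query attends over all keys; the output for a query is a
    softmax-weighted sum of the value vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        # numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# a query aligned with the first key attends almost entirely to the first value
print(scaled_dot_product_attention([[10.0, 0.0]],
                                   [[10.0, 0.0], [0.0, 10.0]],
                                   [[1.0, 0.0], [0.0, 1.0]]))
```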
- Evaluate the influence of Abstract Syntax Trees in the input to our summarization model
- Assess the effect of static vs. dynamic typing by training the model on both Java and Python datasets
- Gauge the importance of pre-processing the dataset before feeding it to the model vs. not performing any pre-processing
- BLEU score
- ROUGE-1
- ROUGE-2
- ROUGE-3
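As an illustration of how these n-gram overlap metrics work, here is a minimal pure-Python sketch of ROUGE-N recall. Actual evaluation would use standard tooling; `rouge_n` is an illustrative helper, not part of the repository.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: fraction of the reference's n-grams that
    also appear in the candidate summary."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())   # clipped n-gram matches
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n("computes the dot product",
              "computes the dot product of x and w"))
# → 0.5 (4 of the 8 reference unigrams appear in the candidate)
```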
[1] Barone, Antonio Valerio Miceli, and Rico Sennrich. "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation." arXiv preprint arXiv:1707.02275 (2017).
[2] Alon, Uri, Omer Levy, and Eran Yahav. "code2seq: Generating sequences from structured representations of code." arXiv preprint arXiv:1808.01400 (2018).