- [1] A Framework for Software Defect Prediction and Metric Selection, 2017
Shamsul Huda et al.
paper: https://ieeexplore.ieee.org/abstract/document/8240899
- [2] Learning a Metric for Code Readability, 2010
Raymond P. L. Buse and Westley R. Weimer
paper: https://web.eecs.umich.edu/~weimerw/p/weimer-tse2010-readability-preprint.pdf
source code: http://www.arrestedcomputing.com/readability
Buse pioneered code readability classification models. He used several structural features as the basis for classifying readability and trained a binary classifier with logistic regression; the training data consists of human readability ratings collected for 100 Java snippets.
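A minimal sketch of this kind of setup: a logistic-regression classifier over structural features. The two features here (average line length, identifier count) and the toy labels are hypothetical stand-ins; Buse's actual model uses a much larger feature set trained on the human-rated snippets.

```python
import numpy as np

# Toy "structural features" per snippet: [avg line length, identifier count]
# (hypothetical stand-ins for the real feature set).
X = np.array([[12.0, 3], [45.0, 14], [15.0, 4], [60.0, 20],
              [10.0, 2], [55.0, 18]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = readable, 0 = not readable

# Standardize features so gradient descent behaves well.
X = (X - X.mean(axis=0)) / X.std(axis=0)
Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column

w = np.zeros(Xb.shape[1])
for _ in range(2000):                      # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-Xb @ w))      # sigmoid probabilities
    w -= 0.1 * Xb.T @ (p - y) / len(y)     # cross-entropy gradient step

pred = (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5).astype(int)
print(pred)  # recovers the labels on this linearly separable toy set
```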
- [3] A General Software Readability Model, 2011
Jonathan Dorn
paper: https://web.eecs.umich.edu/~weimerw/students/dorn-mcs-paper.pdf
Dorn uses Fourier-transform techniques to aid code readability classification. His dataset is the largest among readability studies, containing 120 human-rated snippets each for Python, Java, and CUDA.
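The frequency-domain idea can be illustrated loosely as follows (Dorn's actual features operate on visual/structural renderings of code; the per-line indentation signal here is only an assumed example input):

```python
import numpy as np

snippet = [
    "def add(a, b):",
    "    return a + b",
    "",
    "def mul(a, b):",
    "    return a * b",
]

# Hypothetical 1-D signal: indentation depth per line.
signal = np.array([len(l) - len(l.lstrip(" ")) for l in snippet], dtype=float)

# Magnitude spectrum of the signal: low frequencies capture broad
# structural regularity, high frequencies capture rapid alternation.
spectrum = np.abs(np.fft.rfft(signal))
print(spectrum)
```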
- [4] Improving Code Readability Models with Textual Features, 2016
Simone Scalabrino et al.
paper: http://www.cs.wm.edu/~denys/pubs/ICPC'16-Readability.pdf
dataset: https://dibt.unimol.it/icpc2016/appendix.html
Proposes several textual features for measuring software readability and combines them with Buse's features to train a code readability classifier; Java files are used as the training dataset.
- [5] Improving code readability classification using convolutional neural networks, 2018
Qing Mi et al.
paper: https://www.sciencedirect.com/science/article/pii/S0950584918301496
source code: https://github.com/CityU-QingMi/DeepCRM
The first study to apply deep learning to code readability. It uses a CNN and labels the training data (25,000 Java files) automatically with the CheckStyle and PMD tools, achieving the highest readability classification accuracy to date. Three representations are used for training: token level, character level, and node level (abstract syntax tree).
- [6] code2vec: Learning Distributed Representations of Code, 2018
Uri Alon et al.
paper: https://arxiv.org/pdf/1803.09473.pdf
source code: https://github.com/tech-srl/code2vec
demo: https://code2vec.org/
Feeds source code through a deep learning model that learns the statements inside a function and predicts the function's name. It can be applied to large-scale source code analysis, e.g., helping engineers check whether a function is named appropriately, and supporting research such as defect prediction and code completion. The model uses two fully connected layers and an attention mechanism.
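The attention step can be sketched like this: each path-context extracted from a function is embedded, scored against a learned global attention vector, and the code vector is the softmax-weighted sum. The embeddings below are random stand-ins for trained ones, and the sizes are far smaller than in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                     # embedding size (the real model uses far larger)
n_contexts = 5            # path-contexts extracted from one function

ctx = rng.normal(size=(n_contexts, d))  # combined path-context vectors
a = rng.normal(size=d)                  # learned global attention vector

scores = ctx @ a                                  # one score per context
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over contexts
code_vector = weights @ ctx                       # attention-weighted sum

print(code_vector.shape)  # (8,)
```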
- [7] Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts, 2018
Rohan Bavishi et al.
paper: https://arxiv.org/abs/1809.05193
source code: https://github.com/rbavishi/Context2Name
In the JavaScript world, source code is often minified: variable names are shortened to meaningless ones and all whitespace and indentation is removed, either to reduce file size or to obscure the program logic. This work uses deep learning to learn the logic of the source code and restore meaningless variable names to meaningful ones, making the code easier to read. Existing recovery tools in this area include JSNice and JSNaughty.
- [8] Summarizing Source Code using a Neural Attention Model, 2016
Srinivasan Iyer et al.
paper: https://www.aclweb.org/anthology/P16-1195
source code: https://github.com/sriniiyer/codenn
Code captioning: uses deep learning to describe what a piece of code does, with an RNN + attention + embeddings; abbreviated CODE-NN. The dataset uses C# and SQL, but most of the code is incomplete, making it hard to build syntax trees.
- [9] A Convolutional Attention Network for Extreme Summarization of Source Code, 2016
Miltiadis Allamanis et al.
paper: https://arxiv.org/pdf/1602.03001.pdf
source code: https://github.com/mast-group/convolutional-attention
Uses CNN + attention to predict function names.
- [10] Improving Automatic Source Code Summarization via Deep Reinforcement Learning, 2018
Yao Wan et al.
paper: https://arxiv.org/pdf/1811.07234.pdf
Uses the dataset from [11].
- [11] A parallel corpus of Python functions and documentation strings for automated code documentation and code generation, 2017
Antonio Valerio Miceli Barone and Rico Sennrich
paper: https://arxiv.org/pdf/1707.02275.pdf
dataset: https://github.com/EdinburghNLP/code-docstring-corpus
Yao Wan [10] uses this paper's dataset.
- [12] Deep Code Comment Generation, 2018
Xing Hu et al.
paper: https://xin-xia.github.io/publication/icpc182.pdf
dataset: https://github.com/xing-hu/DeepCom
Source code summarization, abbreviated DeepCom; outperforms CODE-NN. It uses a novel method of describing the AST, because traditional traversal approaches (pre-order, post-order) cannot uniquely recover the AST. Uses a seq2seq model in which both the encoder and decoder are LSTMs; the word embeddings and hidden states are both 512-dimensional, and generated sentences are evaluated with BLEU-4. The dataset contains many duplicated data points.
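The non-uniqueness that motivates DeepCom's AST encoding is easy to demonstrate: two structurally different trees can share the same pre-order token sequence, so the sequence alone cannot reconstruct the tree.

```python
class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def preorder(node):
    """Flatten a tree into its pre-order label sequence."""
    out = [node.label]
    for c in node.children:
        out += preorder(c)
    return out

# Tree 1: A with two children B and C.
t1 = Node("A", [Node("B"), Node("C")])
# Tree 2: a chain A -> B -> C; different shape, same pre-order sequence.
t2 = Node("A", [Node("B", [Node("C")])])

print(preorder(t1), preorder(t2))  # both ['A', 'B', 'C']
```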
- [13] Automatic Source Code Summarization with Extended Tree-LSTM, 2019
Yusuke Shido et al.
paper: https://arxiv.org/pdf/1906.08094.pdf
source code: https://github.com/sh1doy/summarization_tf
Source code summarization; outperforms DeepCom and CODE-NN, with the dataset taken from DeepCom. Besides the ordinary LSTM there is also the Tree-LSTM, which is trained over a tree structure rather than a sequence; Figure 2(c) in the paper makes this clear. Tree-LSTMs come in two variants depending on how the cell is computed: the child-sum Tree-LSTM and the N-ary Tree-LSTM. Because the traditional Tree-LSTM cannot handle an arbitrary number of children per node, the authors propose an improved version, the extended Tree-LSTM, also called the multi-way Tree-LSTM. As for the final BLEU-4 score, I do not understand why it can be so low, only around 0.2; I suspect their implementation has issues, since DeepCom and my own training both reach at least 0.35. Like DeepCom, they do not resolve the duplicated data in the dataset.
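For reference, the standard child-sum cell accepts any number of children by summing their hidden states (which is also why it ignores child order, the property order-aware variants like the multi-way cell try to keep). A single-cell sketch with random toy weights, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 4  # hidden size (toy)

# One weight matrix per gate for the input term (W) and the
# summed-children term (U): input, output, forget, update.
W = {g: rng.normal(scale=0.1, size=(d, d)) for g in "iofu"}
U = {g: rng.normal(scale=0.1, size=(d, d)) for g in "iofu"}

def child_sum_cell(x, child_h, child_c):
    """x: node embedding; child_h / child_c: lists of child states."""
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros(d)
    i = sigmoid(W["i"] @ x + U["i"] @ h_sum)   # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_sum)   # output gate
    u = np.tanh(W["u"] @ x + U["u"] @ h_sum)   # candidate update
    c = i * u
    # One forget gate per child, computed from that child's own h.
    for hk, ck in zip(child_h, child_c):
        f_k = sigmoid(W["f"] @ x + U["f"] @ hk)
        c += f_k * ck
    h = o * np.tanh(c)
    return h, c

# Three leaf children feed one parent: no fixed arity is required.
leaves = [child_sum_cell(rng.normal(size=d), [], []) for _ in range(3)]
hs, cs = zip(*leaves)
h_root, c_root = child_sum_cell(rng.normal(size=d), list(hs), list(cs))
print(h_root.shape)  # (4,)
```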
- [14] Recommendations for Datasets for Source Code Summarization, 2019
Alex LeClair, Collin McMillan
paper: https://www.aclweb.org/anthology/N19-1394
data: http://leclair.tech/data/funcom/
To read.
- [15] Does BLEU Score Work for Code Migration?
Ngoc Tran et al.
paper: https://ieeexplore.ieee.org/document/8813269
To read.
- [16] METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
Satanjeev Banerjee and Alon Lavie
paper: https://www.aclweb.org/anthology/W05-0909.pdf
To read.
- [17] Automatic Documentation Generation via Source Code Summarization of Method Context
Paul W. McBurney and Collin McMillan
paper: https://www3.nd.edu/~cmc/papers/mcburney_icpc_2014.pdf
Generates function comments with a non-deep-learning method, covering "what the code does", "what function calls this function makes", and "what kind of statement it is used in (if, for, while)".
- [18] Supporting Program Comprehension with Source Code Summarization
Sonia Haiduc et al.
paper: https://ieeexplore.ieee.org/document/6062165
A non-deep-learning method that produces several keywords for a function to help engineers understand its purpose.
- [19] On the Use of Automated Text Summarization Techniques for Summarizing Source Code
Sonia Haiduc et al.
paper: https://ieeexplore.ieee.org/document/5645482
A non-deep-learning method that produces several keywords for a function to help engineers understand its purpose; by the same authors as [18], so the approach is very similar.
- [20] Towards Automatically Generating Summary Comments for Java Methods
Giriprasad Sridhara et al.
paper: https://dl.acm.org/doi/pdf/10.1145/1858996.1859006
A non-deep-learning method that describes a function's behavior; however, the link to the tool used by this method is now dead.
- [21] Generating Natural Language Summaries for Crosscutting Source Code Concerns
Sarah Rastkar et al.
paper: https://ieeexplore.ieee.org/document/6080777
Given a function in a project, this work finds the objects coupled with that function and lists them in a summary, as shown in Figure 1 of the attachment. The output does not describe the function's behavior, but rather which objects the function is coupled with.
- [22] Natural Language Models for Predicting Programming Comments
Dana Movshovitz-Attias et al.
paper: https://pdfs.semanticscholar.org/4a90/8858ba9223289c3b3b1d5ceceb9c70a78d6e.pdf
Aims to auto-complete function comments: given a partially written comment, it generates the rest, speeding up comment writing for engineers. For example, if a function's comment is "Train a named-entity extractor", then when the engineer has typed "Train a named-", the tool can complete it to "Train a named-entity" based on the code; when the engineer continues to "Train a named-entity ext", the tool can complete it to "Train a named-entity extractor".
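The completion behavior above can be mimicked, far more crudely than the paper's n-gram language models trained on code and comments, with a plain corpus prefix lookup; the comment corpus below is a hypothetical example:

```python
# Hypothetical comment corpus; the paper instead trains n-gram language
# models over source files rather than doing exact prefix matching.
corpus = [
    "Train a named-entity extractor",
    "Train a classifier on the dataset",
    "Tokenize the input string",
]

def complete(prefix, comments=corpus):
    """Return the first stored comment that extends the typed prefix."""
    for c in comments:
        if c.startswith(prefix):
            return c
    return prefix  # no match: leave the text as typed

print(complete("Train a named-"))  # 'Train a named-entity extractor'
print(complete("Tok"))             # 'Tokenize the input string'
```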
- [23] Automatic Generation of Natural Language Summaries for Java Classes
Laura Moreno et al.
paper: https://ieeexplore.ieee.org/document/6613830
Classifies Java classes and, based on each class's category and content, generates a one-sentence comment for the class.