Improved Query Reformulation for Concept Location using CodeRank and Document Structures
Mohammad Masudur Rahman and Chanchal K. Roy
Abstract: During software maintenance, developers usually deal with a significant number of software change requests. As a part of this, they often formulate an initial query from the request texts and then attempt to map the concepts discussed in the request to relevant source code locations in the software system (a.k.a., concept location). Unfortunately, studies suggest that developers often perform poorly in choosing the right search terms for a change task. In this paper, we propose a novel technique, ACER, that takes an initial query, identifies appropriate search terms from the source code using a novel term weight, CodeRank, and then suggests an effective reformulation of the initial query by exploiting the source document structures, query quality analysis and machine learning. Experiments with 1,675 baseline queries from eight subject systems show that our technique can improve 71% of the baseline queries, which is highly promising. A comparison with five closely related existing query reformulation techniques not only validates our empirical findings but also demonstrates the superiority of our technique.
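At the heart of ACER is CodeRank, a graph-based term weight that scores terms from the source code by propagating importance over a term graph, in the spirit of PageRank. The sketch below illustrates that propagation idea only; the class name, toy co-occurrence graph and uniform initialization are illustrative assumptions, not the implementation shipped in this package (see acer-runner.jar and the ACER source for the real thing).

import java.util.*;

// Minimal sketch of PageRank-style term scoring (illustrative; not ACER's code).
public class TermRankSketch {

    // graph: term -> terms it co-occurs with (e.g., within the same signature)
    public static Map<String, Double> rank(Map<String, Set<String>> graph,
                                           double damping, int iterations) {
        Map<String, Double> score = new HashMap<>();
        for (String term : graph.keySet()) score.put(term, 1.0);
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String term : graph.keySet()) next.put(term, 1.0 - damping);
            for (Map.Entry<String, Set<String>> e : graph.entrySet()) {
                // each term shares its current score equally among its neighbours
                double share = score.get(e.getKey()) / Math.max(1, e.getValue().size());
                for (String nbr : e.getValue())
                    next.merge(nbr, damping * share, Double::sum);
            }
            score = next;
        }
        return score;
    }

    public static void main(String[] args) {
        // Toy graph built from identifiers split out of code (hypothetical data)
        Map<String, Set<String>> g = new HashMap<>();
        g.put("query", new HashSet<>(Arrays.asList("reformulation", "search")));
        g.put("reformulation", new HashSet<>(Arrays.asList("query")));
        g.put("search", new HashSet<>(Arrays.asList("query", "reformulation")));
        rank(g, 0.85, 20).forEach((t, s) -> System.out.printf("%s -> %.3f%n", t, s));
    }
}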
Subject Systems (number of baseline queries)
- ecf (154)
- eclipse.jdt.core (198)
- eclipse.jdt.debug (154)
- eclipse.jdt.ui (309)
- eclipse.pde.ui (302)
- log4j (28)
- sling (76)
- tomcat70 (454)
Total: 1,675
Baseline Method
- Query: Baseline queries
- QE: Query Effectiveness achieved by the baseline queries

ACER (Proposed Method)
- Query: Queries suggested by ACER
- QE: Query Effectiveness achieved by the ACER queries

ACER (Method Signature Only)
- Query: Queries suggested by ACER using method signatures only
- QE: Query Effectiveness achieved by the ACER queries

ACER (Field Signature Only)
- Query: Queries suggested by ACER using field signatures only
- QE: Query Effectiveness achieved by the ACER queries

ACER (Method Signature + Field Signature)
- Query: Queries suggested by ACER using method signatures + field signatures
- QE: Query Effectiveness achieved by the ACER queries
Queries & Goldset
- bugIDs: Issue IDs of the selected change requests
- goldset: Ground truth for the selected change requests
System Corpora & Lucene Indices
- corpus: Source code (raw + preprocessed) of the subject systems, and their file mapping indices
- lucene: Lucene index of all subject systems
ACER Prototype & External Dependencies
- acer-runner.jar: Working prototype of ACER
- ACER Source Code: ACER source can be found here
- candidates: Reformulation candidates generated by ACER
- models: Auxiliary data for a dependency library
- pp-data: Stop words for text preprocessing
- query-difficulty: Query difficulty metrics calculated for search queries
- samurai-data: Auxiliary data for token splitting with Samurai

ACER-QD-Model: Artifacts of the developed query difficulty model
- acer-qd-model-2019: ARFF model file
- acer-model-header: ARFF model header
- master-model-dataset: Query difficulty metrics dataset
- resampling-*-*: Query difficulty prediction results using resampled data

- ACER-query: Sample queries suggested by ACER
- ACER-rank: Query Effectiveness (QE) of the ACER queries
Installing, Building & Execution
- README: Tool overview, available operations and the required commands
- INSTALL: Steps of installation and system requirements for ACER
Licensing & Others
- LICENSE: Our artifacts are released under the MIT license
- CITATION: Shows how this work should be cited
- Screenshots: Contains screenshots of the available operations
Available Operations
- getBaselinePerformance: Shows the performance of the baseline queries
- getACERCandidateReformulations: Generates the reformulation candidates for the given queries
- makeQueryDifficultyModel: Constructs query difficulty models for ACER
- getACERQueries: Delivers the reformulated queries using the candidates and the query difficulty model
- getACERPerformance: Shows the document retrieval performance of the ACER queries
- getCandidatePerformance: Shows the document retrieval performance of ACER reformulation candidates
- getACERPerformanceQE: Shows the Query Effectiveness (QE) of the ACER queries
- getCandidatePerformanceQE: Shows the Query Effectiveness (QE) of ACER reformulation candidates
- -task expects a task to be performed (e.g., getBaselinePerformance)
- -K expects the number of top results to be analysed during evaluation; the default is 100
- -candidateFolder expects an existing empty folder within the home directory (e.g., test-candidates)
- -queryDiffFolder expects an existing empty folder within the home directory (e.g., test-query-difficulty)
- -queryFileKey expects a file identifier from the ACER-query folder (e.g., acer-best-Aug08)
- -candidateKey expects a candidate reformulation key (e.g., method-sig, field-sig, both-sig)
- Execute git clone https://github.com/masud-technope/ACER-Replication-Package-ASE2017.git ACER
- Run the tool (acer-runner.jar) from within the ACER directory
Please make sure that you have the queries within the baseline folder. Reformulating the given queries with ACER involves three steps: (1) Generate reformulation candidates, (2) Construct a query difficulty model, and (3) Generate the ACER queries.
Step I: Generate Reformulation Candidates
java -jar acer-runner.jar -task getACERCandidateReformulations -candidateFolder test-candidates
The above command generates four reformulation candidates for each given query: method-sig, field-sig, both-sig and base, and stores them in the test-candidates directory.
Step II: Construct Query Difficulty Models
java -jar acer-runner.jar -task makeQueryDifficultyModel -candidateFolder test-candidates -queryDiffFolder test-query-difficulty
The above command calculates query difficulty metrics for the candidate queries and stores them in the test-query-difficulty folder. It also constructs query difficulty models and collects their prediction results in the ACER-QD-Model directory.
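For intuition only: a classic pre-retrieval difficulty signal is term specificity, commonly measured as the average inverse document frequency (IDF) of the query terms, since rare terms tend to retrieve more focused results. The sketch below computes that single signal with hypothetical names; ACER's actual metric suite and model construction are those implemented in the tool.

import java.util.*;

// Illustrative pre-retrieval difficulty metric: average IDF (term specificity).
// Hypothetical helper, not the metric code used by ACER.
public class AvgIdfSketch {

    // docFreq: term -> number of documents containing it; totalDocs: corpus size
    public static double avgIdf(List<String> queryTerms,
                                Map<String, Integer> docFreq, int totalDocs) {
        if (queryTerms.isEmpty()) return 0;
        double sum = 0;
        for (String t : queryTerms) {
            int df = docFreq.getOrDefault(t, 1); // unseen terms treated as very rare
            sum += Math.log((double) totalDocs / df);
        }
        return sum / queryTerms.size();
    }

    public static void main(String[] args) {
        Map<String, Integer> df = Map.of("query", 120, "reformulation", 4);
        // a query with a rare, specific term scores higher (i.e., looks easier)
        System.out.println(avgIdf(List.of("query", "reformulation"), df, 1000));
    }
}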
Step III: Generate ACER Queries
java -jar acer-runner.jar -task getACERQueries -candidateFolder test-candidates -queryFileKey test-acer-best-query
The above command delivers the best reformulation candidates as the ACER queries using the above query difficulty model, and stores them as test-acer-best-query in the ACER-query folder.
java -jar acer-runner.jar -task getACERPerformance -K 100 -queryFileKey test-acer-best-query
The above command shows the Top-100 accuracy (a.k.a., Hit@100), MAP and MRR of the ACER queries.
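As a reminder of what these report: Hit@K is the fraction of queries whose first relevant document appears within the top K results, MRR averages the reciprocal rank of that first relevant document, and MAP averages precision over all relevant results. A standard computation, sketched with hypothetical names rather than the tool's own code:

import java.util.*;

// Standard retrieval metrics over ranked result lists (illustrative names).
public class MetricsSketch {

    // results: query -> ranked list of retrieved files; gold: query -> relevant files
    public static void report(Map<String, List<String>> results,
                              Map<String, Set<String>> gold, int k) {
        double hit = 0, rr = 0, ap = 0;
        for (String q : results.keySet()) {
            List<String> ranked = results.get(q);
            Set<String> rel = gold.get(q);
            int found = 0;
            double prec = 0;
            boolean hitAtK = false;
            for (int i = 0; i < ranked.size(); i++) {
                if (rel.contains(ranked.get(i))) {
                    found++;
                    prec += (double) found / (i + 1);   // precision at this relevant hit
                    if (found == 1) {
                        rr += 1.0 / (i + 1);            // reciprocal rank of first hit
                        if (i < k) hitAtK = true;
                    }
                }
            }
            if (hitAtK) hit++;
            if (found > 0) ap += prec / found;          // average precision for q
        }
        int n = results.size();
        System.out.printf("Hit@%d=%.2f MAP=%.2f MRR=%.2f%n", k, hit / n, ap / n, rr / n);
    }

    public static void main(String[] args) {
        Map<String, List<String>> res = Map.of("bug-1", List.of("A.java", "B.java"));
        Map<String, Set<String>> gold = Map.of("bug-1", Set.of("B.java"));
        report(res, gold, 100); // prints Hit@100=1.00 MAP=0.50 MRR=0.50
    }
}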
java -jar acer-runner.jar -task getCandidatePerformance -candidateKey method-sig -K 100
The above command shows the Top-100 accuracy (a.k.a., Hit@100), MAP and MRR of the method-sig based candidate queries.
java -jar acer-runner.jar -task getBaselinePerformance -K 100
The above command shows the Top-100 accuracy (a.k.a., Hit@100), MAP and MRR of the baseline queries.
java -jar acer-runner.jar -task getACERPerformanceQE -queryFileKey test-acer-best-query
The above command shows the percentages of baseline queries that ACER improves, worsens and preserves.
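Query Effectiveness (QE) here is the rank of the first relevant document a query returns: a reformulation improves a baseline query when it lowers that rank, worsens it when the rank rises, and preserves it otherwise. A small sketch of that comparison, again with illustrative names:

import java.util.*;

// Compares Query Effectiveness (rank of first relevant result) between
// baseline and reformulated queries (illustrative; not ACER's code).
public class QEComparisonSketch {

    public static void ratios(Map<String, Integer> baselineQE,
                              Map<String, Integer> reformulatedQE) {
        int improved = 0, worsened = 0, preserved = 0;
        for (String q : baselineQE.keySet()) {
            int diff = reformulatedQE.get(q) - baselineQE.get(q);
            if (diff < 0) improved++;       // first relevant result ranked higher
            else if (diff > 0) worsened++;  // first relevant result ranked lower
            else preserved++;
        }
        int n = baselineQE.size();
        System.out.printf("improved=%.1f%% worsened=%.1f%% preserved=%.1f%%%n",
                100.0 * improved / n, 100.0 * worsened / n, 100.0 * preserved / n);
    }

    public static void main(String[] args) {
        Map<String, Integer> base = Map.of("bug-1", 37, "bug-2", 5, "bug-3", 12);
        Map<String, Integer> acer = Map.of("bug-1", 3, "bug-2", 5, "bug-3", 20);
        ratios(base, acer); // improved=33.3% worsened=33.3% preserved=33.3%
    }
}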
java -jar acer-runner.jar -task getCandidatePerformanceQE -candidateKey method-sig
The above command shows the percentages of baseline queries that the method-sig based reformulation candidates improve, worsen and preserve.
@INPROCEEDINGS{ase2017masud,
author={Rahman, M. M. and Roy, C. K.},
booktitle={Proc. ASE},
title={Improved Query Reformulation for Concept Location using CodeRank and Document Structures},
year={2017},
pages={428--439}}
Contact: Masud Rahman (masud.rahman@usask.ca), or create an issue in this repository.