Siamese: Code Clone Search Engine
Siamese (Scalable, incremental, and multi-representation) is a code clone search system built on top of Elasticsearch. It combines several code clone detection techniques, including code normalisation, n-grams, and query reduction. It can search for Type-1 to Type-3/Type-4 clones in a large corpus of Java source code within seconds.
Build from Source:
1. Download elasticsearch-2.2.0 and extract to disk.
mkdir ~/siamese
cd ~/siamese
wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.2.0/elasticsearch-2.2.0.tar.gz
tar -xvf elasticsearch-2.2.0.tar.gz
rm elasticsearch-2.2.0.tar.gz
2. Modify the configuration file in config/elasticsearch.yml
cd elasticsearch-2.2.0
vim config/elasticsearch.yml
Add the following lines at the end of the file. Save and quit.
cluster.name: stackoverflow
index.query.bool.max_clause_count: 4096
3. Clone the project from GitHub.
cd ~/siamese
git clone https://github.com/UCL-CREST/Siamese.git
4. Install JDK and Maven
sudo apt-get install default-jdk
sudo apt-get install maven
5. Check that you can call javac.
javac
If javac does not produce any output, your JAVA_HOME is not set. Set JAVA_HOME by opening the file /etc/environment
vim /etc/environment
and adding the location of JAVA_HOME at the end of the file. You can locate JAVA_HOME with
whereis javac
ls -l <the path>
Keep following the path until you reach the real path (not a symlink) to javac.
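The symlink chase above can be done in one call. The following is a minimal sketch, assuming GNU readlink with the -f option is available (as on most Linux distributions) and that javac sits in <JAVA_HOME>/bin. The helper name find_java_home is hypothetical and not part of Siamese.

```shell
# Sketch: derive a JAVA_HOME candidate from a javac path.
# find_java_home is a hypothetical helper name, not part of Siamese.
find_java_home() {
  local javac_path real_path
  # Use the path given as the first argument, or look javac up on the PATH.
  javac_path="${1:-$(command -v javac)}" || return 1
  [ -n "$javac_path" ] || return 1
  # readlink -f follows the whole symlink chain to the real file.
  real_path="$(readlink -f "$javac_path")" || return 1
  # javac is assumed to live in <JAVA_HOME>/bin/javac, so strip two components.
  dirname "$(dirname "$real_path")"
}
```

Calling find_java_home with no argument prints the directory that could be used as the JAVA_HOME value in /etc/environment.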
5. Modify the location of elasticsearch in config.properties.
cd Siamese
vim config.properties
Set elasticsearchLoc to the directory where you extracted elasticsearch, for example:
elasticsearchLoc=/my/dir/elasticsearch-2.2.0
Save and quit.
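Before starting the service, it can help to confirm that elasticsearchLoc really points at an elasticsearch installation. The sketch below assumes the value is written as a single key=value (or key : value) line and that the installation contains bin/elasticsearch. The helper name check_es_loc is hypothetical.

```shell
# Sketch: check that elasticsearchLoc in a config file points at an
# elasticsearch installation. check_es_loc is a hypothetical helper name.
check_es_loc() {
  local config_file es_loc
  config_file="$1"
  # Take the text after "elasticsearchLoc=" (or ":") from the first matching line.
  es_loc="$(sed -n 's/^[[:space:]]*elasticsearchLoc[[:space:]]*[=:][[:space:]]*//p' "$config_file" | head -n 1)"
  if [ -z "$es_loc" ]; then
    echo "elasticsearchLoc is not set in $config_file" >&2
    return 1
  fi
  if [ ! -f "$es_loc/bin/elasticsearch" ]; then
    echo "no bin/elasticsearch under $es_loc" >&2
    return 1
  fi
  # Print the configured location on success.
  echo "$es_loc"
}
```

For example, check_es_loc config.properties prints the configured directory, or returns a non-zero status with a message on standard error.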
6. Try starting the elasticsearch service from the ~/siamese directory.
./elasticsearch-2.2.0/bin/elasticsearch
You should see an elasticsearch execution log like this.
[2018-10-02 03:50:35,305][INFO ][node ] [Warlock] version[2.2.0], pid[27101], build[8ff36d1/2016-01-27T13:32:39Z]
[2018-10-02 03:50:35,305][INFO ][node ] [Warlock] initializing ...
[2018-10-02 03:50:35,658][INFO ][plugins ] [Warlock] modules [lang-expression, lang-groovy], plugins [], sites []
[2018-10-02 03:50:35,674][INFO ][env ] [Warlock] using [1] data paths, mounts [[/ (/dev/sda2)]], net usable_space [107.8gb], net total_space [202.6gb], spins? [no], types [ext4]
[2018-10-02 03:50:35,674][INFO ][env ] [Warlock] heap size [989.8mb], compressed ordinary object pointers [true]
[2018-10-02 03:50:36,919][INFO ][node ] [Warlock] initialized
[2018-10-02 03:50:36,919][INFO ][node ] [Warlock] starting ...
[2018-10-02 03:50:36,982][INFO ][transport ] [Warlock] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2018-10-02 03:50:36,989][INFO ][discovery ] [Warlock] stackoverflow/VPfoqhukSoiP7RtKKgvYmg
[2018-10-02 03:50:40,037][INFO ][cluster.service ] [Warlock] new_master {Warlock}{VPfoqhukSoiP7RtKKgvYmg}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2018-10-02 03:50:40,063][INFO ][http ] [Warlock] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2018-10-02 03:50:40,064][INFO ][node ] [Warlock] started
[2018-10-02 03:50:40,101][INFO ][gateway ] [Warlock] recovered [0] indices into cluster_state
Then, kill the process (Ctrl+C) and start the elasticsearch engine as a background service (with the -d flag).
./elasticsearch-2.2.0/bin/elasticsearch -d
You can also test that elasticsearch is running in the background by issuing the command below.
curl -XGET 'localhost:9200/_cat/indices?v&pretty'
You should see output like this (a header line only), which means there is no index in elasticsearch yet.
health status index pri rep docs.count docs.deleted store.size pri.store.size
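The same check can be scripted. The sketch below counts the index rows in the _cat/indices?v output by skipping the header line shown above; count_indices is a hypothetical helper name, and it assumes one index per non-empty line after the header.

```shell
# Sketch: count indices in the output of `_cat/indices?v` read from stdin.
# count_indices is a hypothetical helper name.
count_indices() {
  # Skip the header line (NR > 1) and ignore blank lines (NF > 0).
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}
```

For example, piping the output of curl -s 'localhost:9200/_cat/indices?v' into count_indices should print 0 on a fresh installation.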
7. Create an executable jar and copy to the Siamese home directory
cd Siamese
mvn compile package
cp -i target/siamese-0.0.*.jar .
8. Try to execute Siamese.
java -jar siamese-0.0.6-SNAPSHOT.jar
9. The usage instructions for Siamese will be printed on the screen.
$ java -jar siamese-0.0.6-SNAPSHOT.jar
usage: (v 0.6) $java -jar siamese.jar -cf <config file> [-i input] [-o
output] [-c command] [-h help]
Example: java -jar siamese.jar -cf config.properties
Example: java -jar siamese.jar -cf config.properties -i /my/input/dir -o
/my/output/dir -c index
-c,--command <arg> [optional] command to execute [index, search].
This will override the configuration file.
-cf,--configFile <arg> [* requried *] a configuration file
-h,--help <optional> print help
-i,--inputFolder <arg> [optional] location of the input files (for
index or query). This will override the
configuration file.
-o,--outputFolder <arg> [optional] location of the search result file.
This will override the configuration file.
10. An example of running Siamese to index a project "foo".
java -jar siamese-0.0.6-SNAPSHOT.jar -c index -i /my/dir/foo -cf config.properties
11. Then, tell Siamese to search for clones of "bar" in the index of "foo".
java -jar siamese-0.0.6-SNAPSHOT.jar -c search -i /my/dir/bar -o /my/output/dir -cf config.properties
12. After Siamese finishes its execution, the output file (clone classes) will be located at /my/output/dir.
The file name follows the pattern data_qr_<timestamp>.xml.
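Because each run writes a new data_qr_<timestamp>.xml file, a small helper can pick up the most recent one. This is a sketch that relies on file modification times rather than on the timestamp format in the file name, which is not specified here; latest_result is a hypothetical helper name.

```shell
# Sketch: print the most recently modified Siamese result file in a directory.
# latest_result is a hypothetical helper name.
latest_result() {
  # List matching files newest first by modification time and keep the first.
  # Prints nothing if the directory has no data_qr_*.xml file.
  ls -t "$1"/data_qr_*.xml 2>/dev/null | head -n 1
}
```

For example, latest_result /my/output/dir prints the path of the newest result file.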
13. If you want to enforce a similarity threshold on the search results, modify the config.properties file to enable fuzzywuzzy or tokenratio (recommended) similarity.
Choose a similarity threshold for each of the four code representations (r0, r1, r2, r3), in that order.
computeSimilarity : tokenratio
simThreshold : 50%,50%,50%,50%
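A malformed simThreshold value is easy to miss, so it may be worth checking the value before a long run. The sketch below assumes the value consists of exactly four comma-separated integer percentages between 0% and 100%, as in the example above; whether Siamese accepts other forms is not covered here. The helper name check_sim_threshold is hypothetical.

```shell
# Sketch: validate a simThreshold value such as "50%,50%,50%,50%".
# Assumes four integer percentages (r0..r3); check_sim_threshold is hypothetical.
check_sim_threshold() {
  local pct
  # Require exactly four comma-separated values of the form <digits>%.
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}%,){3}[0-9]{1,3}%$' || return 1
  # Reject any value above 100.
  for pct in $(printf '%s' "$1" | tr -d '%' | tr ',' ' '); do
    [ "$pct" -le 100 ] || return 1
  done
  return 0
}
```

The function returns 0 for a well-formed value and 1 otherwise, so it can guard a run, for example check_sim_threshold '50%,50%,50%,50%' && java -jar siamese-0.0.6-SNAPSHOT.jar -cf config.properties.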
Downloads
Executable Tool (JAR file):
Siamese: The Siamese executable can be downloaded here: Siamese v. 0.6. Please make sure you have Java 8 installed on your machine.
1. To execute Siamese, unzip the file and follow the steps below:
cd siamese
./elasticsearch-2.2.0/bin/elasticsearch -d
java -jar siamese-0.0.5-SNAPSHOT.jar
Then you'll see the usage and example of how to use Siamese.
usage: (v 0.5) $java -jar siamese.jar -cf <config file> [-i input] [-o
output] [-c command] [-h help]
Example: java -jar siamese.jar -cf config.properties
Example: java -jar siamese.jar -cf config.properties -i /my/input/dir -o
/my/output/dir -c index
-c,--command <arg> [optional] command to execute [index, search].
This will override the configuration file.
-cf,--configFile <arg> [* requried *] a configuration file
-h,--help <optional> print help
-i,--inputFolder <arg> [optional] location of the input files (for
index or query). This will override the
configuration file.
-o,--outputFolder <arg> [optional] location of the search result file.
This will override the configuration file.
2. An example of running Siamese to index a project "foo".
java -jar siamese-0.0.6-SNAPSHOT.jar -c index -i /my/dir/foo -cf config.properties
3. Then, tell Siamese to search for clones of "bar" in the index of "foo".
java -jar siamese-0.0.6-SNAPSHOT.jar -c search -i /my/dir/bar -o /my/output/dir -cf config.properties
4. After Siamese finishes its execution, the output file (clone classes) will be located at /my/output/dir. The file name follows the pattern data_qr_<timestamp>.xml.
5. If you want to enforce a similarity threshold on the search results, modify the config.properties file to enable fuzzywuzzy or tokenratio (recommended) similarity. Choose a similarity threshold for each of the four code representations (r0, r1, r2, r3), in that order.
computeSimilarity : tokenratio
simThreshold : 50%,50%,50%,50%
BigCloneEval: BigCloneEval is a tool for automated recall evaluation based on the BigCloneBench data set. It can be downloaded from: BigCloneBench
Data sets: the data sets that we used to evaluate Siamese are listed below:
- OCD (Obfuscation/Compilation/Decompilation) data set. The OCD data set is from a study by Ragkhitwetsagul et al. and can be found here: OCD data set.
- SOCO (SOurce COde Re-use) data set. The SOCO data set was created for a competition on detecting source code reuse and can be downloaded here: SOCO. However, Ragkhitwetsagul et al. found and fixed some issues in its clone oracle. Please download the corrected clone oracle from: Fixed clone oracle.
- BigCloneBench data set. BigCloneBench was created by Svajlenko et al. and is one of the largest clone benchmarks available to date. It was built from IJaDataset 2.0, which comprises 25,000 Java systems. The benchmark contains 2.8 million files with 8 million manually validated clone pairs of Type-1 up to Type-4. The data set and the clone oracle can be downloaded here: BigCloneBench.
- GitHub data set. We used 16,738 and 130,719 GitHub Java projects to evaluate Siamese's precision and incremental update module. Since the projects are all open source, you can download the GitHub projects from GitHub directly. The list of the projects we used can be found below:
- 10 highest-voted Stack Overflow code snippets. We reused the code snippets from Kim et al.'s study. The 10 code queries from the 10 highest-voted Stack Overflow code snippets can be found here: FaCoy website.
Additional Evaluation Results:
- RQ2: Comparison with Code Search Tools. Due to limited space, the paper does not include all the results from using the 10 highest-voted Stack Overflow posts, so we include them here.
- The full search results can be found here
How to read the results: Each entry in the Siamese search results has three parts: (1) file path, (2) method name, and (3) starting and ending line numbers.
For example, a clone pair of
10_so/299495_0.java_paintComponent#22#26
and
mattibal/meshnet/MeshNetBase/src/com/mattibal/meshnet/utils/color/gui/LabChooserJFrame.java_paintComponent#89#95
means that the method paintComponent in the file 10_so/299495_0.java from line 22 to line 26 is a clone of the method paintComponent in the file mattibal/meshnet/MeshNetBase/src/com/mattibal/meshnet/utils/color/gui/LabChooserJFrame.java from line 89 to line 95.
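The entry format described above can be split mechanically. The sketch below assumes every entry has the shape <file path>.java_<method name>#<start>#<end>, as in the two examples; parse_clone_id is a hypothetical helper name and not part of Siamese.

```shell
# Sketch: split a result entry into file path, method name, start line, end line.
# Assumes the shape <path>.java_<method>#<start>#<end>; parse_clone_id is hypothetical.
parse_clone_id() {
  local id rest file method start end
  id="$1"
  end="${id##*#}"              # text after the last '#'
  rest="${id%#*}"              # drop "#<end>"
  start="${rest##*#}"          # text after the remaining last '#'
  rest="${rest%#*}"            # drop "#<start>"
  method="${rest##*.java_}"    # text after the last ".java_"
  file="${rest%.java_*}.java"  # text before the last ".java_", suffix restored
  # Print the four fields separated by tabs.
  printf '%s\t%s\t%s\t%s\n' "$file" "$method" "$start" "$end"
}
```

For example, parse_clone_id '10_so/299495_0.java_paintComponent#22#26' prints the file 10_so/299495_0.java, the method paintComponent, and the lines 22 and 26 as four tab-separated fields.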
Contact:
If you have any questions or find any issues, please contact Chaiyong Ragkhitwetsagul at cragkhit [at] gmail [dot] com or Jens Krinke at j.krinke [at] ucl [dot] ac [dot] uk.