/CloneWorks

Primary LanguageHTMLOtherNOASSERTION

CloneWorks

Please send your questions and bug reports to: jeff.svajlenko@gmail.com. I am happy to assist you in using CloneWorks.

Please refer to the publication for additional details. A link to the paper is provided at jeff.svajlenko.com/cloneworks. There is also a tool demonstration video provided.

Setup

CloneWorks can be used on Linux and other unix-like systems. It has been tested on the latest Ubuntu and OS X. It should also work on Cygwin, but the input builder is slower due to process creation being slower on Windows. We plan to support Windows 10 "redstone" update with its native Ubuntu kernel.

CloneWorks requires FreeTXL 10.6 or later, which can be download from: http://www.txl.ca/ It should be installed at a system level. If installed at a user-level, you will need to add txl on your $PATH variable.

CloneWorks also requires the following: - gcc, make (build-essential on ubuntu) - java (7+)

Run ./build from the install directory to compile.

CloneWorks requires configuration for your available memory. By defualt the input builder is set to use up to 2GB of memory, and the clone detector up to 4GB of memory. These can be modified by editing the cwbuild and cwdetect scripts. The input builder is probably fine with 2GB unless you receive an out of memory error from the JVM. The clone detector should be set for your available memory, leaving room for system usage.

By defualt, CloneWorks must be executed from its installation directory. To run CloneWorks from another directory, the cwbuild and cwdetect scripts must be modified with 'INSTALL_DIR=...' set to the install directory (absolute path). If the installation directory is specified, cwbuild and cwdetect can also be added to the PATH variable for execution anywhere.

The TXL scripts need to be compiled for your computer. Enter the txl/ directory and run make.

Testing CloneWorks

Once installed, CloneWorks can be tested by executing the following commands in the installation directory:

./cwbuild -i example/JHotDraw54b1/ -f example/JHotDraw.files -b example/JHotDraw.fragments -l java -g function -c type3_conservative ./cwdetect -i example/JHotDraw.fragments -o example/JHotDraw.clones -s 0.7

This performs Type-3 clone detection on JHotDraw. The clones are in the file example/JHotDraw.clones, the code fragments are in example/JHotDraw.fragments, and the mapping of ID to source-file paths are in example/JHotDraw.files.

Using CloneWorks

The input builder (cwbuild) is used to prepare the source code for clone detection, which is performed by the clone detector (cwdetect).

Input Builder

The input builder has the following usage:

usage: cwbuild -i -f -b -l <str [str2 st3 ...]> -g -c [-t ] [-mil ] [-mal ] [-mit ] [-mat ] CloneWorks-InputBuilder - CloneWorks input builder. -i,--input Input subject system. Either the root directory of the system, or a file containing a list of source file paths. -f,--fileids File to write the FileID<->FilePath mapping to. -b,--blocks File to write the parsed blocks to. -l,--language <str [str2 st3 ...]> The language of the input system. One of: {java,c,cpp,cs,python}. -g,--granularity The block granularity. One of: {block,function,file}. -c,--configuration The configuration file. Name of a file in INSTALL_DIR/config/. -t,--num-threads The number of threads to use per execution task. Default is number of available cores. -mil,--min-lines Minimum code fragment size in (original) source lines. -mal,--max-lines Maximum code fragment size in (original) source lines. -mit,--min-tokens Minimum code fragment size in (original, pre-processing) parsed tokens. -mat,--max-tokens Maximum code fragment size in (original, pre-processing) parsed tokens.

You must specify the directory of source code to process, a file to output a mapping of the source file paths to IDs, and a file to output the code fragments to. The language(s) of the soruce files of interest must be specified, as well as the code granularity for extraction. The configuration file from the config/ directory needs to also be supplied.

Optionally, the minimum and maximum size of the code fragments to consider (by source line and terms) can also be specified. The target number of threads per task can also be provided, by default the number of available cores is used.

The input builder's code-fragment processors, term splitting and term processors are specified in the configuration file. The name of the configuration file must be provided. The configuration files are stored in the config/ directory, including a template and some sample configurations for generic clone detection with the first three clone types.

For information on implementing your own code-fragment processors and term processors please see the documentation and examples in the cfprocessors/ and termprocessors/ directory.

Clone Detector

The clone detector has the following usage:

usage: cwdetect -i -o -s [-p ] [-t ] [-mil ] [-mal ] [-mit ] [-mat ] [-ps] Clone detection with CloneWorks. -i,--input File containing blocks produced by thrifty-builder. -o,--output File to output clones to. -s,--min-similarity Minimum clone similarity. -p,--partition-mode Run using index-partitioning mode with the specified maximum partition size in code blocks. Use when index size exhausts available memory. -t,--num-threads Number of execution threads to use per task. Defaults to number of cores. -mil,--min-lines Minimum clone size in (original) source lines. -mal,--max-lines Maximum clone size in (original) source lines. -mit,--min-tokens Minimum clone size in (original, pre-processing) parsed tokens. -mat,--max-tokens Maximum clone size in (original, pre-processing) parsed tokens. -ps,--pre-sorted Indicates input blocks are pre-sorted, and GTF-sorting should be skipped. This makes clone detection with pre-sorted blocks more efficient. If the blocks are not pre-sorted, or are sorted incorrectly, this can cause false negatives. Skipping sorting on pre-sorted blocks can improve performance.

You must specify the code fragment file to detect clones within, a file to output the clones to, and the minimum clone similarity threshold in range [0.0,1.0]. Optionally, the minimum and maximimum clone sizes can be specified.

If there are too many code fragments to fit in memory, then the partitioned partial indexes pappraoch can be eanbled with the "-p" flag, including the maximum number of code fragments in each partition. This number needs to be found experimentally. However, the first step is to load the code fragmnets into memory, so an out of memory exception should be encountered early if it is going to happen.

In our expirence, we found we can comfortably load 500000 code fragments into memory given a Java heap size of 8GB, with actual usage not exceeding approx. 6GB. This option is not needed for small inputs.

The "-ps" flag can be used to skip sorting the blocks. This is for use when the code-fragments are pre-sorted, or where the code fragments only have a single term (as is the case with our Type-1 and Type-2 configurations). Sorting the code fragment's terms is not the expensive part of clone detection, so its safer just to skip this flag unless you know your blocks are sorted properly and need the small performance boost by skipping sorting.

Clones are found in the output file in the following format:

10,40,50,20,60,70

This indicates a clone pair between lines 40-50 (inclusive) in source file 10 and lins 60-70 (inclusive) in source file 20. The mapping of IDs to source paths are in the fileids file produced by the input builder.