The goal of this work is to evaluate the GPU techniques proposed for exact set similarity join and compare them to the best CPU standalone alternative.
Exact set similarity join is a notoriously expensive operation, for which several solutions have been proposed. Recently, there have been studies that present a comparative analysis using MapReduce or a non-parallel setting. Our contribution is that we complement these works through conducting a thorough evaluation of the state-of-the-art GPU-enabled techniques. These techniques are highly diverse in their key features and our experiments manage to reveal the key strengths of each one. As we explain, in real-life applications, there is neither a dominating solution nor a solution without its own sweet spot including not using the GPU at all. All our work is repeatable and extensible.
CPU standalone: http://ssjoin.dbresearch.uni-salzburg.at/
CPU-GPU: https://github.com/chribell/gpussjoin/
gSSJoin: this repo
fgSSJoin: this repo
sf-gSSJoin: this repo
bitmap: this repo
cd (gssjoin|fgssjoin|sf-gssjoin|bitmap) && mkdir release && cd release
#for compute capability 6.1
CC=gcc-5 CXX=g++-5 cmake -DCMAKE_BUILD_TYPE=Release -DSM_ARCH=61 CMAKE_CXX_FLAGS=-m64 ..
./gssjoin <input_file> <threshold> <output_file> <number_of_gpus> <aggregate>
<input_file> file, each line a record
<threshold> normalized threshold
<output_file> file to output pairs (if not aggregate)
<number_of_gpus> the number of gpus to be involved in the set similarity join
<aggregate> flag if to perform an aggregation on top of the join (0 | 1)
./fgssjoin <input_file> <threshold> <output_file> <number_of_gpus> <size of blocks> <aggregate>
<input_file> file, each line a record
<threshold> normalized threshold
<output_file> file to output pairs (if not aggregate)
<number_of_gpus> the number of gpus to be involved in the set similarity join
<size of blocks> the size of each input collection partition block
<aggregate> flag if to perform an aggregation on top of the join (0 | 1)
./sfgssjoin <input_file> <threshold> <output_file> <number_of_gpus> <size of blocks> <aggregate>
<input_file> file, each line a record
<threshold> normalized threshold
<output_file> file to output pairs (if not aggregate)
<number_of_gpus> the number of gpus to be involved in the set similarity join
<size of blocks> the size of each input collection partition block
<aggregate> flag if to perform an aggregation on top of the join (0 | 1)
./bitmap --input <input_file> --threshold <threshold> --bitmap <bitmap>
<input_file> file, each line a record
<threshold> normalized threshold
<bitmap> bitmap signature size (bits)
The datasets and the preprocess scripts can be found at http://ssjoin.dbresearch.uni-salzburg.at/datasets.html