A CUDA implementation of UShapelet-based Clustering. Completed as a project for CSCI-5551 (Parallel and Distributed Systems).
Requires Python 3 with the following dependencies:
numpy
pandas
scikit-learn
pycuda
matplotlib
To use the program, run `python3 Driver.py <input_file> <cuda_flag>`. `<cuda_flag>` should be `1` to enable CUDA, or any other value for the sequential version of the algorithm. I have added two sample input files, `Trace.txt` and `FourClasses.txt`, to the repo.
I originally identified three areas for parallelization:
- SAX Word Hashing
- Hash Collision Checking
- Gap Score Computation
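To make the first target concrete, here is a rough NumPy sketch of PAA reduction followed by SAX discretization. The breakpoints are the standard Gaussian breakpoints for a 4-symbol alphabet; the function names are illustrative, not the ones in this repo:

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    segments = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([seg.mean() for seg in segments])

def sax_word(series, n_segments=4, breakpoints=(-0.6745, 0.0, 0.6745)):
    """Z-normalize, reduce with PAA, then map each segment mean to a symbol.

    The breakpoints split the standard normal into 4 equiprobable regions,
    giving a 4-letter alphabet (symbols 0..3).
    """
    series = np.asarray(series, dtype=float)
    z = (series - series.mean()) / series.std()
    means = paa(z, n_segments)
    # searchsorted maps each segment mean to its alphabet bin.
    return tuple(np.searchsorted(breakpoints, means))

word = sax_word([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4)  # monotone ramp -> (0, 1, 2, 3)
```

Each subsequence's SAX word is computed independently, which is what makes this stage a natural fit for one-thread-per-subsequence parallelization.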
I was able to parallelize two of my three original targets (SAX Word Hashing and Gap Score Computation). I was not able to complete the Hash Collision Checking parallelization in time. However, my CUDA implementation runs much faster than the sequential version, even without the collision checking parallelization. This is because the sequential implementation spends the majority of its time performing Gap Score computations, which are much faster in the CUDA implementation.
`Driver.py`
- This is the main file used to run the program.

`UShapelet.py`
- This file contains the implementations for UShapelet computation. I use a flag `use_cuda` to determine which sections of the code run sequentially and which parts run in parallel. Setting the `<cuda_flag>` parameter on the command line toggles this on or off for all functions in the code.
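A minimal sketch of this dispatch pattern (the function and helper names here are hypothetical, not the actual project functions):

```python
import numpy as np

def gap_scores(candidates, data, use_cuda=False):
    """Dispatch between the CUDA and sequential code paths.

    `use_cuda` mirrors the <cuda_flag> command-line parameter; the GPU
    branch is a placeholder for what would be a call into the CUDA helpers.
    """
    if use_cuda:
        # In the real code this would call into the PyCUDA wrappers/kernels.
        raise NotImplementedError("GPU path requires PyCUDA and a device")
    # Sequential fallback: plain NumPy.
    return [np.abs(data - c).sum() for c in candidates]

scores = gap_scores([0.0, 1.0], np.array([0.0, 1.0, 2.0]), use_cuda=False)
```

Keeping both paths behind one flag makes it easy to verify that the CUDA and sequential versions produce the same clustering.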
`cuda_helper.py`
- This file wraps some of the boilerplate code for CUDA and PyCUDA setup. The functions here allocate memory and copy parameters to the GPU before calling the CUDA kernels.

`cuda_modules.py`
- This file contains the actual CUDA kernels used to perform each computation. There are two kernels for accelerating the PAA/SAX process and the Gap Score process, called `compute_PAA` and `compute_sdist` respectively.
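The kernel sources themselves aren't shown here, but the computation `compute_sdist` parallelizes can be sketched as a CPU reference: the minimum z-normalized Euclidean distance from a candidate shapelet to every same-length sliding window of a series. On the GPU, each thread can evaluate one window. (This is my reading of the standard subsequence-distance definition, not a copy of the actual kernel.)

```python
import numpy as np

def znorm(x):
    """Z-normalize, guarding against constant (zero-variance) windows."""
    x = np.asarray(x, dtype=float)
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def sdist(shapelet, series):
    """Minimum length-normalized Euclidean distance from `shapelet`
    to any same-length subsequence of `series`.

    CPU reference for the work the compute_sdist kernel distributes:
    each sliding window is independent, so one GPU thread per window.
    """
    m = len(shapelet)
    s = znorm(shapelet)
    best = np.inf
    for i in range(len(series) - m + 1):
        w = znorm(series[i:i + m])
        d = np.sqrt(np.sum((s - w) ** 2) / m)
        best = min(best, d)
    return best
```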
This program uses a Rand Index score to compute the quality of the clustering generated by the algorithm. The Rand Index measures the similarity between two clusterings. In our case, both sample files come with cluster labels, so the Rand Index between the "true" labels and the algorithm's labels is a good measure of the accuracy of our algorithm. After each shapelet is extracted, the program prints out the time taken (in seconds) to complete each section of the program.
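Computing the Rand Index is a one-liner with scikit-learn (already a dependency); `rand_score` is available in scikit-learn 0.24 and later:

```python
from sklearn.metrics import rand_score

# Labels only need to agree up to permutation: [0, 0, 1, 1] and
# [1, 1, 0, 0] describe the same clustering, so the score is 1.0.
true_labels = [0, 0, 1, 1]
predicted = [1, 1, 0, 0]
score = rand_score(true_labels, predicted)  # -> 1.0
```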
For `FourClasses.txt` the algorithm clusters perfectly, with a Rand Index of 1. For `Trace.txt` the algorithm gets a Rand Index of ~0.71.
The following charts show the speedup gained from the parallelization. For this section, tests were run using the `Trace.txt` input file, as it is the larger of the two sample inputs.

As expected, the Gap Score computation is the largest contributor to the total program runtime. For `Trace.txt`, the sequential version takes almost 30 seconds to perform the Gap Score computation, while the CUDA version performs the same operation in less than one second.
In the CUDA version, the Hash Collision check is now the longest operation, with an average time of ~3 seconds per shapelet. This is a large improvement over the sequential version.
Overall, I think that parallelizing this algorithm with CUDA was a good idea. Even though I was not able to complete all three targets for my parallelization, the current version of my algorithm vastly outperforms the sequential version, with no loss in cluster quality.
In the future, I would like to parallelize the Hash Collision Checking portion of the program, so that all three of my parallelization targets will be complete. I would also like to investigate alternatives for computing the Gap Score. One method used by the authors for computing Gap Score involves taking a "fast" dot product between two vectors using FFT and Inverse FFT. GPUs can perform these operations quickly, so it could be possible to create an algorithm that performs Gap Score computation faster than my current implementation.
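The FFT trick can be sketched in NumPy: the sliding dot products of a query against every window of a series are a cross-correlation, which FFT computes in O(n log n) instead of the naive O(nm). This is the part a GPU FFT library (e.g. cuFFT) could accelerate; the function name below is mine.

```python
import numpy as np

def sliding_dot_products(query, series):
    """All dot products <query, series[i:i+m]> via FFT cross-correlation.

    Equivalent to the naive per-window loop, but O(n log n): correlate by
    convolving the series with the reversed query in the frequency domain.
    """
    query = np.asarray(query, dtype=float)
    series = np.asarray(series, dtype=float)
    n, m = len(series), len(query)
    # Cross-correlation = convolution with the reversed query.
    prods = np.fft.irfft(np.fft.rfft(series, 2 * n) *
                         np.fft.rfft(query[::-1], 2 * n), 2 * n)
    # Valid windows occupy indices m-1 .. n-1 of the full convolution.
    return prods[m - 1:n]

# Agrees with the naive computation:
naive = [np.dot([1.0, 2.0], [3.0, 1.0, 2.0, 4.0][i:i + 2]) for i in range(3)]
fast = sliding_dot_products([1.0, 2.0], [3.0, 1.0, 2.0, 4.0])
```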