A CUDA implementation of UShapelet-based Clustering. Completed as a project for CSCI-5551 (Parallel and Distributed Systems).
Requires Python 3 with the following dependencies:
numpy
pandas
scikit-learn
pycuda
matplotlib
To use the program, run `python3 Driver.py <input_file> <cuda_flag>`. `<cuda_flag>` should be `1` to enable CUDA, or any other value for the sequential version of the algorithm. I have added two sample input files, `Trace.txt` and `FourClasses.txt`, to the repo.
I originally identified three areas for parallelization:
- SAX Word Hashing
- Hash Collision Checking
- Gap Score Computation
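To make the first target concrete, here is a rough NumPy sketch of PAA reduction followed by SAX discretization. The breakpoints are the standard Gaussian breakpoints for a 4-symbol alphabet; the function names are illustrative, not the ones in this repo:

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    segments = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([seg.mean() for seg in segments])

def sax_word(series, n_segments=4, breakpoints=(-0.6745, 0.0, 0.6745)):
    """Z-normalize, reduce with PAA, then map each segment mean to a symbol.

    The breakpoints split the standard normal into 4 equiprobable regions,
    giving a 4-letter alphabet (symbols 0..3).
    """
    series = np.asarray(series, dtype=float)
    z = (series - series.mean()) / series.std()
    means = paa(z, n_segments)
    # searchsorted maps each segment mean to its alphabet bin.
    return tuple(np.searchsorted(breakpoints, means))

word = sax_word([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4)  # monotone ramp -> (0, 1, 2, 3)
```

Each subsequence's SAX word is computed independently, which is what makes this stage a natural fit for one-thread-per-subsequence parallelization.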
I was able to parallelize two of my three original targets (SAX Word Hashing and Gap Score Computation). I was not able to complete the Hash Collision Checking parallelization in time. However, my CUDA implementation runs much faster than the sequential version, even without the collision checking parallelization. This is because the sequential implementation spends the majority of its time performing Gap Score computations, which are much faster in the CUDA implementation.
`Driver.py`
- This is the main file used to run the program.

`UShapelet.py`
- This file contains the implementations for UShapelet computation. I use a flag `use_cuda` to determine which sections of the code run sequentially and which parts run in parallel. Setting the `<cuda_flag>` parameter on the command line toggles this on or off for all functions in the code.
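A minimal sketch of this dispatch pattern (the function and helper names here are hypothetical, not the actual project functions):

```python
import numpy as np

def gap_scores(candidates, data, use_cuda=False):
    """Dispatch between the CUDA and sequential code paths.

    `use_cuda` mirrors the <cuda_flag> command-line parameter; the GPU
    branch is a placeholder for what would be a call into the CUDA helpers.
    """
    if use_cuda:
        # In the real code this would call into the PyCUDA wrappers/kernels.
        raise NotImplementedError("GPU path requires PyCUDA and a device")
    # Sequential fallback: plain NumPy.
    return [np.abs(data - c).sum() for c in candidates]

scores = gap_scores([0.0, 1.0], np.array([0.0, 1.0, 2.0]), use_cuda=False)
```

Keeping both paths behind one flag makes it easy to verify that the CUDA and sequential versions produce the same clustering.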
`cuda_helper.py`
- This file wraps some of the boilerplate code for CUDA and PyCUDA setup. The functions here allocate memory and copy parameters to the GPU before calling the CUDA kernels.

`cuda_modules.py`
- This file contains the actual CUDA kernels used to perform each computation. There are two kernels for accelerating the PAA/SAX process and the Gap Score process, called `compute_PAA` and `compute_sdist` respectively.
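The kernel sources themselves aren't shown here, but the computation `compute_sdist` parallelizes can be sketched as a CPU reference: the minimum z-normalized Euclidean distance from a candidate shapelet to every same-length sliding window of a series. On the GPU, each thread can evaluate one window. (This is my reading of the standard subsequence-distance definition, not a copy of the actual kernel.)

```python
import numpy as np

def znorm(x):
    """Z-normalize, guarding against constant (zero-variance) windows."""
    x = np.asarray(x, dtype=float)
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def sdist(shapelet, series):
    """Minimum length-normalized Euclidean distance from `shapelet`
    to any same-length subsequence of `series`.

    CPU reference for the work the compute_sdist kernel distributes:
    each sliding window is independent, so one GPU thread per window.
    """
    m = len(shapelet)
    s = znorm(shapelet)
    best = np.inf
    for i in range(len(series) - m + 1):
        w = znorm(series[i:i + m])
        d = np.sqrt(np.sum((s - w) ** 2) / m)
        best = min(best, d)
    return best
```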
This program uses a Rand Index score to compute the quality of the clustering generated by the algorithm. The Rand Index measures the similarity between two clusterings. In our case, both sample files come with cluster labels, so the Rand Index between the "true" labels and the algorithm's labels is a good measure of the accuracy of our algorithm. After each shapelet is extracted, the program prints out the time taken (in seconds) to complete each section of the program.
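Computing the Rand Index is a one-liner with scikit-learn (already a dependency); `rand_score` is available in scikit-learn 0.24 and later:

```python
from sklearn.metrics import rand_score

# Labels only need to agree up to permutation: [0, 0, 1, 1] and
# [1, 1, 0, 0] describe the same clustering, so the score is 1.0.
true_labels = [0, 0, 1, 1]
predicted = [1, 1, 0, 0]
score = rand_score(true_labels, predicted)  # -> 1.0
```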
For `FourClasses.txt` the algorithm clusters perfectly, with a Rand Index of 1. For `Trace.txt` the algorithm gets a Rand Index of ~0.71.
The following charts show the speedup gained from the parallelization. For this section, tests were run using the `Trace.txt` input file, as it is the larger of the two sample inputs.

As expected, the Gap Score computation is the largest contributor to the total program runtime. For `Trace.txt`, the sequential version takes almost 30 seconds to perform the Gap Score computation, while the CUDA version performs the same operation in less than one second.
In the CUDA version, the Hash Collision check is now the longest operation, with an average time of ~3 seconds per shapelet. This is a large improvement over the sequential version.
Overall, I think that parallelizing this algorithm with CUDA was a good idea. Even though I was not able to complete all three targets for my parallelization, the current version of my algorithm vastly outperforms the sequential version, with no loss in cluster quality.
In the future, I would like to parallelize the Hash Collision Checking portion of the program, so that all three of my parallelization targets will be complete. I would also like to investigate alternatives for computing the Gap Score. One method used by the authors for computing Gap Score involves taking a "fast" dot product between two vectors using FFT and Inverse FFT. GPUs can perform these operations quickly, so it could be possible to create an algorithm that performs Gap Score computation faster than my current implementation.
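The FFT trick can be sketched in NumPy: the sliding dot products of a query against every window of a series are a cross-correlation, which FFT computes in O(n log n) instead of the naive O(nm). This is the part a GPU FFT library (e.g. cuFFT) could accelerate; the function name below is mine.

```python
import numpy as np

def sliding_dot_products(query, series):
    """All dot products <query, series[i:i+m]> via FFT cross-correlation.

    Equivalent to the naive per-window loop, but O(n log n): correlate by
    convolving the series with the reversed query in the frequency domain.
    """
    query = np.asarray(query, dtype=float)
    series = np.asarray(series, dtype=float)
    n, m = len(series), len(query)
    # Cross-correlation = convolution with the reversed query.
    prods = np.fft.irfft(np.fft.rfft(series, 2 * n) *
                         np.fft.rfft(query[::-1], 2 * n), 2 * n)
    # Valid windows occupy indices m-1 .. n-1 of the full convolution.
    return prods[m - 1:n]

# Agrees with the naive computation:
naive = [np.dot([1.0, 2.0], [3.0, 1.0, 2.0, 4.0][i:i + 2]) for i in range(3)]
fast = sliding_dot_products([1.0, 2.0], [3.0, 1.0, 2.0, 4.0])
```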