Implemented Levenshtein distance on a CUDA device
This implementation uses simple Python lists. The DP algorithm progressively fills the edit-distance (ED) matrix and arrives at the solution. However, all the steps in this algorithm are highly sequential, so it spends a lot of time in processing.
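As a reference point, here is a minimal sketch of that sequential DP (the function name and variables are illustrative, not the author's code):

```python
def edit_distance(a, b):
    """Classic O(m*n) DP: D[i][j] = edit distance between a[:i] and b[:j]."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        D[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):            # each cell waits on its three
        for j in range(1, n + 1):        # already-filled neighbours
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution / match
    return D[m][n]
```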
The bottleneck of the previous implementation is this sequential dependency: a cell in a given row and column can only be computed after the cells above it, to its left, and diagonally above-left have been computed.
In each step, we perform multiple computations:
- taking the cell above, incremented by 1 (deletion)
- taking the diagonally top-left cell, incremented by 1 if the corresponding characters differ (substitution)
- taking the cell to the left, incremented by 1 (insertion)

When computing a cell in a row, the sequential dependency only kicks in when reading the value from the left. So we can isolate this part of the computation until necessary, while performing the other calculations in parallel/bulk without any restrictions.
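One way to see this idea concretely (a NumPy sketch under my own assumptions, not the author's code): the top and diagonal contributions for a whole row can be computed in bulk, and the deferred left dependency `cur[j] = min(cand[j], cur[j-1] + 1)` unrolls to `min over k <= j of (cand[k] + (j - k))`, which is a prefix-minimum over `cand[k] - k`:

```python
import numpy as np

def edit_distance_rowwise(a, b):
    n = len(b)
    prev = np.arange(n + 1)                    # row 0
    for i, ca in enumerate(a, 1):
        cost = np.array([ca != cb for cb in b], dtype=int)
        cand = np.empty(n + 1, dtype=int)
        cand[0] = i                            # first column of row i
        # bulk part: top (deletion) and diagonal (substitution) candidates,
        # computed for the whole row with no intra-row dependency
        cand[1:] = np.minimum(prev[1:] + 1, prev[:-1] + cost)
        # deferred left (insertion) dependency as a prefix-minimum scan
        j = np.arange(n + 1)
        prev = np.minimum.accumulate(cand - j) + j
    return int(prev[n])
```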
### 1.3 GPU implementation of ED (computing diagonal-wise)
Row-wise computation vs diagonal-wise computation:
```
>>> [0 1 2 3 4 5 6 7 8 9]
>>> [1 0 1 2 3 4 5 6 7 8]
>>> [2 1 1 2 3 3 4 5 6 7]
>>> [3 2 1 1 2 3 3 4 5 6]
>>> [4 3 2 1 2 3 3 4 5 6]
>>> [5 4 3 2 2 3 4 4 4 5]
>>> [6 5 4 3 3 2 3 4 5 5]
>>> [7 6 5 4 4 3 3 3 4 5]
>>> [8 7 6 5 5 4 4 4 3 4]
[8 7 6 5 5 4 4 4 3 4]
```
```
>>> [0]
>>> [1 1]
>>> [2 0 2]
>>> [3 1 1 3]
>>> [4 2 1 2 4]
>>> [5 3 1 2 3 5]
>>> [6 4 2 1 3 4 6]
>>> [7 5 3 1 2 3 5 7]
>>> [8 6 4 2 2 3 4 6 8]
>>> [7 5 3 2 3 3 5 7 9]
>>> [6 4 3 3 3 4 6 8]
>>> [5 4 2 4 4 5 7]
>>> [5 3 3 4 5 6]
>>> [4 3 4 4 6]
>>> [4 3 5 5]
>>> [4 4 5]
>>> [3 5]
>>> [4]
[8 7 6 5 5 4 4 4 3 4]
```
In the GPU implementation, the diagonals are computed one after another, but all cells within a diagonal are independent, so an entire diagonal is computed at once by a CUDA kernel.
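A CPU sketch of that diagonal-wise schedule (an illustration of the scheme, not the actual kernel): each iteration of the inner loop is independent of the others, so on the GPU each would map to one CUDA thread.

```python
def edit_distance_diagonal(a, b):
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(m + n + 1):                 # antidiagonals, in order
        # cells on antidiagonal d only read diagonals d-1 and d-2,
        # so this inner loop is embarrassingly parallel (one thread per cell)
        for i in range(max(0, d - n), min(m, d) + 1):
            j = d - i
            if i == 0:
                D[i][j] = j
            elif j == 0:
                D[i][j] = i
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,
                              D[i][j - 1] + 1,
                              D[i - 1][j - 1] + cost)
    return D[m][n]
```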
On an NVIDIA Tesla T4, the CUDA implementation is much faster (10-15x for sequences of length > 100k) than the CPU implementations:
The graph is in log scale: