CUDA-Warp RNN-Transducer

A GPU implementation of RNN Transducer (Graves 2012, 2013). This code is ported from the reference implementation (by Awni Hannun) and fully utilizes the CUDA warp mechanism.

The main bottleneck in the loss is a forward/backward pass, which based on the dynamic programming algorithm. In particular, there is a nested loop to populate a lattice with shape (T, U), and each value in this lattice depend on the two previous cells from each dimension (e.g. forward pass).

CUDA executes threads in groups of 32 parallel threads called warps. Full efficiency is realized when all 32 threads of a warp agree on their execution path. This is exactly what is used to optimize the RNN Transducer. The lattice is split into warps in the T dimension. In each warp, variables between threads exchanged using a fast operations. As soon as the current warp fills the last value, the next two warps (t+32, u) and (t, u+1) are start running. A schematic procedure for the forward pass is shown in the figure below, where T - number of frames, U - number of labels, W - warp size. The similar procedure for the backward pass runs in parallel.

Performance

Benchmarked on a GeForce GTX 1080 Ti GPU, Intel i7-8700 CPU @ 3.20GHz.

	warp_rnnt	warprnnt_pytorch	transducer
T=150, U=40, V=28
N=1	0.07 ms	0.68 ms	1.28 ms
N=16	0.33 ms	1.80 ms	6.15 ms
N=32	0.35 ms	3.39 ms	12.72 ms
N=64	0.56 ms	6.11 ms	23.73 ms
N=128	0.60 ms	9.22 ms	47.93 ms
T=150, U=20, V=5000
N=1	0.46 ms	2.14 ms	21.18 ms
N=16	1.42 ms	21.24 ms	240.11 ms
N=32	2.51 ms	38.26 ms	490.66 ms
N=64	out-of-memory	75.54 ms	944.73 ms
N=128	out-of-memory	out-of-memory	1894.93 ms
T=1500, U=300, V=50
N=1	0.60 ms	10.77 ms	121.82 ms
N=16	2.25 ms	97.69 ms	732.50 ms
N=32	3.97 ms	184.73 ms	1448.54 ms
N=64	out-of-memory	out-of-memory	2767.59 ms

Note

This implementation assumes that the input is log_softmax.
In addition to alphas/betas arrays, counts array is allocated with shape (N, U * 2), which is used as a scheduling mechanism.
core_gather.cu is a slightly memory-efficient version that expects log_probs with the shape (N, T, U, 2) only for blank and labels values.
Do not expect that this implementation will greatly reduce the training time of RNN Transducer model. Probably, the main bottleneck will be a trainable joint network with an output (N, T, U, V).
Also, there is a restricted version, called Recurrent Neural Aligner, with assumption that the length of input sequence is equal to or greater than the length of target sequence.

Install

Currently, there is only a binding for PyTorch 1.0 and higher.

pip install warp_rnnt

Test

There is a unittest in pytorch_binding/warp_rnnt which includes tests for arguments and outputs as well.

python -m warp_rnnt.test

Reference