pytorch/extension-cpp

How to write CUDA code for multilayer units

haoyz opened this issue · 1 comment

haoyz commented

This tutorial helped me write a single-layer unit with CUDA code.
But how do I write CUDA code for multilayer units, like line 281 of torch/nn/_functions/rnn.py?

output, hy, cy, reserve, new_weight_buf = torch._cudnn_rnn(
           input, weight_arr, weight_stride0,
           flat_weight,
           hx, cx,
           mode, hidden_size, num_layers,
           batch_first, dropout, train, bool(bidirectional),
           list(batch_sizes.data) if variable_length else (),
           dropout_ts)

I have achieved the same results by following the template of AutogradRNN, i.e., line 212 of torch/nn/_functions/rnn.py:

def AutogradRNN(mode, input_size, hidden_size, num_layers=1, batch_first=False,
                dropout=0, train=True, bidirectional=False, variable_length=False,
                dropout_state=None, flat_weight=None):

But GPU utilization was too low and the speed was too slow, probably because each single-layer unit is called individually, and every call launches its own CUDA kernels. So I want to rewrite the multilayer unit in CUDA and fuse particular groups of single-layer cells. Can you provide a boilerplate?
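To make the question concrete, here is a rough host-side sketch of what I am imagining, written against the tutorial's extension style. I am assuming a per-cell function like the tutorial's lltm_cuda_forward whose first two outputs are the new hidden and cell states; the names and signatures here are placeholders, not the tutorial's actual code. This only moves the time/layer loops from Python into C++, so each cell still launches its own kernels; actually fusing several cells would mean rewriting the kernel itself.

// multilayer_lltm.cpp -- rough sketch only, assuming the tutorial's
// per-cell CUDA forward is available; names/signatures are hypothetical.
#include <torch/extension.h>
#include <vector>

// Declared here, implemented in the extension's .cu file.
std::vector<torch::Tensor> lltm_cuda_forward(
    torch::Tensor input,
    torch::Tensor weights,
    torch::Tensor bias,
    torch::Tensor old_h,
    torch::Tensor old_cell);

// Run a whole sequence through a stack of cells inside C++, so Python makes
// a single call per forward pass instead of one call per step and per layer.
std::vector<torch::Tensor> multilayer_forward(
    torch::Tensor input,                  // [seq_len, batch, input_features]
    std::vector<torch::Tensor> weights,   // one weight tensor per layer
    std::vector<torch::Tensor> biases,    // one bias tensor per layer
    std::vector<torch::Tensor> h,         // initial hidden state per layer
    std::vector<torch::Tensor> c) {       // initial cell state per layer
  const auto seq_len = input.size(0);
  const auto num_layers = weights.size();
  std::vector<torch::Tensor> outputs;
  outputs.reserve(seq_len);

  for (int64_t t = 0; t < seq_len; ++t) {
    auto x = input[t];
    for (size_t l = 0; l < num_layers; ++l) {
      auto cell_out = lltm_cuda_forward(x, weights[l], biases[l], h[l], c[l]);
      h[l] = cell_out[0];   // assumed: new hidden state
      c[l] = cell_out[1];   // assumed: new cell state
      x = h[l];             // output of this layer feeds the next one
    }
    outputs.push_back(x);   // top layer's hidden state at step t
  }

  return {torch::stack(outputs), torch::stack(h), torch::stack(c)};
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward", &multilayer_forward,
        "Multilayer forward over a full sequence (sketch)");
}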

haoyz commented

There are two LSTM-related classes in PyTorch, LSTMCell and LSTM. The former is a single layer and only takes one time step of input. The latter can stack several layers, take a multi-step sequence, and supports bidirectional input. I think the tutorial teaches how to build an LSTMCell-like unit named lltm, but I wonder how to build an LSTM-like unit using CUDA. Any ideas or suggestions? @ClementPinard