How to write CUDA code for multilayer units
haoyz opened this issue · 1 comment
This tutorial helped me write a single-layer unit with CUDA code.
But how do I write CUDA code for multilayer units, like the call at torch/nn/_functions/rnn.py, line 281?
```python
output, hy, cy, reserve, new_weight_buf = torch._cudnn_rnn(
    input, weight_arr, weight_stride0,
    flat_weight,
    hx, cx,
    mode, hidden_size, num_layers,
    batch_first, dropout, train, bool(bidirectional),
    list(batch_sizes.data) if variable_length else (),
    dropout_ts)
```
I have achieved the same results by using the AutogradRNN template, i.e., torch/nn/_functions/rnn.py, line 212:
```python
def AutogradRNN(mode, input_size, hidden_size, num_layers=1, batch_first=False,
                dropout=0, train=True, bidirectional=False, variable_length=False,
                dropout_state=None, flat_weight=None):
```
But GPU utilization was too low and the speed was too slow, probably because each single-layer unit is called individually, and every one of those calls launches its own CUDA kernel. So I want to rewrite the multilayer units in CUDA and fuse particular groups of single-layer operations into one kernel. Can you provide a boilerplate?
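To make it concrete, here is roughly what I imagine for the fused part, modeled on the lltm kernel from the tutorial rather than on the real cuDNN implementation. It assumes the gate pre-activations (the x_t and h_{t-1} GEMMs plus biases, in PyTorch's i, f, g, o order) are computed beforehand, and fuses all of the pointwise LSTM math for one timestep of one layer into a single kernel. The names `lstm_cell_fused` and `lstm_cell_fused_launch` are just placeholders I made up:

```cuda
#include <cuda_runtime.h>
#include <math.h>

__device__ __forceinline__ float sigmoidf(float x) {
  return 1.0f / (1.0f + expf(-x));
}

// Fused pointwise LSTM cell update (sketch, placeholder name).
// gates:  [batch, 4 * hidden] pre-activations in i, f, g, o order
// c_prev: [batch, hidden] previous cell state
// h_out / c_out: [batch, hidden] outputs for this timestep
// One thread handles one (batch, hidden) element, so all the gate
// nonlinearities plus the cell/hidden updates cost a single launch
// instead of a chain of separate elementwise kernels.
__global__ void lstm_cell_fused(const float* __restrict__ gates,
                                const float* __restrict__ c_prev,
                                float* __restrict__ h_out,
                                float* __restrict__ c_out,
                                int batch, int hidden) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= batch * hidden) return;

  const int b = idx / hidden;
  const int h = idx % hidden;
  const float* g = gates + b * 4 * hidden;

  const float i = sigmoidf(g[0 * hidden + h]);   // input gate
  const float f = sigmoidf(g[1 * hidden + h]);   // forget gate
  const float c_hat = tanhf(g[2 * hidden + h]);  // candidate cell
  const float o = sigmoidf(g[3 * hidden + h]);   // output gate

  const float c = f * c_prev[idx] + i * c_hat;
  c_out[idx] = c;
  h_out[idx] = o * tanhf(c);
}

// Host-side launcher (placeholder): one thread per output element.
void lstm_cell_fused_launch(const float* gates, const float* c_prev,
                            float* h_out, float* c_out,
                            int batch, int hidden, cudaStream_t stream) {
  const int total = batch * hidden;
  const int threads = 256;
  const int blocks = (total + threads - 1) / threads;
  lstm_cell_fused<<<blocks, threads, 0, stream>>>(gates, c_prev,
                                                  h_out, c_out,
                                                  batch, hidden);
}
```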
There are two LSTM classes in PyTorch, LSTMCell and LSTM. The former is a single layer and only receives the input for one timestep. The latter can stack several layers, consume a whole sequence of timesteps, and supports bidirectional input. I think the tutorial teaches people how to build an LSTMCell-like unit named lltm, but I wonder how to build an LSTM-like unit using CUDA, something along the lines of the driver sketch below. Any ideas or suggestions? @ClementPinard
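For the stacking itself, the part I am unsure about is how much should live in device code versus plain C++. My current guess is a C++ extension function in the style of the tutorial's lltm.cpp that keeps the GEMMs in cuBLAS via torch::addmm and calls the fused kernel above once per (layer, timestep). `stacked_lstm_forward`, the weight list layout, and `lstm_cell_fused_launch` are placeholders of mine, not real PyTorch APIs:

```cpp
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include <vector>

// Launcher from the .cu sketch above (placeholder name).
void lstm_cell_fused_launch(const float* gates, const float* c_prev,
                            float* h_out, float* c_out,
                            int batch, int hidden, cudaStream_t stream);

// Sketch of a stacked, unidirectional LSTM forward pass.
// input:   [seq_len, batch, input_size]
// weights: per layer {w_ih, w_hh, b_ih, b_hh}, same layout as torch.nn.LSTM
// h0, c0:  [num_layers, batch, hidden]
std::vector<torch::Tensor> stacked_lstm_forward(
    torch::Tensor input,
    std::vector<std::vector<torch::Tensor>> weights,
    torch::Tensor h0, torch::Tensor c0) {
  const int64_t seq_len = input.size(0);
  const int64_t batch = input.size(1);
  const int64_t num_layers = h0.size(0);
  const int64_t hidden = h0.size(2);

  auto hy = h0.clone();
  auto cy = c0.clone();
  auto layer_input = input;

  for (int64_t l = 0; l < num_layers; ++l) {
    auto h = h0[l].contiguous();
    auto c = c0[l].contiguous();
    // Input-to-hidden GEMM for the whole sequence in one cuBLAS call;
    // only the hidden-to-hidden GEMM has to stay inside the time loop.
    auto igates = torch::addmm(weights[l][2],
                               layer_input.reshape({seq_len * batch, -1}),
                               weights[l][0].t())
                      .view({seq_len, batch, 4 * hidden});
    std::vector<torch::Tensor> outputs;
    for (int64_t t = 0; t < seq_len; ++t) {
      auto gates = (igates[t] +
                    torch::addmm(weights[l][3], h, weights[l][1].t()))
                       .contiguous();
      auto h_new = torch::empty_like(h);
      auto c_new = torch::empty_like(c);
      // Single fused kernel for all the pointwise work of this step.
      lstm_cell_fused_launch(gates.data_ptr<float>(), c.data_ptr<float>(),
                             h_new.data_ptr<float>(), c_new.data_ptr<float>(),
                             batch, hidden, at::cuda::getCurrentCUDAStream());
      h = h_new;
      c = c_new;
      outputs.push_back(h);
    }
    hy[l].copy_(h);
    cy[l].copy_(c);
    layer_input = torch::stack(outputs);  // output of layer l feeds layer l + 1
  }
  return {layer_input, hy, cy};
}
```

Is this roughly the right structure, or does a proper multilayer unit need the per-timestep loop itself moved into device code to get decent GPU utilization?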