When I use a sequence length of 28 for LSTM and TCN, LSTM is much faster than TCN.
KinWaiCheuk opened this issue · 7 comments
It seems to me that LSTM is faster when the sequence length is short (say 28).
When the sequence length is long (say 784), LSTM will be much slower than TCN.
It seems to me that for TCN, the computation time is independent of the sequence length.
Am I correct?
Not necessarily. LSTM is slower than TCN on long sequences because recurrent networks process the tokens sequentially, whereas a TCN can perform its convolution operations in parallel. However, when the sequence is long enough, you should still expect a slowdown, because you only have a limited number of CUDA cores (or limited CPU compute).
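To see this effect concretely, here is a hypothetical micro-benchmark (not from this thread): it times the forward pass of an `nn.LSTM` against a crude stand-in for a TCN, a small stack of dilated `Conv1d` layers, at the two sequence lengths discussed. The batch size, channel count, and layer sizes are arbitrary placeholders.

```python
import time
import torch
import torch.nn as nn

def time_forward(model, x, reps=3):
    # Average wall-clock time of `reps` forward passes (after one warm-up).
    with torch.no_grad():
        model(x)
        t0 = time.perf_counter()
        for _ in range(reps):
            model(x)
    return (time.perf_counter() - t0) / reps

batch, channels = 16, 32
lstm = nn.LSTM(input_size=channels, hidden_size=channels, batch_first=True)
conv_stack = nn.Sequential(  # stand-in for a TCN: three dilated conv layers
    nn.Conv1d(channels, channels, kernel_size=3, padding=1, dilation=1),
    nn.Conv1d(channels, channels, kernel_size=3, padding=2, dilation=2),
    nn.Conv1d(channels, channels, kernel_size=3, padding=4, dilation=4),
)

for seq_len in (28, 784):
    x = torch.randn(batch, seq_len, channels)
    t_lstm = time_forward(lstm, x)
    t_conv = time_forward(conv_stack, x.transpose(1, 2))  # Conv1d wants (N, C, L)
    print(f"len={seq_len}: LSTM {t_lstm:.4f}s, conv stack {t_conv:.4f}s")
```

The absolute numbers depend heavily on hardware; the point is how each model's time scales as the sequence length grows.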
But when I am trying to do the sequential MNIST in 28X28 fashion (each sequence has a length of 28 and 28 sequences in total), LSTM is much faster than TCN.
Here's my training loop for the LSTM, which takes only 13 seconds for each epoch.
Here's my training for TCN, which takes almost 40 seconds for each epoch.
Am I doing anything wrong here?
Nope, I think it depends more on your dilation configuration, batch size, # of parameters, # of LSTM layers and the compute resources you use than merely the architectural differences. However, you should expect good parallelism from TCN, which offers great advantages as the sequence length gets longer.
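The dilation configuration matters because it sets the receptive field, and hence how many layers you need for a given sequence length. As a sketch (my own helper, not code from the repo), assuming one convolution per level and dilation doubling at each level:

```python
def receptive_field(kernel_size, levels):
    # Receptive field of a stack of dilated causal convolutions with
    # dilations 1, 2, 4, ..., 2**(levels - 1) and one conv per level.
    return 1 + (kernel_size - 1) * sum(2 ** i for i in range(levels))

# With kernel size 7 and 8 levels the receptive field is 1531 steps,
# enough to cover a full 784-step sequential-MNIST input:
print(receptive_field(7, 8))  # 1531
```

Note that the TCN in the paper uses two convolutions per residual block, which roughly doubles this figure; either way, deeper stacks or larger kernels are needed for the 784-length task than for the 28-length one.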
I see your point. When the seq length is longer, I do see the advantage of TCN being able to parallelize.
One last question, in your paper, did you compare TCN on 784X1 sequential MNIST to LSTM on 28X28 sequential MNIST?
My LSTM has a really poor performance when training on 784X1 sequential MNIST. Basically it doesn't learn, its accuracy is only around 0.12.
Try tuning the forget gate bias. I think it should reach an accuracy of about 90%.
I have been trying to replicate the same result that you reported in your paper, but the LSTM results are always worse than what you reported.
After initializing the forget gate bias to 1, I did get a better result, but I am still unable to reach 90% accuracy in 20 epochs. I have added gradient clipping at 1 and used RMSprop as the optimizer, but I can only get at most 85% accuracy.
Here is my code. Am I missing anything?
No, I think you are doing things correctly. Don't use RMSprop; I think Adam would work just fine. You can try tuning the gradient clipping threshold as well.
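A minimal sketch of the suggested setup (Adam plus gradient-norm clipping). The model, fake batch, learning rate, and clip value below are placeholders, not the author's exact configuration:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
head = nn.Linear(32, 10)  # classification head over 10 digit classes
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a fake 784x1 sequential-MNIST batch.
x = torch.randn(8, 784, 1)
y = torch.randint(0, 10, (8,))
out, _ = model(x)
loss = criterion(head(out[:, -1]), y)  # classify from the last hidden state
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # tunable clip value
optimizer.step()
print(round(loss.item(), 3))
```

Both `max_norm` and the learning rate are worth sweeping; clipping that is too aggressive can also slow convergence.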