batzner/indrnn

Result of Sequential MNIST

Sunnydreamrain opened this issue · 14 comments

Hi,

First, thanks a lot for this example.
I just noticed that you wrote "I let it run for two days and stopped it after 60,000 training steps". In your example, LEARNING_RATE_DECAY_STEPS=600000, meaning the learning rate only starts to drop after 600,000 training steps. Does this mean that the result shown on your page is obtained before decaying the learning rate?

If this is the case, further dropping the learning rate might improve the performance.
Also, from your results, the validation error keeps dropping, but relatively slowly. So setting LEARNING_RATE_DECAY_STEPS=20000 might give you a better result than the one you presented (although probably not the best possible one).

By the way, about "I let it run for two days and stopped it after 60,000 training steps": it seems much slower than mine. I am not very familiar with TensorFlow, but does it compute input*W for all the time steps together? If not, I would suggest removing this computation from the IndRNN cell and adding an extra layer that computes it for the whole sequence at once. I think this could improve the efficiency a lot.
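Something roughly like this (just an untested sketch in TF 1.x style; the function and variable names are only illustrative):

```python
import tensorflow as tf

def precompute_input_projection(inputs, num_units):
    """Apply the input weights to all time steps with one large matmul."""
    # inputs: [batch, time, input_dim]
    input_dim = inputs.get_shape().as_list()[-1]
    w_in = tf.get_variable("w_in", [input_dim, num_units])
    batch_size = tf.shape(inputs)[0]
    time_steps = tf.shape(inputs)[1]
    flat = tf.reshape(inputs, [-1, input_dim])       # [batch * time, input_dim]
    projected = tf.matmul(flat, w_in)                # input * W for all steps at once
    return tf.reshape(projected, tf.stack([batch_size, time_steps, num_units]))
```

The per-step cell would then only need to add u * h_{t-1} and apply the activation.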

Thanks.

Thanks for your input.

Does this mean that the result shown on your page is obtained before decaying the learning rate?

Yes. Decaying the learning rate faster might help to get from 1.1% error to the 1.0% reported in the paper.

it seems much slower than mine

Yes, I assumed that your experiments ran a lot faster, since mine would have run for 20 days before dropping the learning rate for the first time. In your code, I saw that you used sequence-wise BN after each layer. This cannot be built into the IndRNNCell code itself, as the call function only has access to the current time step's input. So, I needed to unroll the RNN for every layer, i.e. six times. This caused quite a performance drop, but I still wanted to use IndRNNCell to provide an example of its usage. On CPU, the TensorFlow code runs at the same speed (5 min per 100 batches) as the Theano/Lasagne implementation (6 min per 100 batches). I was not able to test it on a GPU, but I expect the TensorFlow code to be a lot slower.
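For reference, the layer-by-layer structure is roughly the following (a simplified sketch, not the actual training code; it assumes the IndRNNCell from ind_rnn_cell.py in this repository):

```python
import tensorflow as tf
from ind_rnn_cell import IndRNNCell  # cell from this repository

def stacked_indrnn_with_seq_bn(inputs, layer_sizes, is_training):
    # inputs: [batch, time, features]. Each layer is unrolled on its own so that
    # sequence-wise BN can see the full [batch, time, units] output of that layer.
    layer_input = inputs
    for i, num_units in enumerate(layer_sizes):
        with tf.variable_scope("layer_%d" % i):
            cell = IndRNNCell(num_units)
            outputs, _ = tf.nn.dynamic_rnn(cell, layer_input, dtype=tf.float32)
            # axis=-1 pools the statistics over batch and time, i.e. sequence-wise BN.
            layer_input = tf.layers.batch_normalization(
                outputs, axis=-1, training=is_training)
    return layer_input
```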

What might also improve the validation performance is using running averages for the BN population statistics, as you did in your implementation. I used a batch of 500 training examples to set the population statistics before every validation run and 15000 training examples for the final validation and test run. Recalculating the statistics this way gave better estimates for the validation error during the first training steps. With running averages, the initial validation errors oscillated between 0.1 and 0.8, but I assume the validation error becomes more and more stable during the training. I wonder if keeping running averages outperforms recalculating the population statistics on 15000 training examples in the long run.
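To make the two options concrete, here is a minimal NumPy sketch (illustrative only, not the code used for the numbers above):

```python
import numpy as np

def update_running_stats(running_mean, running_var, batch_acts, momentum=0.99):
    # Option (a): exponential running averages, updated after every training batch.
    batch_mean = batch_acts.mean(axis=0)
    batch_var = batch_acts.var(axis=0)
    new_mean = momentum * running_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * running_var + (1.0 - momentum) * batch_var
    return new_mean, new_var

def recompute_population_stats(sample_acts):
    # Option (b): recompute the statistics from a larger sample of training
    # activations (e.g. 500 or 15000 examples) right before a validation run.
    return sample_acts.mean(axis=0), sample_acts.var(axis=0)
```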

Did you also try frame-wise batch normalization on the Sequential MNIST problem? How did it perform?

Hi,

Did you also try frame-wise batch normalization on the Sequential MNIST problem? How did it perform?

I tried to use sequence-wise BN whenever possible in my experiments because this reduces the number of parameters used in the model (although those parameters are just the means and variances). I have tried frame-wise BN before. As far as I can remember, the performance was similar, but the training may be less robust. I'll try that again and let you know the results.

So, I needed to unroll the RNN for every layer, i.e. six times. This caused quite a performance drop, but I still wanted to use IndRNNCell to provide an example of its usage.

Okay. But it is possible to compute the input*W for all time steps together if we want to speed up, right?

Thanks

I'll try that again and let you know the results.

Thank you very much. I am interested in the frame-wise BN results.

But it is possible to compute the input*W for all time steps together if we want to speed up, right?

Yes, that is possible. I will try to find a more efficient way in the next days.

Hi,
I have tried frame-wise BN, and the result is worse than with sequence-wise BN.
Using the mean and variance of the current batch, the error rate is 3.4%.
Using the mean and variance accumulated over the whole training history, it does not work at all.
I think the worse performance may be due to the unstable statistics in BN: compared with sequence-wise BN, the frame-wise statistics are estimated for each time step from only a small batch.
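To illustrate the difference, a small NumPy sketch (the shapes are just those of the Sequential MNIST setup and purely illustrative):

```python
import numpy as np

acts = np.random.randn(32, 784, 128)   # [batch, time, units], batch size 32

# Sequence-wise BN: one mean/variance per unit, pooled over batch and time,
# i.e. estimated from 32 * 784 values per unit.
seq_mean = acts.mean(axis=(0, 1))      # shape [units]
seq_var = acts.var(axis=(0, 1))

# Frame-wise BN: a separate mean/variance per (time step, unit),
# each estimated from only the 32 samples in the batch.
frame_mean = acts.mean(axis=0)         # shape [time, units]
frame_var = acts.var(axis=0)
```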

Thank you for sharing your results! The unstable statistics of frame-wise BN would explain the worse performance. Which batch size did you use for the frame-wise BN? Do you have access to the activations of the trained network across time steps? It would be interesting to see whether the population statistics change over time, as reported in Figure 5 of the Recurrent Batch Normalization paper.

The batch size is 32, the same as in the earlier experiments.
That figure shows the convergence of the statistics over different time steps, but I was talking about the statistics over different samples. Let me first collect some statistics over both time steps and different samples. I will let you know when I have them.

Hi @Sunnydreamrain, could you please create a branch to share the frame-wise BN version of IndRNN_Theano_Lasagne?

@Sunnydreamrain Thank you. I have one question: you apply frame-wise BN before/after the RNN layer, but not inside the recurrent process, right? In the Recurrent Batch Normalization paper, it seems that they apply frame-wise BN at each step within the recurrent process.

Before or after the recurrent part of the IndRNN. "After" means applying BN to the output of the IndRNN layer. "Before" means applying BN to the input projection (Wx) but before the recurrent update (Wx + uh). I think this amounts to the same thing as frame-wise BN.
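Concretely, for a single IndRNN step h_t = relu(W x_t + u * h_{t-1}), the two placements look roughly like this (NumPy sketch, names are only illustrative):

```python
import numpy as np

def bn(x, mean, inv_std, gamma, beta):
    return gamma * (x - mean) * inv_std + beta

def step_bn_before(x_t, h_prev, W, u, bn_params):
    wx = bn(x_t @ W, *bn_params)             # "before": normalize W x_t only
    return np.maximum(wx + u * h_prev, 0.0)  # then the recurrent update

def step_bn_after(x_t, h_prev, W, u, bn_params):
    h_t = np.maximum(x_t @ W + u * h_prev, 0.0)
    return bn(h_t, *bn_params)               # "after": normalize the layer output
```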

@batzner Hi, I have run the experiment on the BN means and standard deviations that you mentioned earlier.

First, for sequence-wise BN: the following shows the mean and the inverse_std of different batches and the average values saved in the model, for all the neurons. You can see that the mean and inverse_std of different batches are almost the same as the saved averages, indicating that the statistics are very stable over the dataset.

Second, for frame-wise BN: the following shows the mean and the inverse_std of different batches and the average values saved in the model, for one random neuron over all the time steps. You can see that the mean and inverse_std of different batches are very different, indicating that the statistics are not stable over different batches.

Third, for frame-wise BN: the following shows the average mean and inverse_std saved in the model, for several random neurons over all the time steps (the statistics you asked about). It looks strange; something must not be working well, but I haven't figured it out yet. The obvious explanation is that the amount of data used for estimating the mean and inverse_std is too small, but that does not seem very convincing.

@Sunnydreamrain Thank you very much for sharing your results. So, if I interpret it correctly,

  1. the first figure implies that for sequence-wise normalization, the batch size of 32 is sufficiently large to ensure stable population statistics.
  2. the second figure implies that for frame-wise normalization, a batch size of 32 is not sufficiently large, although the batch means are already close to the average saved in the model. It is interesting that the mean goes from being stationary to unstable over time, in contrast to Figure 5 in the Recurrent Batch Normalization paper.
  3. the third figure shows that while this stationary -> unstable transition over time seems to exist for all neurons, the means diverge in different directions. Why do you think that something must not be working well? To me, it seems to be consistent behavior with respect to the second figure, where the mean becomes unstable as well.

Again, thank you for sharing your results, they are very interesting!

@batzner Hi, I have tested frame-wise BN for skeleton-based action recognition on the NTU dataset, which is a pretty large dataset. The results are pretty good (maybe 1% lower than with sequence-wise BN) with a batch size of 128. With a batch size of 32, the performance still drops (by around 3%), but it is still much better than on Sequential MNIST (where, using the averaged statistics, it does not work at all). One difference between the two tasks is that for action recognition only 20 time steps are used.
Yes, maybe the main issue is that the batch size of 32 is too small, and when the sequence is very long this makes the statistics too unstable.

Oh, one more thing. I have re-tested char-level PTB (where frame-wise BN has to be used); setting the frame size to 256 seems to improve the performance a little bit. This could again be due to the statistics.