songweige/TATS

The dropout of Transformer

Closed this issue · 7 comments

Dear authors:

I found that the dropout probabilities (embd_pdrop, resid_pdrop, attn_pdrop) are set to 0 during the GPT training.
To verify this observation, I downloaded the TATS-base checkpoints for UCF-101 and Sky-Timelapse from the homepage, and embd_pdrop, resid_pdrop, and attn_pdrop were all set to 0 there as well.

A value of 0 means dropout is effectively disabled. So I want to check: is this correct, or am I missing something?

Kang

Hi Kang,

Thank you for your questions. That is right, we set the dropout to 0 based on our initial observation that it did not have much effect on performance. However, that might not be optimal, as we did not spend much time sweeping this hyperparameter.
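
For reference, here is a minimal sketch of why a probability of 0 disables dropout, assuming a minGPT-style setup with the parameter names from your question (the exact TATS module layout may differ):

```python
import torch
import torch.nn as nn

# Sketch only (not the exact TATS code): in a minGPT-style transformer,
# embd_pdrop / resid_pdrop / attn_pdrop are passed to nn.Dropout layers.
# With p = 0.0 every element is kept and scaled by 1/(1-0) = 1,
# so the layer acts as an identity even in training mode.
embd_pdrop = 0.0   # dropout after the token + position embeddings
resid_pdrop = 0.0  # dropout on the residual / MLP outputs
attn_pdrop = 0.0   # dropout on the attention weights

drop = nn.Dropout(p=embd_pdrop)
drop.train()                        # training mode, where dropout would normally fire
x = torch.randn(2, 16, 512)
print(torch.allclose(drop(x), x))   # True: with p=0 the input passes through unchanged
```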

Songwei

Thanks for your reply.

Another question is about the learning rate of the VQGAN:

  1. In your paper, Section B.2, you write lr = 3e-5;
  2. In your code, args.lr = accumulate * (ngpu/8.) * (bs/4.) * base_lr;
  3. In the VQGAN checkpoint (for UCF-101), the lr is 9e-5.

Which one should I follow?

I think you just need to set --lr 3e-5. The final learning rate (also stored in the checkpoint) is calculated based on your second point. In the default setting, I had ngpu=8, bs=2, accumulate=6, therefore lr = 6 * (8/8) * (2/4) * 3e-5 = 9e-5.
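
To make the arithmetic explicit, here is a small sketch of that scaling rule (the helper name is just for illustration):

```python
# Sketch of the scaling rule quoted above:
#   args.lr = accumulate * (ngpu / 8.) * (bs / 4.) * base_lr
def effective_lr(base_lr, ngpu, bs, accumulate):
    """Effective learning rate stored in the checkpoint (illustrative helper)."""
    return accumulate * (ngpu / 8.) * (bs / 4.) * base_lr

# Default VQGAN setting from the reply: --lr 3e-5, 8 GPUs, batch size 2, accumulate 6
print(effective_lr(3e-5, ngpu=8, bs=2, accumulate=6))  # ~9e-05, matching the checkpoint value
```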

Got it!

In B.2, you wrote: "We train the VQGAN on 8 NVIDIA V100 32GB GPUs with batch size = 2 on each gpu and accumulated batches = 6 for 30K steps".

If I use GPUs with more memory and make the total batch size of one step 96 (= 8 * 2 * 6), does that mean I only need 30K/6 = 5K steps?

Hi Kang, that's an interesting question. In the two cases you mentioned, the model would see the same number of images, but the number of updates applied in the large-batch setting is 6 times smaller. So the model is probably not converged yet at 5K steps.

On the other hand, one of my past observations was that a large batch size really helped the training stability of the VQGAN. You will probably see better results with such a setting.

I'm a little confused. In my opinion, gradient accumulation is equivalent to using a larger batch size.
That is, in the case "We train the VQGAN on 8 NVIDIA V100 32GB GPUs with batch size = 2 on each gpu and accumulated batches = 6 for 30K steps", the number of updates should be 5K, the same as in the large batch size (= 96) case: if you update the weights at step i, you only accumulate gradients without updating for the next five steps, so the number of updates is 30K/6.

Please correct me.

Oh, I see what you mean now! You are right that gradient accumulation is equivalent to using a larger batch size. But the 30K steps refers to the number of steps taken by the optimizer. So even with a larger batch size of 96 and no gradient accumulation, you should still train the model for the same 30K optimizer steps.
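
To spell the equivalence out, here is a toy sketch (a hypothetical linear model, not the TATS training loop): accumulating six micro-batches yields the same gradient as one backward pass over the full batch of 96, and in both cases a single optimizer update is taken, so "30K steps" means 30K optimizer updates either way.

```python
import torch

# Toy sketch (not the TATS code): gradient accumulation over 6 micro-batches
# matches one backward pass over the full batch of 96 for a single optimizer update.
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
big_batch = torch.randn(96, 4)        # one large batch of 96 samples
micro_batches = big_batch.chunk(6)    # 6 micro-batches of 16 samples each

def grad_after(backward_fn):
    model.zero_grad()
    backward_fn()
    return model.weight.grad.clone()

# (a) one large batch: a single backward pass
g_large = grad_after(lambda: model(big_batch).pow(2).mean().backward())

# (b) gradient accumulation: 6 backward passes, gradients summed into .grad
def accumulate():
    for mb in micro_batches:
        (model(mb).pow(2).mean() / 6).backward()  # divide so the sum averages over all 96

g_accum = grad_after(accumulate)

print(torch.allclose(g_large, g_accum))  # True: same gradient, one optimizer step either way
```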