Why does RoBERTa-500K have 4.5x more computation than ELECTRA-400K?
rabbitwayne opened this issue · 3 comments
Why does RoBERTa-500K have 4.5x more computation than ELECTRA-400K in the paper https://openreview.net/pdf?id=r1xMH1BtvB? Both RoBERTa-500K and ELECTRA-400K are the same size as BERT-Large. I would have expected RoBERTa-500K to use only 1.25x the computation of ELECTRA-400K (500K vs. 400K steps). Why is it 4.5x?
Because RoBERTa uses a batch size of 8K, whereas ELECTRA uses a batch size of 2K. So even though both train for a similar number of steps (400-500K), RoBERTa processes roughly 4x as many sequences per step and therefore uses much more compute: RoBERTa was trained on 1024 V100 GPUs, while ELECTRA was, I believe, trained on either a TPU v3-256 or a TPU v2-512, judging by the batch size.
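As a rough sanity check, here is a minimal back-of-the-envelope sketch in Python. It assumes total compute scales with batch size × training steps for two models of the same size; the 8K/2K batch sizes are the ones mentioned above, and the 500K/400K step counts come from the model names.

```python
# Back-of-the-envelope compute comparison, assuming total compute is roughly
# proportional to (batch size x number of training steps) when both models
# are BERT-Large sized.
roberta_batch, roberta_steps = 8192, 500_000   # RoBERTa-500K
electra_batch, electra_steps = 2048, 400_000   # ELECTRA-400K

ratio = (roberta_batch * roberta_steps) / (electra_batch * electra_steps)
print(f"Naive compute ratio: {ratio:.1f}x")    # prints 5.0x
```

That naive estimate lands at about 5x, in the same ballpark as the 4.5x reported in the paper. The paper's figure is based on FLOP counts, so details such as ELECTRA's extra generator compute can shift it slightly, but the dominant factor is the 4x larger batch.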
Thank you for the explanation!
Hello all, I just started reading the paper and I have a few questions.
I was wondering if you could help me with them?
- What exactly does "step" mean in the step count? Does it refer to 1 epoch or 1 minibatch?
- Also, in the paper (specifically in Table 1) I saw that ELECTRA-Small and BERT-Small both have 14M parameters. How is that possible, given that ELECTRA should have more parameters because its generator and discriminator modules are both BERT-based?
- Also, what is the architecture of the generator and the discriminator? Are they both BERT or something else?
- Also, what do the 500K and 400K mean in the model names above, like RoBERTa-500K or ELECTRA-400K?
Thanks in advance