Error in FLOPS Calculation
passaglia opened this issue · 0 comments
passaglia commented
There's a bug in the GPT-NeoX FLOPS calculation here (line 104 as of commit a2b2020):
The term proportional to the vocab_size, flops_calc2, should share all the same prefactors as flops_calc_1; see page 12 of arXiv:2104.04473.
Since this term scales inversely with both hidden_dim and num_layers, it matters most for small models: for Pythia-70m the relative error is roughly $50257/(16 \times 6 \times 512) \approx 102\%$, while for GPT-NeoX-20B it is only $50257/(16 \times 44 \times 6144) \approx 1.2\%$.
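To make the size of the discrepancy concrete, here is a small sketch of the check above. The function name and signature are illustrative, not from the repo; the ratio follows the FLOPs formula in arXiv:2104.04473, where the vocab (embedding/logit) term contributes a fraction $V/(16Lh)$ on top of the main transformer term:

```python
def vocab_term_fraction(vocab_size: int, num_layers: int, hidden_dim: int) -> float:
    """Relative size of the vocab term vs. the main transformer FLOPs term,
    per arXiv:2104.04473: F ~ 96*s*L*h^2 * (1 + s/(6h) + V/(16*L*h))."""
    return vocab_size / (16 * num_layers * hidden_dim)

# Pythia-70m: 6 layers, hidden dim 512 -> dropping the prefactors is a ~102% error
print(f"{vocab_term_fraction(50257, 6, 512):.0%}")
# GPT-NeoX-20B: 44 layers, hidden dim 6144 -> only ~1.2%
print(f"{vocab_term_fraction(50257, 44, 6144):.1%}")
```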
This bug seems to have been introduced only ~3 months ago in #1044, so it may not have affected, e.g., any measurements made while training Pythia.