Pareto curves clarifications
shira-g opened this issue · 2 comments
Hello,
Can you please clarify the units used to measure FLOPs in the paper (Figure 2)?
I tried to reproduce the results. For example, when measuring MACs (using torchprofile) for the bert-standard model, I get the value 35340779904, and I'm not sure how to convert it to build a Pareto curve for comparison with the search results.
In addition, can you please share how you constructed the sequence lengths for each of the graph points of the standard and length-adaptive models?
Thank you,
Shira
Yes. 35340779904 is about 35G, which you can find as the rightmost point on the curve.
The x-axis of Figure 2 shows GFLOPs on a log scale.
There are three curves for each pre-trained language model:
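As a minimal sketch of the conversion described above, the raw count reported by the profiler can be divided by 10^9 to get the "G" value, and its base-10 logarithm gives the position on a log-scale axis. The helper name `to_giga` is hypothetical, and the sketch assumes the profiler count is used directly (no extra factor between MACs and FLOPs), as the "about 35G" reading suggests.

```python
import math

def to_giga(count: int) -> float:
    """Convert a raw operation count to giga units (1 G = 1e9)."""
    return count / 1e9

# Value reported in this thread for the bert-standard model.
macs = 35340779904

giga = to_giga(macs)       # about 35.34
log_x = math.log10(giga)   # position on a log10-scaled x-axis

print(f"{giga:.2f} G, log10 = {log_x:.3f}")
```

When plotting with matplotlib, the same effect can be had by plotting `giga` directly and calling `plt.xscale("log")`.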
(1) a standard model with constant-rate length reduction
(2) a length-adaptive model with constant-rate length reduction
(3) a length-adaptive model with length configurations from the evolutionary search
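To illustrate what a constant-rate length reduction could look like for curves (1) and (2), here is a rough sketch. It assumes each layer keeps a fixed fraction of the previous layer's sequence length, rounded up; the function name `constant_rate_lengths`, the rounding rule, and the example rates are assumptions for illustration, not the repository's confirmed procedure.

```python
import math
from typing import List

def constant_rate_lengths(initial_length: int, num_layers: int, keep_rate: float) -> List[int]:
    """Return a per-layer sequence length, shrinking by a fixed keep rate.

    Assumption: length after each layer = ceil(keep_rate * previous length),
    never dropping below 1 token.
    """
    lengths = []
    current = initial_length
    for _ in range(num_layers):
        current = max(1, math.ceil(keep_rate * current))
        lengths.append(current)
    return lengths

# Sweeping the keep rate yields one length configuration per curve point.
# The rates below are hypothetical examples.
for rate in (1.0, 0.9, 0.8, 0.7):
    print(rate, constant_rate_lengths(initial_length=128, num_layers=12, keep_rate=rate))
```

Each configuration would then be profiled for its operation count and evaluated for accuracy to give one point on the curve.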
Thank you!