kohjingyu/fromage

The evaluation speed of IT2T on VisDial

Ziyang412 opened this issue · 8 comments

Hi, thank you for the wonderful work. I was reproducing the IT2T evaluation for VisDial, and I found that when I run the 5th code block, the estimated time is pretty long (more than 3 days). Is that a reasonable evaluation time, or are there some bugs I need to fix (or anything I can do to speed up the process)? Thank you in advance!


Thanks for letting me know about this. What GPU are you using for this? It may be reasonable depending on that.

Just to confirm, are you also loading images from a local disk? (just making sure that network access is not the reason it is slow)

For me, on an A6000 GPU (batch size of 20), it takes around 10 hours to run. You might consider running it on fewer samples; 200-300 samples would probably be enough to estimate performance.

Thanks for the reply. I am using a V100-32GB GPU (bs 20). I am loading images from a local disk.

BTW, how much VRAM is used for this eval? I tried a batch size of 20 on a 32GB GPU but got CUDA out of memory.

Thank you in advance!

Thanks for your patience. I tested it myself on a V100 and got a similar ETA to yours (> 72 hours). I tried to introduce some batching optimizations, but they didn't have a significant effect. I think this is because the V100 is significantly slower than the A6000, especially on bf16 data.

The reason it takes so long is that the VisDial val set has 2064 examples, and each has 10 rounds of dialogue (= 20640 examples to be evaluated). Each example also has 100 answer options, so to pick the best one we compute the log likelihood of each option by passing it through Fromage. So you can imagine that this is a huge number of tokens (100 options * max seq_len * 10 rounds for each of the 2064 examples) 🙂
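To make the scale concrete, here is the back-of-envelope arithmetic from the numbers above (the sequence-length factor is left out since it varies per example):

```python
# Rough count of forward scorings for the VisDial IT2T eval,
# using the figures quoted in this thread.
examples = 2064   # images in the VisDial val set
rounds = 10       # dialogue rounds per image
options = 100     # candidate answers per round

option_scorings = examples * rounds * options
print(option_scorings)  # 2064000 option log-likelihoods to compute
```

Each of those ~2 million scorings is a forward pass over a full context, which is where the days of wall-clock time come from.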

I don't have a good solution for speeding it up at the moment, except to maybe run it on fewer examples (200 should give you a good estimate). Sorry about that!
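For anyone unfamiliar with this style of multiple-choice eval, the option-ranking step can be sketched like this. This is a hedged, toy illustration (the function names and the per-token log probs are made up); in the real pipeline the log probs come from a forward pass of Fromage over the image + dialogue context:

```python
import math

def option_log_likelihood(token_log_probs):
    # Sum token-level log probabilities of the answer tokens;
    # a higher (less negative) total means a more likely answer.
    return sum(token_log_probs)

def pick_best_option(options):
    # options: dict mapping answer text -> list of per-token log probs.
    # In practice these come from the model; here they are toy numbers.
    return max(options, key=lambda o: option_log_likelihood(options[o]))

options = {
    "a red ball":  [math.log(0.5), math.log(0.4), math.log(0.6)],
    "a blue cube": [math.log(0.2), math.log(0.1), math.log(0.3)],
}
print(pick_best_option(options))  # prints "a red ball"
```

Doing this for 100 options per round is what multiplies the cost by 100 relative to a single generation pass.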

> BTW, how much VRAM is used for this eval? I tried a batch size of 20 on a 32GB GPU but got CUDA out of memory.

The base model takes about 18 GB, and I think a batch size of 20 uses up to 41GB (it's bounded by the longest sequence length in VisDial). You should be able to fit a lower batch size on 32GB, though (I tried 10; 16 may work).

Thanks for your reply! Yeah I think a batch size of 10 could work on 32GB.

Anyway, I really appreciate your help, and it's really interesting and solid work.

I heard that the slowdown might actually be due to bfloat16 not being natively supported on V100s (not 100% sure though). It could be worth trying to load the model in fp16 instead.
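A simple way to handle this is to key the dtype choice on the GPU's compute capability. This is a sketch under the assumption that native bf16 support arrived with Ampere (compute capability 8.0); the V100 is 7.0 and the A6000 is 8.6 (`pick_dtype` here is an illustrative helper, not part of the repo):

```python
def pick_dtype(compute_capability):
    # bf16 has native tensor-core support from Ampere (cc >= 8.0) onward;
    # on older cards like the V100 (cc 7.0), fp16 is the faster choice.
    major, _minor = compute_capability
    return "bf16" if major >= 8 else "fp16"

print(pick_dtype((7, 0)))  # V100  -> prints "fp16"
print(pick_dtype((8, 6)))  # A6000 -> prints "bf16"
```

With PyTorch, the capability tuple can be read via `torch.cuda.get_device_capability()`, and `torch.cuda.is_bf16_supported()` offers a direct check.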

Oh, thanks, I'll try that! BTW, I am a bit curious about the evaluation pipeline here. It seems like the perplexity computation is what makes things so inefficient; was it chosen for performance reasons? Have you tried any other way of evaluating? Thanks so much!

That's good to know!

Sorry, I haven't tried other ways of measuring it. I think measuring loss/perplexity is quite a standard approach for multiple-choice questions, so I'm not sure there are other approaches that can be used.

Gotcha, it makes sense. Thank you so much!