NVIDIA/RULER

Rough runtime benchmarks across tasks and context lengths on any hardware setup

girishbalaji opened this issue

Does anyone have rough numbers, on any hardware setup, for the inference runtime of standard models across tasks and various context lengths?

Using vLLM, for just 5 requests at the 131072 context length on NIAH_single_1, I'm currently seeing ~15 minutes on a single A100 for Llama 3.1 8B.
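For reference, this is roughly how I'm timing it. This is a minimal sketch, not the exact RULER harness: the jsonl path and the "input" field name are assumptions about how the generated task data is laid out, and the sampling settings are just placeholders.

```python
# Minimal vLLM timing sketch, assuming RULER has already generated a
# validation.jsonl for niah_single_1 at 131072 tokens. The path and the
# "input" field name below are assumptions, not confirmed by RULER docs.
import json
import time

from vllm import LLM, SamplingParams

DATA_PATH = "benchmark_root/llama-3.1-8b/synthetic/131072/data/niah_single_1/validation.jsonl"  # hypothetical path

# Load just the first 5 prompts to mirror the measurement above.
prompts = []
with open(DATA_PATH) as f:
    for _ in range(5):
        line = f.readline()
        if not line:
            break
        prompts.append(json.loads(line)["input"])  # "input" field is an assumption

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=131072,
    gpu_memory_utilization=0.95,
)
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
print(f"{len(outputs)} requests in {elapsed:.1f}s "
      f"({elapsed / len(outputs):.1f}s per request)")
```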

While I actively experiment with parallelism configs, I'm wondering whether other folks can share rough order-of-magnitude runtimes they have seen across various tasks and context lengths on any hardware setup.