Rough runtime benchmarks across tasks and context lengths on any hardware setup
girishbalaji opened this issue · 0 comments
girishbalaji commented
Does anyone have rough numbers, on any hardware setup, for the inference runtime of the standard models across tasks and various context lengths?
Using vLLM, for just 5 requests at the 131072 context length for NIAH_single_1, I'm currently seeing ~15 minutes on a single A100 for Llama 3.1 8B.
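For context, this is roughly the timing harness I'm using. It's a minimal sketch, not the benchmark's own runner; the model ID, prompt placeholders, and engine arguments are just stand-ins for my setup:

```python
import time
from vllm import LLM, SamplingParams

# Placeholder engine settings; max_model_len and gpu_memory_utilization
# assume a single A100 80GB.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=131072,
    gpu_memory_utilization=0.95,
)

# Dummy prompts standing in for the 5 NIAH_single_1 inputs (~128K tokens each).
prompts = ["<128k-token haystack with a single needle> Question: ..."] * 5

sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start
print(f"{len(prompts)} requests took {elapsed / 60:.1f} min")
```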
While I continue to experiment with parallelism configs (sketch below), I'm wondering whether other folks can share rough order-of-magnitude runtimes they have seen across various tasks and context lengths on any hardware setup.
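For anyone comparing, these are the kinds of engine knobs I'm varying. Again just a sketch; the specific values are assumptions about what might help at 128K, not recommendations:

```python
from vllm import LLM

# Hypothetical multi-GPU / long-context settings to experiment with:
# - tensor_parallel_size splits the model across GPUs (e.g. 2x A100 instead of 1)
# - enable_chunked_prefill and max_num_batched_tokens control how the very long
#   prefill is batched, which dominates runtime at 128K context
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=131072,
    tensor_parallel_size=2,
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
)
```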