locuslab/wanda

Can Wanda speed up LLM inference?

ifromeast opened this issue · 1 comment

I have tested a LLaMA-13B model; the pruned model's latency is almost the same as the raw model's.

Wanda is a weight pruning method developed for LLMs. Its speedup guarantees and conditions are the same as those of standard magnitude pruning: the pruned weights are set to zero, but the weight tensors remain dense, so off-the-shelf dense kernels still do the same amount of work. Unstructured sparsity therefore needs specialized sparse kernels or hardware for a practical speedup, whereas structured N:M patterns (e.g., 2:4) can be accelerated on GPUs with sparse tensor core support.
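
A minimal sketch of why latency stays flat, assuming a PyTorch setting (the layer size, batch size, and magnitude-based mask below are illustrative, not the repo's actual pruning code):

```python
import time
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-in for one LLM projection layer (sizes are illustrative).
layer = nn.Linear(4096, 4096, bias=False)
x = torch.randn(8, 4096)

# Unstructured 50% pruning: zero out the smallest-magnitude weights.
w = layer.weight.data
threshold = w.abs().flatten().kthvalue(w.numel() // 2).values
w[w.abs() <= threshold] = 0.0

def bench(fn, iters=100):
    # Warm up, then time the dense forward pass.
    for _ in range(10):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

print(f"dense forward after pruning: {bench(lambda: layer(x)) * 1e3:.3f} ms")
# The zeros are stored and multiplied like any other value, so this
# latency matches the unpruned layer; a real speedup requires sparse
# kernels or hardware support (e.g., 2:4 sparse tensor cores).
```

On an Ampere-class GPU, one path to actual acceleration is converting a 2:4-pruned weight to PyTorch's semi-structured sparse format (`torch.sparse.to_sparse_semi_structured`), which is why N:M structured sparsity is the pattern that tends to translate into wall-clock gains.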