locuslab/wanda

Can Wanda speed up LLM inference?

ifromeast opened this issue · 1 comment

I have tested a LLaMA-13B model; the pruned model's latency is almost the same as the raw model's.

Wanda is a weight pruning method developed for LLMs. Its speedup guarantees and conditions are the same as those of standard magnitude pruning: the pruned weights are set to zero, but the weight tensors remain dense, so off-the-shelf dense kernels still do the same amount of work. Unstructured sparsity therefore needs specialized sparse kernels or hardware for a practical speedup, whereas structured N:M patterns (e.g., 2:4) can be accelerated on GPUs with sparse tensor core support.
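
A minimal sketch of why latency stays flat, assuming a PyTorch setting (the layer size, batch size, and magnitude-based mask below are illustrative, not the repo's actual pruning code):

```python
import time
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-in for one LLM projection layer (sizes are illustrative).
layer = nn.Linear(4096, 4096, bias=False)
x = torch.randn(8, 4096)

# Unstructured 50% pruning: zero out the smallest-magnitude weights.
w = layer.weight.data
threshold = w.abs().flatten().kthvalue(w.numel() // 2).values
w[w.abs() <= threshold] = 0.0

def bench(fn, iters=100):
    # Warm up, then time the dense forward pass.
    for _ in range(10):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

print(f"dense forward after pruning: {bench(lambda: layer(x)) * 1e3:.3f} ms")
# The zeros are stored and multiplied like any other value, so this
# latency matches the unpruned layer; a real speedup requires sparse
# kernels or hardware support (e.g., 2:4 sparse tensor cores).
```

On an Ampere-class GPU, one path to actual acceleration is converting a 2:4-pruned weight to PyTorch's semi-structured sparse format (`torch.sparse.to_sparse_semi_structured`), which is why N:M structured sparsity is the pattern that tends to translate into wall-clock gains.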