batch=1: why is adapter latency so much higher than LoRA in the paper?
macqueen09 opened this issue · 3 comments
In the LoRA paper, Section 3:
Adapter Layers Introduce Inference Latency: There are many variants of adapters. We focus on the original design by Houlsby et al. (2019), which has two adapter layers per Transformer block, and a more recent one by Lin et al. (2020), which has only one per block but with an additional LayerNorm (Ba et al., 2016). While one can reduce the overall latency by pruning layers or exploiting multi-task settings, there is no direct way to bypass the extra compute in adapter layers. This seems like a non-issue since adapter layers are designed to have few parameters (sometimes <1% of the original model) by having a small bottleneck dimension, which limits the FLOPs they can add. However, large neural networks rely on hardware parallelism to keep the latency low, and adapter layers have to be processed sequentially. This makes a difference in the online inference setting where the batch size is typically as small as one.
========
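For reference, a minimal sketch of the Houlsby-style bottleneck adapter described above (my own paraphrase in plain PyTorch; the sizes `d_model=768` and `bottleneck=64` are just hypothetical examples, not values from the paper):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Houlsby-style bottleneck adapter (paraphrased, hypothetical sizes).

    It is inserted inline after a Transformer sublayer, so its two small
    matmuls sit on the critical path: the next layer cannot start until
    they finish, even though they add very few parameters.
    """
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down to the bottleneck
        self.up = nn.Linear(bottleneck, d_model)    # project back up
        self.act = nn.GELU()

    def forward(self, h):
        # residual connection around the bottleneck
        return h + self.up(self.act(self.down(h)))
```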
So why is the adapter so special that its layers have to be processed sequentially? Other parts of the LLM, such as a normal Transformer block or the MLP and LayerNorm inside a Transformer, are also run one after another, aren't they?
And why is an adapter so different from, say, an SE block (Squeeze-and-Excitation Networks)?
This is because when the batch size is small, we need to parallelize over width to gain the best hardware efficiency. Adapters add to the depth, which has to be processed sequentially.
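A minimal sketch of why the two differ at inference time (plain PyTorch; the shapes `d=768`, `r=8`, and the bottleneck size 64 are hypothetical and not code from this repo):

```python
import torch
import torch.nn as nn

d, r = 768, 8
W = torch.randn(d, d)          # frozen pretrained weight
A = torch.randn(r, d) * 0.01   # LoRA factors (hypothetical init)
B = torch.zeros(d, r)

# LoRA: the low-rank update B @ A is linear, so it can be merged into W
# once before serving. Batch=1 inference is then a single matmul of the
# same shape as before -- no extra depth, no extra kernels.
W_merged = W + B @ A
x = torch.randn(1, d)          # batch size 1, as in online inference
y_lora = x @ W_merged.T

# Adapter: the nonlinearity prevents folding it into W, so every forward
# pass pays for two extra small matmuls that must run one after the other,
# after the sublayer output is available.
down, up = nn.Linear(d, 64), nn.Linear(64, d)
y_base = x @ W.T
y_adapter = y_base + up(torch.relu(down(y_base)))   # extra sequential step
```

At batch size 1 the large d-by-d matmul comes nowhere near saturating the hardware, so adding width (which LoRA effectively does before merging) costs almost nothing, whereas the adapter's extra kernels cannot start until the sublayer output exists and cannot overlap with anything else, so their launch and compute time shows up directly as added latency.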
@edwardjhu Can you please explain this in layman's terms?