PygmalionAI/aphrodite-engine

[Feature]: Speculative decoding with dual GPUs

Closed this issue · 1 comment

🚀 The feature, motivation and pitch

I got this error when trying speculative decoding with two RTX 4090s:

It looks like support for this was fixed/added recently?

On dual 4090s, it's possible to roughly double the inference speed of a 70B ExLlamaV2 model by using TinyLlama-1.1B-1T-OpenOrca-GPTQ as the draft/speculator model - see this u/lone_striker post:

https://www.reddit.com/r/LocalLLaMA/comments/17h4rqz/speculative_decoding_in_exllama_v2_and_llamacpp/

So it would be awesome to have this in Aphrodite.

As of v0.6.0, the draft model and target model can be loaded at different tensor-parallel degrees. Please see the documentation for the argument to use.
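For reference, a launch along these lines should exercise that setup. This is a sketch, not a verified command: the flag names follow vLLM's speculative-decoding interface, which Aphrodite largely mirrors, and the model repos are examples - check `aphrodite run --help` and the docs for the exact argument names in your version.

```shell
# Sketch of a dual-4090 speculative-decoding launch (flag names assumed
# from the vLLM-style interface -- verify against the Aphrodite docs).
# The 70B target model is split across both GPUs (tensor-parallel size 2),
# while the small TinyLlama draft model runs on a single GPU.
aphrodite run TheBloke/Llama-2-70B-GPTQ \
  --tensor-parallel-size 2 \
  --speculative-model TheBloke/TinyLlama-1.1B-1T-OpenOrca-GPTQ \
  --speculative-draft-tensor-parallel-size 1 \
  --num-speculative-tokens 5
```

Running the draft model at tensor-parallel size 1 avoids paying inter-GPU communication overhead on the small model, which is the usual reason for mixing tensor-parallel levels between draft and target.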