PygmalionAI/aphrodite-engine

[Feature]: Speculative decoding with dual GPUs

Closed this issue · 1 comment

🚀 The feature, motivation and pitch

I got this error when trying speculative decoding with two RTX 4090s:

It looks like support for this was fixed/added recently?

On dual 4090s, it's possible to roughly double the inference speed of a 70B ExLlamaV2 model by using TinyLlama-1.1B-1T-OpenOrca-GPTQ as the draft/speculator model - see this u/lone_striker post:

https://www.reddit.com/r/LocalLLaMA/comments/17h4rqz/speculative_decoding_in_exllama_v2_and_llamacpp/

So it would be awesome to have this in Aphrodite.

As of v0.6.0, the draft model and target model can be loaded at different tensor-parallel degrees. Please see the documentation for the argument to use.
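For reference, a launch along these lines should exercise that setup. This is a sketch, not a verified command: the flag names follow vLLM's speculative-decoding interface, which Aphrodite largely mirrors, and the model repos are examples - check `aphrodite run --help` and the docs for the exact argument names in your version.

```shell
# Sketch of a dual-4090 speculative-decoding launch (flag names assumed
# from the vLLM-style interface -- verify against the Aphrodite docs).
# The 70B target model is split across both GPUs (tensor-parallel size 2),
# while the small TinyLlama draft model runs on a single GPU.
aphrodite run TheBloke/Llama-2-70B-GPTQ \
  --tensor-parallel-size 2 \
  --speculative-model TheBloke/TinyLlama-1.1B-1T-OpenOrca-GPTQ \
  --speculative-draft-tensor-parallel-size 1 \
  --num-speculative-tokens 5
```

Running the draft model at tensor-parallel size 1 avoids paying inter-GPU communication overhead on the small model, which is the usual reason for mixing tensor-parallel levels between draft and target.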