[Feature]: Speculative decoding with dual GPUs
Closed this issue · 1 comment
josephrocca commented
🚀 The feature, motivation and pitch
I got this error when trying speculative decoding with two RTX 4090s:
It looks like this was fixed/added recently?
On dual 4090s, it's possible to double the inference speed of a 70B ExLlamaV2 model by using TinyLlama-1.1B-1T-OpenOrca-GPTQ as the draft/speculator model; see this post by u/lone_striker:
So it would be awesome to have this in Aphrodite.
AlpinDale commented
As of v0.6.0, we can load the draft model and target model at different tensor parallel levels. Please see the documentation for the argument to use.
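For reference, a minimal launch sketch of what this might look like. The flag names here (`--speculative-model`, `--num-speculative-tokens`, `--speculative-draft-tensor-parallel-size`) and the 70B model repo are assumptions based on the vLLM-style CLI that Aphrodite derives from, not verified against the v0.6.0 documentation; check the docs for the exact argument names.

```shell
# Hypothetical invocation, a sketch only: flag names are assumed to mirror
# vLLM's speculative-decoding options and are not verified against the
# Aphrodite v0.6.0 docs. The 70B repo name is a placeholder.
aphrodite run TheBloke/Llama-2-70B-GPTQ \
  --tensor-parallel-size 2 \
  --speculative-model TheBloke/TinyLlama-1.1B-1T-OpenOrca-GPTQ \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1
```

The idea is that the large target model is sharded across both GPUs (`--tensor-parallel-size 2`), while the tiny draft model runs at a smaller tensor-parallel degree, which is the per-model TP capability the comment above describes.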