exl2
eramax opened this issue · 2 comments
eramax commented
Using exl2 at 2.4 bpw you can run Mixtral on Colab. Did you give it a try?
dvmazur commented
Hey! We are currently looking into other quantization approaches, both to improve inference speed and LM quality. How good is exl2's 2.4 quantization? 2.4 bits per parameter sounds like it would hurt perplexity quite a bit. Could you provide any links so we can look into it?
eramax commented
@dvmazur I made this example for you: https://gist.github.com/eramax/b6fc0b472372037648df7f0019ab0e78
One note: a Colab T4 with 15 GB of VRAM is not enough for the context of Mixtral-8x7B.
If it had 16 GB it would work fine, since we need some VRAM for the context besides the model, and the 2.4 bpw model loads in about 14.7 GB.
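For anyone who doesn't want to open the gist, the loading step is roughly the following (a minimal sketch using exllamav2's standard loading API, not copied from the gist; the model directory path and sampler settings are placeholders):

```python
# Sketch: load a 2.4 bpw exl2 quant of Mixtral-8x7B with exllamav2 on a single GPU.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/content/Mixtral-8x7B-exl2-2.4bpw"  # placeholder path to the quantized model
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # KV cache needs VRAM on top of the ~14.7 GB of weights
model.load_autosplit(cache)               # loads layers onto the GPU as it goes

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8  # placeholder sampling settings

print(generator.generate_simple("Mixtral is", settings, 64))
```

On a 15 GB T4 this leaves very little headroom for the cache, which is why the full context doesn't fit.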