mobiusml/aana_sdk

HQQ Deployment Config

Opened this issue · 0 comments

Feature Summary

  • Adding deployments to load HQQ models in Aana

Justification/Rationale

  • Enabling quantization of the models for faster execution

Proposed Implementation

There are two approaches for the implementation:

1 - Load the model directly if the quantized model is already stored on Hugging Face (e.g. https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib)
2 - Quantize the model on the fly using `AutoHQQHFModel.quantize` and add a backend such as BitBLAS to compile the model

Depending on the quantization speed, we can store the quantized model for faster loading later.
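To make the two approaches concrete, here is a minimal sketch of what a deployment config covering both paths could look like. Note that `HQQDeploymentConfig`, its field names, and `load_plan` are all hypothetical illustrations, not existing Aana SDK or HQQ APIs; the actual deployment would call into the HQQ library (e.g. `AutoHQQHFModel`) at load time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HQQDeploymentConfig:
    """Hypothetical config for an HQQ deployment (all names are illustrative)."""

    model_id: str                      # HF repo: pre-quantized model (approach 1) or base model (approach 2)
    quantize_on_the_fly: bool = False  # approach 2: quantize the base model at load time
    nbits: int = 4                     # quantization bit width
    group_size: int = 64               # quantization group size
    backend: Optional[str] = None      # e.g. "bitblas" to compile the quantized model

    def load_plan(self) -> str:
        """Describe which loading path this config selects."""
        if self.quantize_on_the_fly:
            plan = f"quantize {self.model_id} with {self.nbits}-bit, group_size={self.group_size}"
            if self.backend:
                plan += f", then compile with {self.backend}"
            return plan
        return f"load pre-quantized weights from {self.model_id}"


# Approach 1: point at an already-quantized HF repo.
cfg = HQQDeploymentConfig(
    model_id="mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib"
)
print(cfg.load_plan())
```

Keeping both paths behind one config would let a deployment fall back to on-the-fly quantization when no pre-quantized repo exists, and later switch to approach 1 once the quantized weights have been pushed.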

Steps:

  • Add the deployments
  • Add tests for the new deployments
  • Update the docs for HQQ