HQQ Deployment Config
HRashidi commented
Feature Summary
- Adding deployments to load HQQ models in Aana
Justification/Rationale
- Enabling quantization of the models for faster execution
Proposed Implementation
There are two approaches to the implementation:
1. Load the model directly if a quantized version is already stored on Hugging Face (e.g. https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib)
2. Quantize the model on the fly using `AutoHQQHFModel.quantize` and add a backend such as bitblas to compile the model
Depending on the speed, we can store the quantized model for faster loading.
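A rough sketch of the two approaches using the `hqq` library (function names are based on recent `hqq` versions and should be verified against the installed release; `HQQ_CACHE_DIR` is a hypothetical local path, and the heavy imports are deferred so a deployment module stays cheap to import):

```python
# Pre-quantized checkpoint referenced in the issue (approach 1).
QUANTIZED_MODEL_ID = "mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib"
HQQ_CACHE_DIR = "./hqq_cache"  # hypothetical local cache path


def load_prequantized(model_id: str = QUANTIZED_MODEL_ID):
    """Approach 1: load a model that is already HQQ-quantized on the Hub."""
    from hqq.models.hf.base import AutoHQQHFModel

    return AutoHQQHFModel.from_quantized(model_id, device="cuda")


def quantize_on_the_fly(model_id: str, save: bool = True):
    """Approach 2: quantize a full-precision model at load time."""
    import torch
    from hqq.core.quantize import BaseQuantizeConfig
    from hqq.models.hf.base import AutoHQQHFModel
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    # 4-bit weights with group size 64, matching the pre-quantized
    # checkpoint above.
    quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
    AutoHQQHFModel.quantize_model(
        model,
        quant_config=quant_config,
        compute_dtype=torch.float16,
        device="cuda",
    )

    # Optionally patch the linear layers with a faster backend (e.g. bitblas).
    from hqq.utils.patching import prepare_for_inference

    prepare_for_inference(model, backend="bitblas")

    if save:
        # Cache the quantized weights so later startups can use approach 1.
        AutoHQQHFModel.save_quantized(model, HQQ_CACHE_DIR)
    return model
```

If on-the-fly quantization turns out to be slow, the `save_quantized` call above lets the deployment fall back to the fast path on subsequent startups.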
Steps:
- add the deployments
- add tests for the new deployments
- update the docs for HQQ
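For the testing step, a hypothetical pytest sketch of what a deployment test could look like (the `HQQDeployment` class, its import path, and its `generate` method are placeholder names, not an existing Aana API):

```python
import pytest


@pytest.mark.asyncio
async def test_hqq_deployment_generates_text():
    # Placeholder import: replace with the actual deployment once implemented.
    from aana.deployments.hqq_deployment import HQQDeployment  # hypothetical

    deployment = HQQDeployment(
        model_id="mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib"
    )
    output = await deployment.generate(prompt="Hello")  # hypothetical method
    # The deployment should return non-empty generated text.
    assert isinstance(output["text"], str) and output["text"]
```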