mobiusml/aana_sdk

HQQ Deployment Config

Opened this issue · 0 comments

Feature Summary

  • Adding deployments to load HQQ models in Aana

Justification/Rationale

  • Enabling quantization of the models for faster execution

Proposed Implementation

There are two approaches for the implementation:

1 - Load the model directly if the quantized model is already stored on Hugging Face (e.g. https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib)
2 - Quantize the model on the fly using `AutoHQQHFModel.quantize` and add a backend such as BitBLAS to compile the model

Depending on the quantization speed, we can store the quantized model for faster loading later.
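To make the two approaches concrete, here is a minimal sketch of what a deployment config covering both paths could look like. Note that `HQQDeploymentConfig`, its field names, and `load_plan` are all hypothetical illustrations, not existing Aana SDK or HQQ APIs; the actual deployment would call into the HQQ library (e.g. `AutoHQQHFModel`) at load time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HQQDeploymentConfig:
    """Hypothetical config for an HQQ deployment (all names are illustrative)."""

    model_id: str                      # HF repo: pre-quantized model (approach 1) or base model (approach 2)
    quantize_on_the_fly: bool = False  # approach 2: quantize the base model at load time
    nbits: int = 4                     # quantization bit width
    group_size: int = 64               # quantization group size
    backend: Optional[str] = None      # e.g. "bitblas" to compile the quantized model

    def load_plan(self) -> str:
        """Describe which loading path this config selects."""
        if self.quantize_on_the_fly:
            plan = f"quantize {self.model_id} with {self.nbits}-bit, group_size={self.group_size}"
            if self.backend:
                plan += f", then compile with {self.backend}"
            return plan
        return f"load pre-quantized weights from {self.model_id}"


# Approach 1: point at an already-quantized HF repo.
cfg = HQQDeploymentConfig(
    model_id="mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib"
)
print(cfg.load_plan())
```

Keeping both paths behind one config would let a deployment fall back to on-the-fly quantization when no pre-quantized repo exists, and later switch to approach 1 once the quantized weights have been pushed.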

Steps:

  • Add the deployments
  • Add tests for the new deployments
  • Update the docs for HQQ