Quantized-llm-model

Different types of quantized models, their key differences, and the different file formats used to store them.


Quantization

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
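As a rough illustration of the idea, the sketch below performs symmetric per-tensor int8 quantization in NumPy. The function names and the per-tensor symmetric scheme are illustrative assumptions for this README, not the exact algorithm used by any particular quantization method or file format.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    # Scale maps the largest absolute weight onto the int8 range [-127, 127].
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max absolute error:", np.abs(w - w_hat).max())
```

Storing int8 weights plus one float scale uses roughly a quarter of the memory of float32 weights, at the cost of a small reconstruction error, as the printed value shows.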

Model Formats:

GGML (GPT-Generated Model Language): Older format for storing quantized LLMs. Known for some compatibility issues and limitations.

GGUF (GPT-Generated Unified Format): Newer format designed to address GGML's shortcomings. Supports CPU, GPU, and mixed CPU-GPU inference, with more efficient file storage and loading. Often used for LLMs on CPU-based devices such as Apple hardware (see the loading sketch below).
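A minimal sketch of loading a GGUF file, assuming the llama-cpp-python library is installed; the model path is a placeholder, not a file shipped with this repo.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=0,   # 0 = pure CPU inference; raise it to offload layers to a GPU
    n_ctx=2048,       # context window size
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The same GGUF file works for CPU-only, GPU, or mixed inference; only the `n_gpu_layers` setting changes.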

Key Considerations for Choosing:

Hardware: GPU-optimized setup: GPTQ, or GGUF with GPU inference (a GPU loading sketch follows this list). CPU-based setup: AWQ, or GGUF with CPU inference.

Model size: GPTQ and GGUF offer significant compression for large models.

Accuracy requirements: AWQ may preserve accuracy better in some cases.

Inference speed: GPTQ often excels in inference speed on GPUs.

Compatibility: GGUF offers broader compatibility across hardware and frameworks.
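As a hedged example of the GPU-optimized path, a GPTQ checkpoint can be loaded with Hugging Face transformers; this assumes `transformers`, `optimum`, and `auto-gptq` are installed, and the model id below is only a placeholder for a GPTQ-quantized checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder GPTQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # place the quantized weights on the GPU
    torch_dtype=torch.float16,  # activations stay in fp16
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```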

The table below summarizes the key differences between AWQ, GPTQ, GGML, and GGUF:

| Method | Quantization | Focus | Hardware | Accuracy | Speed | Compatibility |
|--------|--------------|-------|----------|----------|-------|---------------|
| AWQ | Activation-aware | Accuracy preservation | CPU/GPU | High | Moderate | Limited |
| GPTQ | GPU-optimized | Compression & speed | GPU | Moderate | High | Broad |
| GGML | Older format | Storage & compatibility | CPU/GPU | Moderate | Moderate | Limited |
| GGUF | Newer unified format | Storage & compatibility | CPU/GPU | High | Moderate | Broad |