llama-3-quant-comparison

Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2.

License: Creative Commons Zero v1.0 Universal (CC0-1.0)

Llama 3 is an amazing open large language model. The 70B variant's weights were published as 130 GB of bfloat16 tensors in safetensors format. The smaller variant, 8B, weighs 15 GB. Thanks to quantization methods, we can run these models on consumer hardware while retaining good quality. I tested how much quantization affects the Instruct variant of these models, using the MMLU test.

Results

Quick intro

What's MMLU?

The "Massive Multitask Language Understanding" test is composed of 14042 multiple choice questions, non-uniformly distributed among 57 categories. "Correctness" in this article refers to the % of questions the model answered correctly.

Example question

Question 45 from the "high school mathematics" category, formatted for Llama 3-Instruct:

<|start_header_id|>user<|end_header_id|>

Question: To place the first paving stone in a path, Alex starts at the crate of stones, walks three feet, places the stone, and returns to the crate. For each subsequent stone, Alex walks two feet farther each way. Alex will place the first 50 stones in a path. After returning to the crate from placing the $50^\text{th}$ stone, what is the total distance Alex walked, in feet?

Choices:
A: 100
B: 90950
C: 5200
D: 50<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Answer:

The model is expected to reply with a single token: A, B, C, or D. Here, C is correct.

Question count per category
Question Count Category
1. 100 abstract algebra
2. 135 anatomy
3. 152 astronomy
4. 100 business ethics
5. 265 clinical knowledge
6. 144 college biology
7. 100 college chemistry
8. 100 college computer science
9. 100 college mathematics
10. 173 college medicine
11. 102 college physics
12. 100 computer security
13. 235 conceptual physics
14. 114 econometrics
15. 145 electrical engineering
16. 378 elementary mathematics
17. 126 formal logic
18. 100 global facts
19. 310 high school biology
20. 203 high school chemistry
21. 100 high school computer science
22. 165 high school european history
23. 198 high school geography
24. 193 high school government and politics
25. 390 high school macroeconomics
26. 270 high school mathematics
27. 238 high school microeconomics
28. 151 high school physics
29. 545 high school psychology
30. 216 high school statistics
31. 204 high school us history
32. 237 high school world history
33. 223 human aging
34. 131 human sexuality
35. 121 international law
36. 108 jurisprudence
37. 163 logical fallacies
38. 112 machine learning
39. 103 management
40. 234 marketing
41. 100 medical genetics
42. 783 miscellaneous
43. 346 moral disputes
44. 895 moral scenarios
45. 306 nutrition
46. 311 philosophy
47. 324 prehistory
48. 282 professional accounting
49. 1534 professional law
50. 272 professional medicine
51. 612 professional psychology
52. 110 public relations
53. 245 security studies
54. 201 sociology
55. 100 us foreign policy
56. 166 virology
57. 171 world religions
What's quantization?

"Quantizing" a model means converting parts of it to lower precision numerical representations to lower its memory use. This can allow running large models on limited hardware, but may hurt quality. Learn more!

Bits per weight, bpw?

Quantization methods typically use mixed precision, expressing different parts of a model in different ways. A way to characterize a quantization in one number is to divide its size (or the size of the quantized parts of the model) in bits by its number of parameters (weights). Mind that the number of parameters is typically expressed in metric "engineering" units (powers of 1000), file size in JEDEC units (powers of 1024), and that a byte is 8 bits, so the formula is:

bpw = 8 × (1024/1000)^3 × (size in GB) / (billions of parameters) ≈
    ≈ 8.59 × (size in GB) / (billions of parameters)
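
As a quick sanity check in code, using approximate numbers for the unquantized 8B model (about 14.96 GB of weights and 8.03 billion parameters):

def bpw(size_gb: float, params_billions: float) -> float:
    # file size reported in GB (powers of 1024), parameter count in metric billions
    return 8 * (1024 / 1000) ** 3 * size_gb / params_billions

print(bpw(14.96, 8.03))  # ~16.0 for 16-bit weights
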
EXL2? GGUF?

These are popular quantized LLM file formats, working with Exllama v2 and llama.cpp, respectively.

Correctness vs Model Size

The following plot shows how the models slowly lose the ability to answer MMLU questions correctly the more quantized they are.

  • The points labeled "70B" correspond to the 70B variant of the Llama 3 model; the rest correspond to the 8B variant.
  • "gguf" used files provided by bartowski. The "Q-numbers" don't correspond to bpw (bits per weight) exactly (see next plot).
  • "exl2" also used files provided by bartowski, in fp16, 8 bpw, 6.5 bpw, 5 bpw, 4.25 bpw, 3.5 bpw.
  • "transformers" refers to evaluating the model using the HuggingFace transformers module and its supported bitsandbytes quantization-on-load options: 8 bit, 4 bit fp4, 4 bit nf4 (normalized float). nf4 is the better performing one.
  • The "model size" here is file size minus the size of the embeddings (which don't get loaded into VRAM) but it does include the head size. It's not easy to fairly compare different frameworks. As an example, I included the exact sizes of layers in a few variants at the end of the article.
Data table

* Note: the 70B model was evaluated with only 50 questions per category, the 8B with full MMLU.

bpw here was calculated only considering Llama 3's model.layers.*.weight layers, as the approach to quantizing the rest of the model differs significantly between methods.

Model size [GB] MMLU [%] bpw Model Quant Type
45.84 * 80.82 5.66 70B Q5_K_M GGUF
34.77 * 80.46 4.26 70B IQ4_XS GGUF
29.32 * 80.06 3.50 70B IQ3_M GGUF
25.16 * 79.09 3.04 70B IQ3_XXS GGUF
22.04 * 77.01 2.62 70B IQ2_M GGUF
20.29 * 76.05 2.38 70B IQ2_S GGUF
19.36 * 74.94 2.35 70B IQ2_XS GGUF
17.46 * 72.31 2.11 70B IQ2_XXS GGUF
15.27 * 65.21 1.81 70B IQ1_M GGUF
13.98 65.20 16.00 8B fp16 GGUF
13.98 65.20 16.00 8B fp16 Exl2
13.98 65.21 16.00 8B bf16 transformers
13.96 * 61.18 1.63 70B IQ1_S GGUF
7.43 65.23 8.50 8B Q8_0 GGUF
6.99 64.53 8.00 8B 8bit transformers
6.99 65.20 7.99 8B 8bit Exl2
5.77 64.99 6.49 8B 8bit Exl2
5.73 65.06 6.56 8B Q6_K GGUF
5.00 64.90 5.67 8B Q5_K_M GGUF
4.87 64.88 5.50 8B Q5_K_S GGUF
4.45 64.27 5.00 8B Q5_K_S Exl2
4.30 64.64 4.82 8B Q4_K_M GGUF
4.09 64.63 4.54 8B Q4_K_S GGUF
4.07 64.33 4.52 8B IQ4_NL GGUF
3.87 64.39 4.28 8B IQ4_XS GGUF
3.84 63.36 4.25 8B IQ4_XS Exl2
3.81 62.85 4.08 8B Q3_K_L GGUF
3.53 62.89 3.79 8B Q3_K_M GGUF
3.49 63.42 4.00 8B 4bitnf4 transformers
3.49 61.75 4.00 8B 4bitfp4 transformers
3.31 62.55 3.50 8B IQ3_M GGUF
3.23 60.28 3.50 8B IQ3_M Exl2
3.21 62.13 3.46 8B IQ3_S GGUF
3.20 59.14 3.44 8B Q3_K_S GGUF
3.06 61.19 3.26 8B IQ3_XS GGUF
2.83 60.52 3.04 8B IQ3_XXS GGUF
2.79 55.90 2.90 8B Q2_K GGUF
2.53 57.56 2.64 8B IQ2_M GGUF
2.35 53.98 2.40 8B IQ2_S GGUF
2.26 49.98 2.37 8B IQ2_XS GGUF
2.07 43.50 2.14 8B IQ2_XXS GGUF
1.85 28.83 1.84 8B IQ1_M GGUF
1.71 26.47 1.66 8B IQ1_S GGUF
Selected results per category

This table shows the average confidence per category. Since the 70B models were only evaluated on 50 questions per category, and some categories have 500+ questions, the individual results may not be very comparable between 70B and 8B.

category 70B-Q5_K_M 70B-IQ2_XXS 8B-Q8_0 8B-IQ2_M
marketing 98.1% 94.2% 89.0% 83.2%
high school government and politics 98.1% 97.8% 90.1% 80.8%
medical genetics 96.6% 85.3% 82.7% 71.0%
jurisprudence 96.0% 93.7% 78.0% 71.0%
high school us history 95.3% 89.0% 80.0% 70.3%
high school psychology 94.8% 91.7% 84.1% 76.5%
high school microeconomics 93.9% 80.8% 75.9% 62.0%
human sexuality 93.5% 81.3% 77.7% 66.1%
astronomy 93.4% 81.2% 70.9% 62.3%
business ethics 93.2% 76.0% 66.6% 60.0%
us foreign policy 92.6% 91.2% 85.9% 78.3%
prehistory 92.5% 85.2% 73.7% 64.5%
nutrition 92.0% 89.9% 76.3% 64.1%
high school world history 91.3% 88.7% 82.8% 73.8%
college biology 90.9% 85.1% 79.3% 67.0%
high school geography 90.9% 85.7% 83.8% 74.2%
miscellaneous 90.5% 86.8% 82.8% 75.6%
high school computer science 90.3% 83.3% 71.2% 62.9%
management 89.9% 88.4% 83.5% 73.8%
sociology 89.5% 83.6% 84.4% 79.4%
international law 87.9% 86.4% 78.2% 69.7%
conceptual physics 87.4% 82.4% 57.0% 47.8%
world religions 87.2% 82.1% 82.5% 77.5%
professional medicine 86.8% 76.4% 71.7% 58.2%
philosophy 86.7% 73.2% 71.4% 66.1%
computer security 86.5% 86.5% 76.7% 73.7%
moral scenarios 86.0% 49.3% 43.5% 33.3%
human aging 85.4% 83.8% 71.8% 64.7%
high school biology 84.7% 79.0% 80.3% 71.1%
college medicine 84.4% 74.3% 65.6% 59.6%
logical fallacies 84.3% 77.6% 77.8% 69.6%
professional psychology 83.8% 75.5% 69.4% 60.9%
high school european history 83.2% 79.6% 77.9% 72.6%
clinical knowledge 82.5% 72.6% 75.0% 65.2%
high school macroeconomics 82.1% 79.2% 66.3% 55.9%
anatomy 81.7% 66.2% 69.6% 56.7%
electrical engineering 81.0% 71.8% 62.8% 55.7%
security studies 78.7% 77.5% 72.9% 68.5%
high school statistics 77.9% 53.1% 52.6% 49.8%
public relations 77.7% 64.8% 70.2% 61.3%
elementary mathematics 75.7% 63.6% 46.0% 39.1%
machine learning 74.3% 62.8% 49.8% 42.5%
high school physics 72.2% 60.7% 37.9% 33.5%
moral disputes 69.5% 60.6% 72.5% 64.7%
high school chemistry 65.8% 59.4% 52.0% 44.4%
college computer science 65.6% 57.5% 55.0% 50.6%
college physics 65.2% 49.9% 45.7% 43.0%
formal logic 62.5% 50.1% 49.7% 42.2%
econometrics 61.9% 52.2% 53.0% 41.5%
abstract algebra 60.7% 41.7% 29.7% 29.5%
college mathematics 60.0% 43.1% 36.6% 31.1%
virology 59.1% 55.6% 51.6% 49.6%
professional law 58.0% 52.6% 46.8% 41.6%
global facts 56.7% 44.6% 39.1% 33.0%
professional accounting 54.8% 45.4% 52.1% 47.1%
high school mathematics 54.1% 44.4% 34.9% 29.6%
college chemistry 52.7% 49.4% 45.1% 40.7%

Key takeaways:

  • Modern GGUF "I-Quants" seem to offer the best quality at a given size.
  • All tested frameworks perform the same at full 16-bit precision.
  • Going as low as ~5 bpw has minimal impact on quality in this test.
  • transformers (bitsandbytes) quantization is slightly lower quality, except 4-bit nf4, which interestingly keeps up with GGUF, beating all "Q3" quants.

Is ExLlamaV2 underperforming?

At lower bpw it seems to score lower on MMLU than GGUF quants of the same file size. However, file size does not exactly correspond to the memory a model will use; comparing by the average bpw of the quantized layers (as in the next figure) may be fairer. Still, ExLlamaV2 offers some advantages:

  • It can be faster than fully offloaded GGUF, depending on the task. In this test it was almost twice as fast, processing about 14,000 tokens per second vs 7,500 for llama.cpp.
  • It offers a 4-bit KV cache, which roughly quarters the memory needed for the context. If you count context memory, EXL2 may be your best option until llama.cpp implements this feature (see: 1, 2), especially if you need 16k+ context length (see the sketch after this list).
  • I personally found it the easiest and most pleasant to work with from Python, but that's subjective and depends on the task.
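
A minimal sketch of how the quantized cache is enabled, assuming your exllamav2 version exposes ExLlamaV2Cache_Q4 (check your installed release); it is a drop-in replacement for the ExLlamaV2Cache used in the inference snippet later in this article:

from exllamav2 import ExLlamaV2, ExLlamaV2Cache_Q4, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "path/to/model-exl2"
config.prepare()
config.max_seq_len = 16384  # long contexts are where the 4-bit cache pays off

model = ExLlamaV2(config)
# Quantized K/V cache: roughly a quarter of the fp16 cache memory
cache = ExLlamaV2Cache_Q4(model, max_seq_len=16384, lazy=True)
model.load_autosplit(cache)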

Confidence vs bpw

Confidence here is the normalized probability that the model assigns to the correct answer when only the 4 tokens corresponding to valid answers are considered, averaged over questions. Random noise would result in 25% confidence (and 25% correctness), because the probabilities of the 4 possible answers are normalized to add up to 100%.
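
In terms of the four answer logits $z_A, z_B, z_C, z_D$, the per-question confidence is a softmax restricted to those tokens, evaluated at the correct letter, and then averaged over questions:

$$\text{confidence} = \frac{e^{z_\text{correct}}}{e^{z_A} + e^{z_B} + e^{z_C} + e^{z_D}}$$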

The main takeaway here is that the 70B model is less affected by quantization; perhaps it is sparser than the 8B one. Extremely low quants of the 70B remain somewhat usable, whereas 8B IQ1_M and IQ1_S are near the random-noise threshold.

Here I plotted the loss of confidence (the drop from the maximum). It seems to scale as $\propto \text{bpw}^{-4.25}$. I had to include this, because no one can resist "things looking linear on a log-log plot."

Methodology

Shortcomings of this test

  • When calculating MMLU, I made the mistake of using the first five questions of each category's "test" split as the 5-shot examples, instead of using the "dev" split. This shifts the MMLU values slightly, but isn't an issue in this study, as I'm only comparing my own tests to one another and focusing on their relative differences.
  • I skipped around 20 questions where the 5-shot prompt was above 2048 tokens.
  • I did not set a constant random seed, perhaps introducing unnecessary noise to the test.

Applicability of this test

From anecdotal experience, it seems that quantization affects "rigorous" tasks like writing working source code more than open-ended tasks like creative writing. It would be interesting to methodically measure the effect of quantization on demanding programming benchmarks.

Shortcomings of MMLU

It's okay for this purpose

MMLU is fine for this purpose, because this is essentially an ablation study: even if MMLU is a flawed quality benchmark, it's good enough to show how a model's answers change with quantization.

Is MMLU still relevant?

It used to be arguable whether MMLU is a good benchmark, but it does not really matter any more, as top models are scoring around 90%. There's not much room for improvement, but fortunately harder benchmarks are being proposed.

It's partially broken

Some of the MMLU questions are broken, of arguable usefulness, opinionated, or lacking necessary context. For example, question 133 from the "high school psychology" category:

As a result of an accident, Abdul lost sight in his right eye. To judge the distance of vehicles when he is driving, Abdul is able to rely on cues of

A. I only
B. II only
C. III only
D. I and II only

This question is missing the statements numbered I, II, and III that are necessary to answer it.

Inference code

I based my code on the test included in ExLlamaV2's repository, but modified it heavily.

You can find pre-compiled Python wheels for inference libraries listed in the text-generation-webui repository.

The MMLU dataset can be found on HuggingFace and read with pandas.

The snippets assume that you have already loaded the questions (formatted as shown earlier) into a list of strings called prompts, and the indices of the correct answers into a list called answers.
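
For illustration, here is a sketch of how that list could be built with pandas; the file path and column names (question, choices, answer) are assumptions about the HuggingFace copy of MMLU, and the 5-shot examples and the leading <|begin_of_text|> token are omitted for brevity:

import pandas as pd

df = pd.read_parquet("mmlu/test.parquet")  # hypothetical local copy of the "test" split

HEADER = "<|start_header_id|>user<|end_header_id|>\n\n"
FOOTER = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAnswer:"

prompts, answers = [], []
for _, row in df.iterrows():
    choices = "\n".join(f"{letter}: {text}" for letter, text in zip("ABCD", row["choices"]))
    prompts.append(f"{HEADER}Question: {row['question']}\n\nChoices:\n{choices}{FOOTER}")
    answers.append(int(row["answer"]))  # index 0-3 of the correct choice (answer_id below)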

transformers

Simplified transformers source code
import torch
import transformers

model_path = "path/to/model"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
config = transformers.AutoConfig.from_pretrained(model_path)
config.max_position_embeddings = 2048

quantization_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = transformers.LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    config=config,
    device_map="auto",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
)

answer_tokens = tokenizer.encode(
    " A B C D", add_special_tokens=False, return_tensors="pt"
)

with torch.no_grad(): # crucial for lower memory use
    for prompt, answer in zip(prompts, answers):
        prompt_ids = tokenizer.encode(
            prompt, add_special_tokens=False, return_tensors="pt"
        )

        logits_ans = model.forward(prompt_ids.cuda()).logits[:, -1, answer_tokens].cpu()
        # process the answer
        torch.cuda.empty_cache()

llama-cpp-python

Make sure to install a correct build of llama-cpp-python, with CUDA support if you can use it. Adjust n_gpu_layers if you can't offload the full model. A model's total number of layers is listed in its config.json as num_hidden_layers.

Simplified llama-cpp-python source code
import torch
from llama_cpp_cuda_tensorcores import Llama, llama_tokenizer  # wheel from text-generation-webui; with plain llama-cpp-python, import from "llama_cpp" instead

model_path = "path/to/model.gguf"
tokenizer_base = "path/to/model"  # where tokenizer.json is located

llama_params = {
    "model_path": model_path,
    "n_ctx": 2048,  # Text context, 0 = from model
    "n_batch": 512,  # Prompt processing maximum batch size
    "n_gpu_layers": -1,  # -1 offloads ALL layers
    "n_threads": 8,  # Number of threads to use for generation
    "n_threads_batch": 8,  # Number of threads to use for batch processing
    "logits_all": False,  # Not needed for model.eval()
    "offload_kqv": True,  # Offload K, Q, V to GPU.
    "tokenizer": llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        tokenizer_base
    ),  # Optional tokenizer to override the default tokenizer from llama.cpp.
    "verbose": False,  # Don't print verbose output to stderr.
}

model = Llama(**llama_params)

answer_tokens = model.tokenize(" A B C D".encode(), add_bos=False)

for prompt, answer in zip(prompts, answers):
    prompt_ids = model.tokenize(prompt.encode(), add_bos=False)

    model.reset()
    model.eval(prompt_ids)
    logits = model.scores[model.n_tokens - 1]
    logits_ans = torch.tensor([logits[i] for i in answer_tokens], device="cpu")

ExLlamaV2

Simplified ExLlamaV2 source code
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Cache,
    ExLlamaV2Config,
    ExLlamaV2Tokenizer,
)

model_path = "path/to/model-exl2"
config = ExLlamaV2Config()
config.model_dir = model_path
config.prepare()
config.max_seq_len = 2048
model = ExLlamaV2(config)
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model, max_seq_len=2048, lazy=True)
model.load_autosplit(cache)

answer_logits = tokenizer.encode(" A B C D")

for prompt, answer in zip(prompts, answers):
    prompt_ids = tokenizer.encode(prompt)
    logits = model.forward(prompt_ids, last_id_only=True)
    logits_ans = logits[:, :, answer_logits].cpu()

Evaluating the results from logits_ans involves checking if the highest logit corresponds to the correct answer. To measure confidence, record the normalized probability for the correct answer. Here, answer_id is in {0, 1, 2, 3} and corresponds to the correct answer token.

prob_ans = torch.softmax(logits_ans.flatten(), dim=-1)  # flatten to shape (4,), whichever framework produced the logits
confidence = float(prob_ans[answer_id])
correct = bool(prob_ans.argmax() == answer_id)

Lastly, record the individual results per question or compute the averages, minding the varying number of questions per category.
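
One way to do that aggregation, assuming the per-question results were collected as a list of dicts (hypothetical rows variable) with category, correct and confidence keys:

import pandas as pd

results = pd.DataFrame(rows)  # rows gathered in the evaluation loop above

# Per-category averages, as in the "Selected results per category" table
per_category = results.groupby("category")[["correct", "confidence"]].mean()
print(per_category.sort_values("confidence", ascending=False))

# Overall MMLU %: a plain mean over all questions, so large categories
# (professional law, moral scenarios, ...) weigh more than small ones
print("MMLU:", 100 * results["correct"].mean())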

The Tensors

Size of individual model layers

This table compares the layer sizes of:

  • the original transformers Llama 8B in bfloat16
  • EXL2 8.0 bpw
  • GGUF Q8_0 (which is ~8.5 bpw)
layer transformers [bytes] EXL2 [bytes] EXL2 [bpw] GGUF [bytes] GGUF [bpw]
model.embed_tokens 1,050,673,152 1,050,673,152 16.00 558,170,112 8.50
lm_head 1,050,673,152 527,405,248 8.03 558,170,112 8.50
model.norm 8,192 8,192 16.00 16,384 32.00
*.input_layernorm 262,144 262,144 16.00 524,288 32.00
*.self_attn.q_proj 1,073,741,824 539,498,496 8.04 570,425,344 8.50
*.self_attn.k_proj 268,435,456 135,272,448 8.06 142,606,336 8.50
*.self_attn.v_proj 268,435,456 135,272,448 8.06 142,606,336 8.50
*.self_attn.o_proj 1,073,741,824 539,498,496 8.04 570,425,344 8.50
*.post_attention_layernorm 262,144 262,144 16.00 524,288 32.00
*.mlp.down_proj 3,758,096,384 1,875,792,896 7.99 1,996,488,704 8.50
*.mlp.gate_proj 3,758,096,384 1,874,073,600 7.98 1,996,488,704 8.50
*.mlp.up_proj 3,758,096,384 1,874,073,600 7.98 1,996,488,704 8.50
model.layers.* 13,959,168,000 6,974,006,272 7.99 7,416,578,048 8.50

All the "*." layers add up to "model.layers.*".