ndif-team/nnsight

NNsight bug with quantized models

Closed this issue · 3 comments

I'm using the NNsight class to wrap a 4-bit quantized LLaVA model and encountered this error:

...

File ~/miniconda3/envs/llava/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:429, in Linear4bit.forward(self, x)
    426     x = x.to(self.compute_dtype)
    428 bias = None if self.bias is None else self.bias.to(self.compute_dtype)
--> 429 out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
    431 out = out.to(inp_dtype)
    433 return out

File ~/miniconda3/envs/llava/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py:1819, in FakeCopyMode.__torch_function__(self, func, types, args, kwargs)
   1817 else:
   1818     with torch._C.DisableTorchFunctionSubclass():
-> 1819         return func(*args, **kwargs)

TypeError: Multiple dispatch failed for 'torch._ops.aten.t.default'; all __torch_dispatch__ handlers returned NotImplemented:
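
For reference, the LLaVA model was loaded and wrapped roughly like this (a minimal sketch; the checkpoint name is a placeholder, not taken from the report above):

import torch
from nnsight import NNsight
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# from_pretrained with a quantization_config replaces nn.Linear layers with
# bitsandbytes Linear4bit modules.
llava = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",  # placeholder checkpoint
    device_map="auto",
    quantization_config=quantization_config,
)

# Wrapping with NNsight is what later hits the fake-tensor dispatch error above.
llava = NNsight(llava)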

I suspected the quantization was causing the error, so I tried to replicate it on GPT-2 and got the same traceback.
Code:

import torch
from nnsight import NNsight
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer
model_path = "openai-community/gpt2"
kwargs = {"device_map": "auto"}
kwargs['quantization_config'] = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)
q_model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
q_tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

q_model = NNsight(q_model)

input_ids = q_tokenizer.encode("The Eiffel Tower is in the city of", return_tensors="pt")

with q_model.trace() as tracer:
    with tracer.invoke(input_ids):
        q_model.transformer.h[-1].mlp.output[0][:] = 0
        intervention = q_model.lm_head.output.argmax(dim=-1).save()
    with tracer.invoke(input_ids):
        original = q_model.lm_head.output.argmax(dim=-1).save()

I'm able to get around this error by using the LanguageModel class instead, but that runs into a separate problem: LLaVA prepares its inputs differently from normal LMs (it processes images as well), which causes another error (see #85).
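
For context, LLaVA inputs come from a processor that consumes both text and an image, so the forward pass expects pixel_values alongside input_ids, unlike the tokenizer-only path LanguageModel assumes. A minimal sketch (the checkpoint and image path are placeholders):

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open("example.png")

# The processor returns input_ids, attention_mask and pixel_values; the prompt
# carries an <image> placeholder token marking where the image features go.
inputs = processor(
    text="USER: <image>\nWhat is shown in this picture? ASSISTANT:",
    images=image,
    return_tensors="pt",
)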

Environment:

pytorch                   2.1.0
pytorch-cuda              11.8
transformers              4.37.2
accelerate                0.27.0
bitsandbytes              0.43.0
...

This works:

from nnsight import LanguageModel
from transformers import BitsAndBytesConfig, AutoModelForCausalLM,AutoTokenizer
import torch

model_path = "openai-community/gpt2"
kwargs = {"device_map": "auto"}
kwargs['quantization_config'] = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)

q_model = LanguageModel(model_path, **kwargs)

with torch.no_grad():

    # validate=False and scan=False skip the fake-tensor shape-scanning pass,
    # which is what fails on the bitsandbytes Linear4bit weights above.
    with q_model.trace(validate=False) as tracer:
        with tracer.invoke("Green eggs and", scan=False):
            q_model.transformer.h[-1].mlp.output[:] = 0.
            intervention = q_model.lm_head.output.softmax(-1).argmax(dim=-1).save()

        with tracer.invoke("Green eggs and", scan=False):
            original = q_model.lm_head.output.softmax(-1).argmax(dim=-1).save()
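
Presumably the same flags also work with the plain NNsight wrapper from the first snippet, since LanguageModel builds on it; an untested sketch reusing q_model and input_ids from that snippet:

with torch.no_grad():
    with q_model.trace(validate=False) as tracer:
        with tracer.invoke(input_ids, scan=False):
            q_model.transformer.h[-1].mlp.output[0][:] = 0
            intervention = q_model.lm_head.output.argmax(dim=-1).save()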

Going to expand this discussion on the Discord: https://discord.gg/sF64FeFq

Cool, thanks!