qwopqwop200/GPTQ-for-LLaMa

Proposed changes to reduce VRAM usage and potentially quantize larger models on consumer hardware.

sigmareaver opened this issue · 3 comments

Hello everyone,

Recently I noticed a lack of 4-bit quantized versions of Google/flan-ul2 on HF, and so I set out to quantize the model myself on my 4090.

I struggled with this for a bit due to my own oversights, but while debugging I found a section of code that runs quickly on the CPU yet requires a significant memory allocation when run on the GPU. I therefore propose the following changes inside t5_sequential() in the t5.py script:

        # end of the per-layer quantization loop: release the layer and its
        # GPTQ state, then return cached blocks to the CUDA allocator
        del layer
        del gptq
        gc.collect()
        torch.cuda.empty_cache()

        inps, outs = outs, inps

    # do this part on the CPU, because the GPU runs out of memory here
    dev = 'cpu'

    model.encoder.final_layer_norm = model.encoder.final_layer_norm.to(dev)
    model.encoder.dropout = model.encoder.dropout.to(dev)

    encoder_hidden_states = model.encoder.final_layer_norm(inps.cpu())
    encoder_hidden_states = model.encoder.dropout(encoder_hidden_states)

    model.encoder.final_layer_norm = model.encoder.final_layer_norm.cpu()
    model.encoder.dropout = model.encoder.dropout.cpu()

    # move the activations back to the GPU for the rest of the pipeline
    dev = 'cuda:0'
    encoder_hidden_states = encoder_hidden_states.to(dev)
    inps = inps.to(dev)
    # end of CPU section

At the moment I'm not actually certain whether gc.collect() is necessary, but it doesn't hurt. The real memory savings come from running model.encoder.final_layer_norm and model.encoder.dropout on the CPU. This, combined with --n_samples 256 and PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512, enabled me to quantize a 20B model on my 4090, which I otherwise would not have been able to do, at least not with that sample count.

Note: using gc.collect() requires adding import gc near the top of the file.
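
For anyone reproducing this, here is a minimal sketch of how those pieces could be wired together; release_gpu_memory is a name I made up for illustration, and the allocator hint only takes effect if it is set before the first CUDA allocation:

    import gc
    import os

    # Must be set before the first CUDA tensor is created, so either export it
    # in the shell or set it at the very top of t5.py.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

    import torch

    def release_gpu_memory():
        # Drop unreachable Python objects, then hand cached blocks back to the allocator.
        gc.collect()
        torch.cuda.empty_cache()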

neat

sigjhl commented

Hey, thank you for your flan-ul2 quant! I really appreciate it. It's a very useful model, but unfortunately it has been overlooked amid the flood of LLaMA models.

Seeing your work, I tried to quantize flan-t5-xxl, but the resulting file outputs nonsense. Also, while quantizing, the error reported per layer seems large (in the tens to hundreds), although I'm not familiar with the normal range. I tried to monkey-patch here and there, but as a programming novice I'm at my wits' end.

Can you please point me in the right direction, or share your modifications to the t5 branch?

sigmareaver commented

Aside from the change listed in this issue, I've made no other changes. I honestly don't know why your quantized model isn't working, but according to the config.json files, flan-ul2 is bfloat16 while flan-t5-xxl is float32. I'm not sure whether that makes a difference, but it might. I also don't know which flan-t5-xxl files you downloaded; it looks like the repo offers several formats. I'd recommend getting only the .bin files and double-checking their SHA256 hashes against what is listed on HF. Sorry I couldn't be more help.
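
For the checksum step, here is a minimal sketch of how the downloaded shards could be verified locally; the directory path and the pytorch_model-*.bin shard pattern are assumptions about how the files were saved:

    import hashlib
    from pathlib import Path

    def sha256_of(path, chunk_size=1 << 20):
        # Stream the file in 1 MiB chunks so multi-gigabyte shards never sit in RAM.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Print a checksum for every shard in the local download, then compare each
    # value against the SHA256 shown for that file on the model's HF "Files" page.
    for shard in sorted(Path("./flan-t5-xxl").glob("pytorch_model-*.bin")):
        print(shard.name, sha256_of(shard))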