Proposed changes to reduce VRAM usage. Potentially quantize larger models on consumer hardware.
sigmareaver opened this issue · 3 comments
Hello everyone,
Recently I noticed a lack of 4-bit quantized versions of Google/flan-ul2 on HF, so I set out to quantize the model on my 4090.
I struggled with this for a bit due to my own oversights, but during my debugging I found a section of code that does not take long to run on the CPU, yet requires a significant memory allocation when run on the GPU. I therefore propose the following changes inside t5_sequential() in the t5.py script:
del layer
del gptq
gc.collect()
torch.cuda.empty_cache()
inps, outs = outs, inps
# do this part on CPU, because GPU runs out of memory
dev = 'cpu'
model.encoder.final_layer_norm = model.encoder.final_layer_norm.to(dev)
model.encoder.dropout = model.encoder.dropout.to(dev)
encoder_hidden_states = model.encoder.final_layer_norm(inps.cpu())
encoder_hidden_states = model.encoder.dropout(encoder_hidden_states)
model.encoder.final_layer_norm = model.encoder.final_layer_norm.cpu()
model.encoder.dropout = model.encoder.dropout.cpu()
dev = 'cuda:0'
encoder_hidden_states = encoder_hidden_states.to(dev)
inps = inps.to(dev)
# end of CPU section
At the moment I'm not certain whether gc.collect() is strictly necessary, but it doesn't hurt. The real memory savings come from running model.encoder.final_layer_norm and model.encoder.dropout on the CPU. This (combined with --n_samples 256 and PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512) enabled me to quantize a 20B model on my 4090, which I otherwise would not have been able to do, or at least not with that sample count.
Note: using gc.collect() requires adding import gc somewhere in the file (preferably near the top).
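For completeness, a rough sketch of those supporting pieces outside t5_sequential() could look like the following (the free_gpu_memory helper name is purely illustrative and not part of the actual script, and the environment variable can just as well be set in the shell that launches t5.py):
import os

# Must be set before the first CUDA allocation for the allocator to pick it up.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

import gc
import torch

def free_gpu_memory():
    # Drop dangling Python references, then release cached blocks held by the allocator.
    gc.collect()
    torch.cuda.empty_cache()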
neat
Hey, thank you for your flan-ul2 quant! I really appreciate it. It's a very useful model, but unfortunately overlooked due to the flood of llamas.
Seeing your work, I tried to quantize flan-t5-xxl, but the resulting file outputs nonsense.
Also, while quantizing, the per-layer error reported seems large (in the tens to hundreds), although I'm not familiar with the normal range.
I tried to monkey-patch here and there, but as a novice programmer I'm at my wits' end.
Can you please point me in the right direction, or share your modifications to the t5 branch?
Aside from the change listed in this issue, I've made no other changes. I don't really know why your quantized model isn't working, but according to the config.json files, flan-ul2 is bfloat16 while flan-t5-xxl is float32. I'm not sure whether that makes a difference, but it might. I also don't know which flan-t5-xxl weights you downloaded; it looks like the repo has several formats. I'd recommend only getting the .bin files and double-checking their SHA256 against what is listed on HF. Sorry I couldn't be more help.
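For instance, a quick local check could look something like this (just a sketch; the pytorch_model-*.bin glob assumes the usual sharded naming and that you run it from the download directory):
# Hash each downloaded shard so it can be compared against the SHA256 values
# shown on the Hugging Face file pages.
import hashlib
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for shard in sorted(Path(".").glob("pytorch_model-*.bin")):
    print(shard.name, sha256sum(shard))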