Forked from PEFT. Thanks for their outstanding work. I hope this fork helps developers easily reproduce and identify the problem.
I want to use LoRA to fine-tune an LLM.
The goal is to use LoRA to fine-tune the mt0-xxl model for a complete epoch with no out-of-memory error on the opus_dev_dev_tst or opus9_select_train_labse_dev_tst_add_task datasets.
- When I fine-tune mt0-xxl (13B) through PEFT LoRA on 2 V100 (32 GB) GPUs with batch_size=8 on the wmt16enro_dev_dev_test_task.zip data (train: 3998, validation: 3998, test: 3998), the program runs successfully.
- However, when I switch to 372,000 training examples (opus_dev_dev_tst) and run on 4 or 8 V100s with batch_size=4, I hit CUDA out of memory (always at step 5952/23192 on 4 V100s and at step 2976/11596 on 8 V100s): exactly half the training step when using twice as many GPUs.
- Additionally, when I reduce the data size and fine-tune on the opus9_select_train_labse_dev_tst_add_task training data on 4 or 8 V100s, I still hit CUDA out of memory (always at step 303/1000).
- I attempted to modify many parameters that might affect GPU memory, but the step at which the OOM appears remains completely unchanged.
To save GPU memory, I have tried the following methods (I can't rule out conflicts between them; a configuration sketch follows the list):
1. DeepSpeed ZeRO stage 3, 2, and 1 + offload
2. bf16
3. `os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"`
4. `model.enable_input_require_grads()` and `model.gradient_checkpointing_enable()`
5. `gc.collect()`, `torch.cuda.empty_cache()`, `get_accelerator().empty_cache()`, and `model.empty_partition_cache()`
6. `outputs = model(**batch, use_cache=False)`
7. Reducing: batch size, `stage3_prefetch_bucket_size`, `stage3_param_persistence_threshold`, `stage3_max_live_parameters`, `stage3_max_reuse_distance`
8. Increasing: `gradient_accumulation_steps`
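For reference, below is a minimal sketch of how these settings could be combined into a DeepSpeed config. The numeric values are illustrative assumptions, not the exact configuration used in mt0.sh, and `model` refers to the PEFT-wrapped mt0-xxl (see the setup sketch further down):

```python
import os

# Item 3: cap the allocator's split size (illustrative value).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Items 1, 2, 7, 8: ZeRO-3 with CPU offload, bf16, reduced ZeRO-3 buckets,
# and more gradient accumulation. Values are assumptions for illustration.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "stage3_prefetch_bucket_size": 1e6,
        "stage3_param_persistence_threshold": 1e4,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
    },
    "gradient_accumulation_steps": 8,
    "train_micro_batch_size_per_gpu": 4,
}

# Item 4: gradient checkpointing (input grads are needed when only LoRA
# parameters are trainable).
model.enable_input_require_grads()
model.gradient_checkpointing_enable()

# Item 6 is applied inside the training loop:
#     outputs = model(**batch, use_cache=False)
```

The periodic cache clearing from item 5 is shown in the training-loop sketch near the end of this report.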
The main difference between the runs that succeed and the runs that fail is the data size. Is it a matter of data size?
ZeRO-3 should automatically gather and partition model parameters during the forward and backward passes. According to the ZeRO stage-3 design, if I only increase the data size, I should have no problem running on 4 V100s.
Could you give me some suggestions on how to partition the model parameters in a more controllable and reasonable way?
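One way to take more explicit control over ZeRO-3 parameter partitioning is through DeepSpeed's `zero.Init` and `zero.GatheredParameters` context managers. The snippet below is only an API sketch under assumed names (`ds_zero3_config.json`, the `lm_head` module) and assumes the distributed environment is already initialized; I have not verified that it resolves this issue:

```python
import torch.distributed as dist
import deepspeed
from transformers import AutoModelForSeq2SeqLM

# Partition parameters across ranks already at construction time, so no single
# rank ever has to hold the full copy of the 13B model.
with deepspeed.zero.Init(config_dict_or_path="ds_zero3_config.json"):
    model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-xxl")

# Under ZeRO-3 each parameter stays partitioned and is only gathered on demand
# during forward/backward. GatheredParameters temporarily materializes selected
# parameters, e.g. to inspect or modify them, and re-partitions them on exit.
with deepspeed.zero.GatheredParameters(model.lm_head.parameters(), modifier_rank=0):
    if dist.get_rank() == 0:
        print(model.lm_head.weight.shape)  # full, un-partitioned shape
```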
```bash
git clone https://github.com/dsj96/peft.git
cd pre_trained_model
git clone https://huggingface.co/bigscience/mt0-xxl
```
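For reference, a minimal sketch of wrapping the cloned mt0-xxl checkpoint with a PEFT LoRA adapter; the hyperparameters and target modules are illustrative assumptions, not necessarily those used in the released script:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_path = "pre_trained_model/mt0-xxl"  # the locally cloned checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Illustrative LoRA hyperparameters for a T5-style encoder-decoder model.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q", "v"],  # assumption: query/value projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights should be trainable
```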
In order to reproduce the bug, I have released the datasets used in training in the `data` directory.
| dataset | data size | OOM step (V100×4) | OOM step (V100×8) | batch_size |
|---|---|---|---|---|
| opus_dev_dev_tst | train: 371068, validation: 371074, test: 371074 | 5952/23192 | 2976/11596 | 4 |
| opus9_select_train_labse_dev_tst_add_task | train: 32000, validation: 32000, test: 32000 | 303/1000 | - | 8 |
| wmt16enro_dev_dev_test_task | train: 3998, validation: 3998, test: 3998 | runs correctly for a complete epoch (2 V100) | - | 8 |
Note that the OOM occurs at exactly half the training step when twice as many GPUs are used.
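A quick calculation makes this concrete: the failure always occurs after the same number of training examples, regardless of GPU count. The script below just reproduces the arithmetic from the table (no gradient accumulation is assumed):

```python
import math

train_size = 371068  # opus_dev_dev_tst
batch_size = 4       # per-GPU batch size

for num_gpus, oom_step in [(4, 5952), (8, 2976)]:
    global_batch = batch_size * num_gpus
    steps_per_epoch = math.ceil(train_size / global_batch)
    samples_seen_at_oom = oom_step * global_batch
    print(f"{num_gpus} GPUs: {steps_per_epoch} steps/epoch, "
          f"OOM after {samples_seen_at_oom} examples")

# 4 GPUs: 23192 steps/epoch, OOM after 95232 examples
# 8 GPUs: 11596 steps/epoch, OOM after 95232 examples
```

In both runs the OOM hits after exactly 95232 examples, which suggests the failure is tied to a specific region of the (apparently unshuffled) data stream, for instance an unusually long batch, rather than to a gradual memory leak.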
Tesla V100 (32 GB) × 4 or 8.
For more details, refer to `pip_list.txt`.
`ds_report` output:

```
[2023-07-20 04:58:58,855] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/opt/conda/lib/python3.8/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
```
The training script `examples/mt0_peft_lora_ds_zero3_offload.py` was modified from `examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py`.
```bash
bash mt0.sh
```
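For context, the core of the script roughly follows the Accelerate-driven loop of the original PEFT example. The sketch below is simplified and assumes `model`, `optimizer`, and `train_dataloader` are built as in the LoRA sketch above, with the DeepSpeed ZeRO-3 settings supplied through accelerate's own configuration:

```python
import gc
import torch
from accelerate import Accelerator
from deepspeed.accelerator import get_accelerator

accelerator = Accelerator()  # DeepSpeed plugin/config comes from the accelerate config
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch, use_cache=False)  # item 6: no KV cache during training
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()

    if step % 100 == 0:  # item 5: periodic cache cleanup (illustrative interval)
        gc.collect()
        torch.cuda.empty_cache()
        get_accelerator().empty_cache()
```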
Some failed config files and log reports from training on the opus9_select_train_labse_dev_tst_add_task dataset are in the `error_log` directory.