ssbuild/chatglm_finetuning

ptv2 runs out of GPU memory?

sanwei111 opened this issue · 11 comments

GPUs: two V100s, 24 GB each
max_seq_len=512
train_batchsize=2
Traceback (most recent call last):
File "/workspace/code/code/chatglm_finetuning-stable-vocab130528-v2/train.py", line 182, in
trainer.fit(pl_model, train_dataloaders=train_datasets)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 520, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 92, in launch
return function(*args, **kwargs)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 559, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 911, in _run
self.strategy.setup(self)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 344, in setup
self.init_deepspeed()
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 448, in init_deepspeed
self._initialize_deepspeed_train(model)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 484, in _initialize_deepspeed_train
model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 413, in _setup_model_and_optimizer
deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in init
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1173, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1408, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 485, in init
self.initialize_optimizer_states()
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 608, in initialize_optimizer_states
single_grad_partition = torch.zeros(int(self.partition_size[i]),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.50 GiB (GPU 1; 31.75 GiB total capacity; 23.00 GiB already allocated; 7.91 GiB free; 23.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "train.py", line 182, in
trainer.fit(pl_model, train_dataloaders=train_datasets)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 520, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 92, in launch
return function(*args, **kwargs)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 559, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 911, in _run
self.strategy.setup(self)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 344, in setup
self.init_deepspeed()
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 448, in init_deepspeed
self._initialize_deepspeed_train(model)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 484, in _initialize_deepspeed_train
model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 413, in _setup_model_and_optimizer
deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in init
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1173, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1408, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 485, in init
self.initialize_optimizer_states()
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 608, in initialize_optimizer_states
single_grad_partition = torch.zeros(int(self.partition_size[i]),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.50 GiB (GPU 0; 31.75 GiB total capacity; 23.00 GiB already allocated; 7.91 GiB free; 23.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
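For context, the 11.50 GiB allocation in the trace is the size of a single fp32 buffer covering half the model's parameters, which suggests the ZeRO optimizer is holding state for essentially all of ChatGLM-6B, not just the p-tuning v2 prefix. A rough back-of-the-envelope sketch of the per-GPU cost (assuming roughly 6.2B parameters split across 2 GPUs; the exact count depends on the checkpoint):

```python
# Rough per-GPU memory estimate for ZeRO stage 1/2 when all ~6.2B ChatGLM-6B
# parameters are handed to the optimizer. Illustrative numbers only.
GiB = 1024 ** 3
n_params = 6.2e9
n_gpus = 2
partition = n_params / n_gpus                 # parameters owned by each rank

fp16_weights   = n_params * 2 / GiB           # full fp16 replica per GPU (~11.5 GiB)
fp32_grad_part = partition * 4 / GiB          # the single_grad_partition buffer in the trace (~11.5 GiB)
fp32_master    = partition * 4 / GiB          # fp32 master copy of the partition (~11.5 GiB)
adam_states    = partition * 8 / GiB          # exp_avg + exp_avg_sq (~23 GiB)

total = fp16_weights + fp32_grad_part + fp32_master + adam_states
print(f"{total:.1f} GiB per GPU before activations")   # ~58 GiB, far over 32 GiB
```

That matches the trace: about 23 GiB is already allocated when the next 11.5 GiB buffer is requested.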

The model is ChatGLM, without quantization.

OOM just means the memory isn't enough. Changing the batch size to 1 should work.
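Note that the failure happens in initialize_optimizer_states, before the first batch is even processed, so the batch size alone may not be enough. If the machine has enough host RAM, offloading the ZeRO optimizer state to the CPU is another option. A minimal sketch at the Lightning level (the repo wires this up through its own config files, so treat the call as illustrative, not as this project's API):

```python
# Minimal sketch: ZeRO stage 2 with the Adam states kept in host RAM.
# Generic Lightning/DeepSpeed usage, not the repo's own configuration keys.
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = Trainer(
    accelerator="gpu",
    devices=2,
    precision="16-mixed",
    strategy=DeepSpeedStrategy(
        stage=2,
        offload_optimizer=True,  # move optimizer state off the 32 GiB V100s
    ),
)
```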

Still not working. The command I run is CUDA_VISIBLE_DEVICES=0,1 python train.py.
The batch size is already down to 1, with max_seq_len=512.

Could it be that after changing the length you didn't delete the cache under output?

You mean delete the data under output after changing the parameters and rerun data_utils.py? I already did that.


Try turning off deepspeed and running it!
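For reference, a minimal sketch of what running without DeepSpeed looks like at the Lightning level (again illustrative; the repo selects the strategy in its config):

```python
# Illustrative only: the same two-GPU run with plain DDP instead of DeepSpeed.
from lightning.pytorch import Trainer

trainer = Trainer(
    accelerator="gpu",
    devices=2,
    precision="16-mixed",
    strategy="ddp",  # no ZeRO partitioning; each GPU holds everything it needs
)
```

Without ZeRO there is no partitioning at all, so each GPU keeps the full model plus any optimizer state for the trainable parameters, which is consistent with the roughly 30 GiB usage reported below.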

I turned it off and tried; it seems a bit better, but memory usage is still very high:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 1; 31.75 GiB total capacity; 30.54 GiB already allocated; 27.75 MiB free; 30.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


What is your ptv2 pre-seq-len? Try reducing it. First find a set of parameters that can actually run!

The default: 32

pre-seq-len: 16, batch size: 2, max_seq_len: 512, and it still doesn't work.
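For what it's worth, pre_seq_len mainly controls the number of trainable prefix parameters, which are tiny next to the frozen 6B model. A quick estimate (assuming ChatGLM-6B's 28 layers and hidden size 4096; these are assumptions, not values read from the repo):

```python
# Back-of-the-envelope count of p-tuning v2 prefix parameters for ChatGLM-6B.
# num_layers and hidden_size are assumed values for illustration.
num_layers, hidden_size = 28, 4096

def prefix_params(pre_seq_len: int) -> int:
    # one learned key vector and one learned value vector per layer per prefix token
    return pre_seq_len * num_layers * 2 * hidden_size

print(prefix_params(32))  # 7,340,032 (~7.3M)
print(prefix_params(16))  # 3,670,016 (~3.7M)
# Even with gradients and Adam states this is on the order of 100 MB,
# so halving pre_seq_len barely moves the ~30 GiB peak seen above.
```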

I'm using the chatglm_finetuning-stable-vocab130528-v2 branch.