thunlp/WebCPM

Host memory blows up after two iterations

a101269 opened this issue · 3 comments

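After two iterations of finetune_cpm_bee.py, the server's host memory (RAM, not GPU memory) usage climbs sharply until it is exhausted and the run fails. The failure happens during the parameter update: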
  File "/home/adax/projects/WebCPM/training/scripts/../finetune_cpm_bee.py", line 210, in finetune
    optim_manager.step()
  File "/home/adax/anaconda3/lib/python3.9/site-packages/bmtrain-0.2.2-py3.9-linux-x86_64.egg/bmtrain/optim/optim_manager.py", line 131, in step
    optimizer.step(scale=self.loss_scale)
  File "/home/adax/anaconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/adax/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/adax/anaconda3/lib/python3.9/site-packages/bmtrain-0.2.2-py3.9-linux-x86_64.egg/bmtrain/optim/adam_offload.py", line 72, in step
    state['_param_fp32'] = torch.empty(p.size(), dtype=torch.float32, device="cpu") # on host
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 167772160 bytes. Error code 12 (Cannot allocate memory)

Traceback (most recent call last):
  File "/home/adax/projects/WebCPM/training/scripts/../finetune_cpm_bee.py", line 402, in <module>
    main()
  File "/home/adax/projects/WebCPM/training/scripts/../finetune_cpm_bee.py", line 398, in main
    finetune(args, tokenizer, model, optimizer, lr_scheduler)
  File "/home/adax/projects/WebCPM/training/scripts/../finetune_cpm_bee.py", line 210, in finetune
    optim_manager.step()
  File "/home/adax/anaconda3/lib/python3.9/site-packages/bmtrain-0.2.2-py3.9-linux-x86_64.egg/bmtrain/optim/optim_manager.py", line 131, in step
    optimizer.step(scale=self.loss_scale)
  File "/home/adax/anaconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/adax/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/adax/anaconda3/lib/python3.9/site-packages/bmtrain-0.2.2-py3.9-linux-x86_64.egg/bmtrain/optim/adam_offload.py", line 67, in step
    state['exp_avg'] = torch.zeros(p.size(), dtype=torch.float32, device="cpu") # on host
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 167772160 bytes. Error code 12 (Cannot allocate memory)
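
For scale, the failed request in the error message can be decoded with a little arithmetic. This is only a rough sketch: it assumes the offloaded optimizer keeps float32 host tensors for each parameter, as the two traceback lines (`exp_avg` and `_param_fp32`) suggest. The function names, the default of two host tensors per parameter, and the parameter count in the last line are illustrative assumptions, not values taken from bmtrain or from this issue.

```python
FP32_BYTES = 4  # size of one torch.float32 element


def elements_in_allocation(num_bytes: int, bytes_per_element: int = FP32_BYTES) -> int:
    """Number of elements a failed allocation of num_bytes would have held."""
    return num_bytes // bytes_per_element


def host_bytes_for_optimizer_state(num_params: int, fp32_tensors_per_param: int = 2) -> int:
    """Rough host RAM needed if each parameter gets N float32 tensors on the CPU.

    fp32_tensors_per_param=2 counts only the tensors visible in the traceback
    (exp_avg and _param_fp32); the real optimizer may keep more (assumption).
    """
    return num_params * fp32_tensors_per_param * FP32_BYTES


failed_bytes = 167772160  # taken from the RuntimeError message
print(failed_bytes / 2**20)                  # 160.0 (MiB)
print(elements_in_allocation(failed_bytes))  # 41943040 float32 elements

# Hypothetical model size, for illustration only.
hypothetical_params = 10_000_000_000
print(host_bytes_for_optimizer_state(hypothetical_params) / 2**30, "GiB")
```

The single failed request is small (160 MiB), which is consistent with host memory already being nearly full by the time it was made, not with one oversized tensor.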

Hello, could you share your server's exact configuration, such as how much CPU memory (RAM) it has?

If host memory is insufficient, you can switch from bmt.optim.AdamOffloadOptimizer to bmt.optim.AdamOptimizer. GPU memory usage will be somewhat higher in that case.
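
A minimal sketch of how the swap suggested above might be wired into a training script. The two class names come from this thread; the helper `pick_adam_class`, the `offload_to_cpu` flag, and the constructor call shown in the comment are hypothetical and should be checked against the installed bmtrain version.

```python
from types import SimpleNamespace


def pick_adam_class(optim_module, offload_to_cpu: bool):
    """Return the Adam variant to use from a module shaped like bmt.optim.

    offload_to_cpu=True  -> AdamOffloadOptimizer (optimizer state in host RAM)
    offload_to_cpu=False -> AdamOptimizer (optimizer state stays on the GPU)
    """
    name = "AdamOffloadOptimizer" if offload_to_cpu else "AdamOptimizer"
    return getattr(optim_module, name)


# In the real script (requires bmtrain and a GPU) this might look like:
#   import bmtrain as bmt
#   optimizer_cls = pick_adam_class(bmt.optim, offload_to_cpu=False)
#   optimizer = optimizer_cls(model.parameters())  # constructor args are an assumption

# Stand-in namespace so the sketch runs without bmtrain installed.
fake_optim = SimpleNamespace(AdamOffloadOptimizer="offload", AdamOptimizer="gpu")
print(pick_adam_class(fake_optim, offload_to_cpu=False))  # prints: gpu
```

Keeping the choice behind one flag makes it easy to fall back to the offloaded variant on machines where GPU memory, not host RAM, is the tighter limit.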

OK, thanks. The server has 32 cores and 256 GB of RAM, but other programs are already using 70 to 80 GB of it.