Multi-GPU training does not seem to fully utilize every GPU. How can I get the compute load on each GPU to 100%?

Question

Multi-GPU training does not seem to fully utilize every GPU. How can I get the compute load on each GPU to 100%?

Opened this issue a year ago · 13 comments

RayneSun commented a year ago

I set CUDA_VISIBLE_DEVICES and device_map. When running on two A100s, both GPUs do show memory usage, but GPU utilization is always high on one card and very low on the others.

Answer 1 · 2023-07-19T08:29:26.000Z

Which training method are you using?

Answer 2 · 2023-07-19T08:41:43.000Z

I am using LoRA to train Baichuan-13B.

Answer 3 · 2023-07-19T08:58:24.000Z

That should not happen. When I train, the GPUs are basically all fully utilized.

Answer 4 · 2023-07-19T09:17:11.000Z

It looks roughly like this, somewhat like pipeline parallelism.
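For context, a pattern like this is what naive layer-wise model parallelism tends to produce: when a model is split across GPUs by layer and a single process drives it, each GPU works only while its own layers are executing. The toy model below illustrates that idea. The stage times are made-up numbers, not measurements from this thread, and `naive_pipeline_utilization` is a hypothetical helper.

```python
def naive_pipeline_utilization(stage_times):
    """Return the fraction of wall time each device is busy when the
    stages run strictly one after another (no overlap between devices)."""
    total = sum(stage_times)
    return [t / total for t in stage_times]


# Assumed example: GPU 0 holds three quarters of the layers, GPU 1 the rest.
print(naive_pipeline_utilization([3.0, 1.0]))  # [0.75, 0.25]

# Even with a perfectly even split, neither GPU can exceed 50%.
print(naive_pipeline_utilization([1.0, 1.0]))  # [0.5, 0.5]
```

Under this simplification the utilizations always sum to 1, so adding GPUs spreads the memory but not the compute.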

Answer 5 · 2023-07-19T09:22:09.000Z

Is it because I am not using DeepSpeed? Could you share the shell script you use to run Baichuan-13B?

Answer 6 · 2023-07-19T09:24:42.000Z

https://github.com/jianzhnie/Efficient-Tuning-LLMs/blob/main/train_lora.py#L169C12-L169C13

Answer 7 · 2023-07-19T09:25:21.000Z

Model parallelism may be enabled at this location. Try commenting out these two lines.

Answer 8 · 2023-07-21T09:50:42.000Z

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=1 train_lora.py \
    --model_name_or_path ../Baichuan-13B-Chat \
    --dataset_name train.json,test.json \
    --data_dir ../../data/toolbench \
    --load_from_local yes \
    --output_dir baichuan-lora \
    --max_steps 50000 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 1000 \
    --learning_rate 5e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.07 \
    --optim "adamw_torch" \
    --lr_scheduler_type "linear" \
    --model_max_length 2560 \
    --source_max_len 2048 \
    --target_max_len 512 \
    --logging_steps 5 \
    --do_train \
    --gradient_checkpointing True \
    --trust_remote_code true \
    --lora_target_modules W_pack \
    --deepspeed "ds_config_zero3_auto.json"

Answer 9 · 2023-07-21T09:51:17.000Z

I commented out the two lines you mentioned, but a single GPU still shows high utilization during the run.

Answer 10 · 2023-07-21T10:09:04.000Z

I also removed the device_map setting from train_lora:

because leaving it in raises this error:
ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.
Is the problem related to this?
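As the error message states, a device_map cannot be passed together with DeepSpeed ZeRO-3. One way to avoid editing the script by hand for each launch mode is to pick the value from the environment. This is a minimal sketch, not code from train_lora.py: `choose_device_map` is a hypothetical helper, and it assumes the script is launched with torchrun, which sets `WORLD_SIZE` and `LOCAL_RANK` for each process.

```python
import os


def choose_device_map(use_zero3, environ=None):
    """Pick a device_map value for from_pretrained (hypothetical helper).

    - ZeRO-3: return None, because DeepSpeed places the parameters itself.
    - Several processes (data parallel): put the whole model on this
      process's own GPU.
    - Single process: let "auto" spread the layers over the visible GPUs.
    """
    env = os.environ if environ is None else environ
    if use_zero3:
        return None
    world_size = int(env.get("WORLD_SIZE", "1"))
    if world_size > 1:
        return {"": int(env.get("LOCAL_RANK", "0"))}
    return "auto"


# Explicit dicts stand in for the real environment in these examples.
print(choose_device_map(True, {"WORLD_SIZE": "2", "LOCAL_RANK": "1"}))   # None
print(choose_device_map(False, {"WORLD_SIZE": "2", "LOCAL_RANK": "1"}))  # {'': 1}
print(choose_device_map(False, {}))                                      # auto
```

The result would be passed as the `device_map` argument when loading the model. Note that the `"auto"` branch is the one that splits layers across GPUs, so it trades compute utilization for memory.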

Answer 11 · 2023-07-21T10:37:10.000Z

I think I found the problem: the launch argument needs to be set to --nproc_per_node=2.
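The mismatch can be checked before training starts. The sketch below compares the number of GPUs listed in `CUDA_VISIBLE_DEVICES` with the number of processes torchrun started (`WORLD_SIZE`), and reports the resulting effective batch size. `describe_launch` is a hypothetical helper, and it assumes a single node with one process per GPU, as in data-parallel or ZeRO training.

```python
import os


def describe_launch(per_device_batch, grad_accum, environ=None):
    """Summarize a launch: visible GPUs, processes, idle GPUs, effective batch."""
    env = os.environ if environ is None else environ
    visible = [d for d in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if d.strip()]
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "visible_gpus": len(visible),
        "processes": world_size,
        # GPUs that no process is assigned to (single-node assumption)
        "idle_gpus": max(len(visible) - world_size, 0),
        "effective_batch": per_device_batch * grad_accum * world_size,
    }


# --nproc_per_node=1 with two visible GPUs: one GPU gets no process.
print(describe_launch(4, 8, {"CUDA_VISIBLE_DEVICES": "0,1", "WORLD_SIZE": "1"}))
# --nproc_per_node=2: both GPUs get a process and the effective batch doubles.
print(describe_launch(4, 8, {"CUDA_VISIBLE_DEVICES": "0,1", "WORLD_SIZE": "2"}))
```

With the batch size of 4 and 8 gradient accumulation steps from the command in this thread, the effective batch goes from 32 to 64 when the second process is added, so the learning rate or accumulation steps may need revisiting.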

Answer 12 · 2023-07-22T07:18:30.000Z

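"I think I found the problem: the launch argument needs to be set to --nproc_per_node=2."

Were you able to train all the way to the end? I ran the same training code as you and it crashed after 200 steps.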

Answer 13 · 2023-07-24T02:25:18.000Z

In the end I did not use DeepSpeed. With it, training was actually much slower.