FlagAI-Open/FlagAI

[Question]: How do I run Aquila-pretrain with deepspeed? Could the maintainers provide a corresponding launch script?

zt1556329495 opened this issue · 2 comments

Description

Directly changing env_type to deepspeed produces the following error (a sketch of that change is shown after the log below):
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.44617509841918945 seconds
Loading extension module utils...
Time to load utils op: 0.3033127784729004 seconds
Loading extension module utils...
Time to load utils op: 0.7057700157165527 seconds
Loading extension module utils...
Time to load utils op: 0.40380406379699707 seconds
Loading extension module utils...
Time to load utils op: 0.4030745029449463 seconds
Rank: 1 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 0 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 7 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 5 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 6 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 2 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 3 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 4 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
[2023-07-10 09:01:23,550] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-07-10 09:01:23,551] [INFO] [utils.py:786:see_memory_usage] MA 14.35 GB Max_MA 14.35 GB CA 14.36 GB Max_CA 14 GB
[2023-07-10 09:01:23,551] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 115.52 GB, percent = 7.6%
/data/yhz/envs/zt_aquila/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126428 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126429 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126430 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126431 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126432 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126433 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126436 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 126428 via 15, forcefully exitting via 9
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 126430 via 15, forcefully exitting via 9
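For context, the env_type change described above would typically look something like the sketch below. This is a minimal, hedged example assuming the pretraining entry script builds a flagai.trainer.Trainer; the exact argument names and values used in the official Aquila-pretrain script may differ, and the paths and hyperparameters here are placeholders rather than recommended settings.

```python
# Minimal sketch (assumption): switching the FlagAI Trainer to the deepspeed
# launcher. Paths and hyperparameters are illustrative, not the official
# Aquila-pretrain configuration.
from flagai.trainer import Trainer

trainer = Trainer(
    env_type="deepspeed",                 # the change referred to in the question
    experiment_name="aquila_pretrain_ds",
    batch_size=4,
    epochs=1,
    lr=1.5e-4,
    fp16=True,
    num_nodes=1,
    num_gpus=8,
    deepspeed_config="./deepspeed.json",  # hypothetical path to the ZeRO config file
)
```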

Alternatives

No response

I also configured Aquila-pretrain's deepspeed JSON file following the one used in Aquila-chat.
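For anyone hitting the same issue, a deepspeed JSON in the spirit of the Aquila-chat one would look roughly like the sketch below. The keys are standard DeepSpeed options, but the specific values (ZeRO stage, batch sizes, offload settings) are assumptions, not the exact configuration shipped with Aquila-chat.

```python
# Illustrative sketch of a ZeRO-2 style deepspeed config written out as JSON.
# Values are placeholders and should be tuned to the actual hardware.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "offload_optimizer": {"device": "cpu"},  # can help when GPU memory is tight
    },
}

with open("deepspeed.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```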

Is this still happening? It runs fine on my end; could it be that you are running out of GPU memory?
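If memory is the suspect, one quick check is to print per-rank GPU memory right before optimizer-state initialization, since the log above stops at "Before initializing optimizer states", which is where ZeRO allocates the optimizer partitions. A minimal sketch using only standard torch.cuda calls (the placement of the call inside the training script is up to you):

```python
# Minimal sketch: report free/total GPU memory and current allocation on each
# rank, e.g. right before the DeepSpeed engine is initialized.
import os
import torch

def report_gpu_memory(tag: str = "") -> None:
    rank = int(os.environ.get("LOCAL_RANK", 0))
    free, total = torch.cuda.mem_get_info()    # bytes free/total on this device
    allocated = torch.cuda.memory_allocated()  # bytes currently held by tensors
    print(f"[rank {rank}] {tag} free={free / 1e9:.2f}GB "
          f"total={total / 1e9:.2f}GB allocated={allocated / 1e9:.2f}GB")

report_gpu_memory("before optimizer init")
```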