megatron-lm flash-ckpt cannot save ckpt to disk when using pipeline parallelism
Lzhang-hub opened this issue · 9 comments
When training Megatron-LM with flash checkpoint and pipeline parallelism enabled, the checkpoint cannot be saved successfully. It looks like not all ranks save their checkpoint shard to memory, and the saver reports:
Skip persisting the checkpoint of step 60 because the cached step in memory are not consistent.
Save log:
[2024-05-29 08:54:18,530] [INFO] [engine.py:303:save_state_dict_to_memory] 1 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,530] [INFO] [engine.py:303:save_state_dict_to_memory] 3 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 5 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 2 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 7 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 6 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,532] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,532] [INFO] [engine.py:303:save_state_dict_to_memory] 4 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 5 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 1 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 3 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 7 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,535] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_memory in 0.006s.
[2024-05-29 08:54:18,535] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_memory in 0.007s.
[2024-05-29 08:54:18,536] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_memory in 0.008s.
[2024-05-29 08:54:18,537] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_memory in 0.008s.
[2024-05-29 08:54:18,705] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_memory in 0.175s.
[2024-05-29 08:54:18,717] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_memory in 0.188s.
[2024-05-29 08:54:18,767] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_memory in 0.237s.
[2024-05-29 08:54:18,871] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_memory in 0.34s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_storage in 0.342s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_storage in 0.341s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_storage in 0.341s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_storage in 0.342s.
[2024-05-29 08:54:18,872] [INFO] [ckpt_saver.py:532:_sync_shm_to_storage] ShardingSaver save checkpoint to storage, event CheckpointEvent(type=<CheckpointEventType.SAVE: 1>, step=60, global_shard_num=0)
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_storage in 0.343s.
[2024-05-29 08:54:33,889] [INFO] [ckpt_saver.py:630:_check_shard_step_consistence] The cached steps are [60, 0, 60, 0, 60, 0, 60, 0]
[2024-05-29 08:54:33,889] [WARNING] [ckpt_saver.py:804:save_step_checkpoint] Skip persisting the checkpoint of step 60 because the cached step in memory are not consistent.
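Judging from the log, local ranks 1, 3, 5 and 7 skip save_to_memory because they are still flushing the previous checkpoint from CPU memory into storage, which lines up with the cached steps [60, 0, 60, 0, 60, 0, 60, 0] reported by _check_shard_step_consistence: half of the shards never cached step 60, so the saver refuses to persist it. A minimal sketch of that kind of consistency check (illustrative only, not the actual DLRover implementation; all names here are hypothetical):

from typing import List

def cached_steps_consistent(cached_steps: List[int]) -> bool:
    """Return True only if every local shard cached the same training step."""
    return len(set(cached_steps)) == 1

cached = [60, 0, 60, 0, 60, 0, 60, 0]  # values reported by the log above
if not cached_steps_consistent(cached):
    # Mirrors the WARNING in the log: step 60 is not persisted to storage.
    print(f"Skip persisting the checkpoint of step {max(cached)} "
          f"because the cached steps {cached} are not consistent.")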
I have successfully tested the flash checkpoint using the following command with 4 A100 nodes in the forked repo whose commit id is cb995d5.
dlrover-run --max-restarts=2 --nnodes=$NNODES --nproc_per_node=$GPUS_PER_NODE pretrain_gpt.py \
--tensor-model-parallel-size $TP_SIZE \
--pipeline-model-parallel-size $PP_SIZE \
--use-distributed-optimizer \
--num-layers 48 \
--hidden-size 1600 \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--micro-batch-size 4 \
--global-batch-size 8 \
--train-iters 100 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--split 900,50,50 \
--distributed-backend nccl \
--lr 0.00015 \
--min-lr 1.0e-5 \
--lr-decay-style cosine \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--log-interval 1 \
--save-interval 100 \
--eval-interval 1000 \
--eval-iters 10
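For completeness, the launch command above only covers the training arguments; flash checkpoint itself is wired into the training script by swapping Megatron's checkpoint functions for DLRover's flash-checkpoint versions, which is what the forked repo already does. A sketch of that patch (the exact module path is an assumption based on the DLRover docs; verify it against your installed dlrover release):

# In megatron/training.py of the Megatron-LM fork, the stock imports
#   from megatron.checkpointing import load_checkpoint, save_checkpoint
# are replaced with the DLRover flash-checkpoint versions (module path assumed,
# check the DLRover docs for your version):
from dlrover.trainer.torch.flash_checkpoint.megatron import (
    load_checkpoint,
    save_checkpoint,
)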
Which version of dlrover are you using? Besides, have you tested with torchrun?
I tested dlrover 0.3.7 with torchrun using this repo; the error still occurs.
dlrover[torch]==0.3.7. I have reproduced the issue when I do not use --use-distributed-optimizer.
After adding --use-distributed-optimizer, I get a new error.
Env: 4 nodes x 8 GPUs = 32 A100 GPUs, TP=2, PP=8
[2024-06-03 08:11:01,842] [INFO] [ckpt_saver.py:892:commit_checkpoint] The number of ready shards is 26 != 32.
Exception in thread checkpoint-saver:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 429, in _saver
    cls._saver_instance._sync_shm_to_storage()
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 535, in _sync_shm_to_storage
    self.save_step_checkpoint(event.step)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 854, in save_step_checkpoint
    self.commit_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 881, in commit_checkpoint
    done_files = self.storage.listdir(step_done_dir)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/storage.py", line 179, in listdir
    return os.listdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/work/wqssd-cfs/data/LLM/multi-node-train/megatron-lm-train/checkpoint/qy/multi-node/gpt3-1.5B-flashckpt/._dlrover_ckpt_stage/20.done'
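For reference, the "26 != 32" message above compares the number of ready shard markers against the expected total, and the 32 comes straight from the topology. A rough sketch of the arithmetic (the "one shard per rank with the distributed optimizer" part is my assumption, not taken from the DLRover code):

# Topology reported above: 4 nodes x 8 GPUs, tensor parallel 2, pipeline parallel 8.
NNODES = 4
GPUS_PER_NODE = 8
TP_SIZE = 2
PP_SIZE = 8

world_size = NNODES * GPUS_PER_NODE            # 32 ranks in total
dp_size = world_size // (TP_SIZE * PP_SIZE)    # 2 data-parallel replicas

# Assumption: with --use-distributed-optimizer every rank owns a unique optimizer
# shard, so the saver waits for one ready shard per rank before committing.
expected_ready_shards = world_size             # 32; the log only saw 26 ready
print(world_size, dp_size, expected_ready_shards)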
The FileNotFoundError above is related to the storage I use and can be ignored. However, the job still hangs in the end.
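If the shared filesystem (here a CFS mount) is slow to expose the staging directory when commit_checkpoint lists it, one possible workaround is to make the listing tolerant of a missing path. This is only a sketch of the idea (safe_listdir is a hypothetical helper, not part of DLRover) and it does not address the root cause of the hang:

import os
from typing import List

def safe_listdir(path: str) -> List[str]:
    """List a directory on a possibly slow shared filesystem.

    Creates the directory if it does not exist yet, so listing an empty
    staging directory does not raise FileNotFoundError.
    """
    os.makedirs(path, exist_ok=True)
    return os.listdir(path)

# Hypothetical usage mirroring the failing call in the traceback above:
# done_files = safe_listdir(step_done_dir)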