intelligent-machine-learning/dlrover

megatron-lm flash-ckpt cannot save ckpt to disk when using pipeline parallelism

Lzhang-hub opened this issue · 9 comments

When training Megatron-LM with flash-ckpt and pipeline parallelism enabled, the checkpoint cannot be saved successfully. It seems that not all checkpoint shards are saved to memory:
Skip persisting the checkpoint of step 60 because the cached step in memory are not consistent.

Save log:

[2024-05-29 08:54:18,530] [INFO] [engine.py:303:save_state_dict_to_memory] 1 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,530] [INFO] [engine.py:303:save_state_dict_to_memory] 3 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 5 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 2 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 7 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,531] [INFO] [engine.py:303:save_state_dict_to_memory] 6 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,532] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,532] [INFO] [engine.py:303:save_state_dict_to_memory] 4 acquired the lock of shared memory: True.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 5 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 1 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 3 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,534] [INFO] [engine.py:309:save_state_dict_to_memory] Rank 7 skips the save the checkpoint in CPU memory since it is saving the latest checkpoint from the CPU memory into the storage.
[2024-05-29 08:54:18,535] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_memory in 0.006s.
[2024-05-29 08:54:18,535] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_memory in 0.007s.
[2024-05-29 08:54:18,536] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_memory in 0.008s.
[2024-05-29 08:54:18,537] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_memory in 0.008s.
[2024-05-29 08:54:18,705] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_memory in 0.175s.
[2024-05-29 08:54:18,717] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_memory in 0.188s.
[2024-05-29 08:54:18,767] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_memory in 0.237s.
[2024-05-29 08:54:18,871] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_memory in 0.34s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 1 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 2 execute save_to_storage in 0.342s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 3 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 4 execute save_to_storage in 0.341s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 0 execute save_to_storage in 0.341s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 5 execute save_to_storage in 0.343s.
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 6 execute save_to_storage in 0.342s.
[2024-05-29 08:54:18,872] [INFO] [ckpt_saver.py:532:_sync_shm_to_storage] ShardingSaver save checkpoint to storage, event CheckpointEvent(type=<CheckpointEventType.SAVE: 1>, step=60, global_shard_num=0)
[2024-05-29 08:54:18,872] [INFO] [engine.py:99:wrapper] Local rank 7 execute save_to_storage in 0.343s.
[2024-05-29 08:54:33,889] [INFO] [ckpt_saver.py:630:_check_shard_step_consistence] The cached steps are [60, 0, 60, 0, 60, 0, 60, 0]
[2024-05-29 08:54:33,889] [WARNING] [ckpt_saver.py:804:save_step_checkpoint] Skip persisting the checkpoint of step 60 because the cached step in memory are not consistent.
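
Judging from the log, the skip comes from the step-consistency check in ckpt_saver.py (_check_shard_step_consistence): the checkpoint is only flushed from shared memory to storage when every local shard has cached the same step, and here the cached steps are [60, 0, 60, 0, 60, 0, 60, 0], i.e. ranks 1/3/5/7 never wrote step 60 into shared memory because they were still persisting the previous checkpoint. A minimal sketch of that logic, reconstructed only from the log above (the function below is illustrative, not the dlrover implementation):

# Illustrative sketch of the consistency check implied by the log; not dlrover code.
def is_step_consistent(cached_steps, target_step):
    # Persist to storage only if every shard in shared memory holds the target step.
    return all(step == target_step for step in cached_steps)

cached_steps = [60, 0, 60, 0, 60, 0, 60, 0]  # values reported in the log above
if not is_step_consistent(cached_steps, 60):
    print("Skip persisting the checkpoint of step 60: cached steps are inconsistent.")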

I have successfully tested the flash checkpoint using the following command with 4 A100 nodes in the forked repo whose commit id is cb995d5.

dlrover-run --max-restarts=2  --nnodes=$NNODES --nproc_per_node=$GPUS_PER_NODE pretrain_gpt.py \
       --tensor-model-parallel-size $TP_SIZE \
       --pipeline-model-parallel-size $PP_SIZE \
       --use-distributed-optimizer \
       --num-layers 48 \
       --hidden-size 1600 \
       --num-attention-heads 16 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --micro-batch-size 4 \
       --global-batch-size 8 \
       --train-iters 100 \
       --lr-decay-iters 320000 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGE_FILE \
       --split 900,50,50 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --min-lr 1.0e-5 \
       --lr-decay-style cosine \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --log-interval 1 \
       --save-interval 100 \
       --eval-interval 1000 \
       --eval-iters 10 
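
For context, the only flash-checkpoint-related change in pretrain_gpt.py is swapping Megatron's checkpoint I/O for the flash-checkpoint version, roughly as below; the import path is written from memory of the dlrover docs, so treat it as an assumption rather than a verified snippet:

# Assumed flash-checkpoint integration in pretrain_gpt.py (module path as I
# recall it from the dlrover docs; verify against your dlrover version).
# It replaces Megatron's own save_checkpoint/load_checkpoint:
from dlrover.trainer.torch.flash_checkpoint.megatron import (
    load_checkpoint,
    save_checkpoint,
)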

Which version of dlrover? Besides, have you tested with torchrun?

I tested dlrover 0.3.7 with torchrun on this repo, and the error still occurs.

dlrover[torch]==0.3.7. I have reproduced the issue when I do not use --use-distributed-optimizer.

I added --use-distributed-optimizer and got a new error.
Env: 4*8=32 A100 GPUs, TP=2, PP=8

[2024-06-03 08:11:01,842] [INFO] [ckpt_saver.py:892:commit_checkpoint] The number of ready shards is 26 != 32.
Exception in thread checkpoint-saver:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 429, in _sa
ver
    cls._saver_instance._sync_shm_to_storage()
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 535, in _sy
nc_shm_to_storage
    self.save_step_checkpoint(event.step)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 854, in sav
e_step_checkpoint
    self.commit_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 881, in com
mit_checkpoint
    done_files = self.storage.listdir(step_done_dir)
  File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/storage.py", line 179, in listdir
    return os.listdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/work/wqssd-cfs/data/LLM/multi-node-train/megatron-lm
-train/checkpoint/qy/multi-node/gpt3-1.5B-flashckpt/._dlrover_ckpt_stage/20.done'
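
In case anyone else hits this on a FUSE/remote filesystem: the traceback shows commit_checkpoint listing the ._dlrover_ckpt_stage/<step>.done directory via storage.listdir, and on this storage the directory may simply not be visible yet when the saver checks it. A tolerant wrapper around the listing call, as a sketch only (not how dlrover actually handles it):

import os

def listdir_or_empty(path):
    # Sketch of a tolerant listing: return an empty list instead of raising
    # when the ".done" staging directory has not appeared on the storage yet.
    try:
        return os.listdir(path)
    except FileNotFoundError:
        return []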

The FileNotFoundError above is related to the storage I use and can be ignored.

However, the job still hangs in the end.