Error encountered when using flash checkpoint
chencjcj opened this issue · 3 comments
I used flash checkpoint to run training in Megatron-LM and encountered an error when saving the checkpoint. Training stopped at this point:
[2024-05-29 06:41:59,152] [INFO] [engine.py:130:start_saver_process] Start a process to asynchronously save checkpoint.
[2024-05-29 06:41:59,152] [INFO] [engine.py:130:start_saver_process] Start a process to asynchronously save checkpoint.
[2024-05-29 06:41:59,153] [INFO] [engine.py:44:_local_rank0_log] Use the default process group to sync when saving checkpoint.
[2024-05-29 06:41:59,153] [INFO] [engine.py:44:_local_rank0_log] Use the default process group to sync when saving checkpoint.
[2024-05-29 06:41:59,158] [INFO] [ckpt_saver.py:434:_factory] Start the checkpoint saver factory.
[2024-05-29 06:41:59,159] [INFO] [ckpt_saver.py:434:_factory] Start the checkpoint saver factory.
[2024-05-29 06:42:00,163] [INFO] [ckpt_saver.py:399:init] Initialize the AsyncSaver with arguments: checkpoint_dir=./ckpt, local_shard_num=1, global_shard_num=1,
[2024-05-29 06:42:00,163] [INFO] [ckpt_saver.py:522:_sync_shm_to_storage] Async flash checkpoint saver starts!
[2024-05-29 06:42:00,163] [INFO] [ckpt_saver.py:399:init] Initialize the AsyncSaver with arguments: checkpoint_dir=./ckpt, local_shard_num=1, global_shard_num=1,
[2024-05-29 06:42:00,163] [INFO] [ckpt_saver.py:522:_sync_shm_to_storage] Async flash checkpoint saver starts!
[2024-05-29 06:42:01,159] [INFO] [ckpt_saver.py:526:_sync_shm_to_storage] Reset the shared memory after the training starts. The number of global shards is 1.
[2024-05-29 06:42:01,160] [INFO] [ckpt_saver.py:526:_sync_shm_to_storage] Reset the shared memory after the training starts. The number of global shards is 1.
saving checkpoint at iteration 20 to ./ckpt
[2024-05-29 06:42:01,161] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: True.
[2024-05-29 06:42:01,171] [INFO] [engine.py:303:save_state_dict_to_memory] 0 acquired the lock of shared memory: False.
Megatron-LM: main
dlrover: v0.3.7