client.connect(path) error when saving checkpoint
Opened this issue · 7 comments
When using dlrover to save checkpoints, the following error always occurs:
[2024-11-15 12:30:37,876] [INFO] [engine.py:131:start_saver_process] Start a process to asynchronously save checkpoint.
[2024-11-15 12:30:37,879] [INFO] [engine.py:299:_notify_agent_to_create_saver] Notify agent to create a checkpoint saver using: {'module_path': 'dlrover.python.elastic_agent.torch.ckpt_saver', 'class_name': 'DeepSpeedCheckpointSaver', 'kwargs': {'checkpoint_dir': '/work/share/chenyd/finetune/ChatGLM2-6B/model/checkpoints_out/ALL/original/chatglm2-6b/checkpoint-15', 'storage_meta': ClassMeta(module_path='dlrover.python.common.storage', class_name='PosixDiskStorage', kwargs={}), 'local_shard_num': 8, 'global_shard_num': 16, 'save_timeout': 600}}.
[2024-11-15 12:30:37,879] [WARNING] [multi_process.py:91:_create_socket_client] Unexpected error when creating socket client by path: /tmp/ckpt_sock/1857279191730585602/sharedqueue_factory.sock, error: [Errno 2] No such file or directory
Traceback (most recent call last):
File "/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py", line 89, in _create_socket_client
client.connect(path)
FileNotFoundError: [Errno 2] No such file or directory
[2024-11-15 12:30:37,895] [INFO] [ckpt_saver.py:451:_factory] Start the checkpoint saver factory.
/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py:48: ResourceWarning: unclosed <socket.socket fd=116, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
time.sleep(1)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
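Editor's note: in the log above, the connect warning (12:30:37,879) is logged about 16 ms before "Start the checkpoint saver factory" (12:30:37,895). The client may therefore have tried to connect before the socket file existed, though the thread does not confirm this. The sketch below is not dlrover's code. It only reproduces the same `FileNotFoundError` with the standard `socket` module and shows one way to retry until the server side appears. The names `connect_with_retry` and `demo`, and the retry counts, are hypothetical.

```python
import os
import socket
import tempfile
import threading
import time


def connect_with_retry(path, retries=5, delay=0.2):
    """Try to connect to a Unix socket, retrying while the server is absent."""
    for attempt in range(1, retries + 1):
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            client.connect(path)
            return client  # the caller is responsible for closing it
        except (FileNotFoundError, ConnectionRefusedError) as exc:
            # Close the failed socket explicitly; leaving it open is what
            # produces "ResourceWarning: unclosed <socket.socket ...>".
            client.close()
            print(f"attempt {attempt}: {exc}")
            time.sleep(delay)
    return None


def demo():
    """Start a server half a second late and check the client still connects."""
    sock_dir = tempfile.mkdtemp()
    path = os.path.join(sock_dir, "demo.sock")

    def start_server_late():
        time.sleep(0.5)  # the socket file does not exist before this point
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind(path)
        server.listen(1)
        conn, _ = server.accept()
        conn.close()
        server.close()

    thread = threading.Thread(target=start_server_late, daemon=True)
    thread.start()

    client = connect_with_retry(path, retries=10, delay=0.2)
    connected = client is not None
    if client is not None:
        client.close()

    thread.join(timeout=5)
    if os.path.exists(path):
        os.unlink(path)
    os.rmdir(sock_dir)
    return connected


if __name__ == "__main__":
    print("connected:", demo())
```

If the real failure is only a startup race like this one, the first warning would be harmless as long as a later attempt succeeds. Whether that holds for dlrover 0.3.8 is an assumption to verify against the full log.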
The code used is as follows:
checkpointer = DeepSpeedCheckpointer(model, output_dir)
result = checkpointer.save_checkpoint(
    output_dir,
    tag=self.state.global_step,
    storage_type=StorageType.DISK,
)
How can I solve this problem? I would really appreciate a reply.
In addition, there are always warnings like this during the saving process. How can I eliminate them?
/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py:271: ResourceWarning: unclosed <socket.socket fd=127, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, laddr=/tmp/ckpt_sock/1857345181448912897/sharedlock_shm_lock_1.sock>
connection, _ = self._server.accept()
ResourceWarning: Enable tracemalloc to get the object allocation traceback
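Editor's note: one generic way to silence these messages is Python's standard `warnings` filter. This is only a sketch. It hides the warning without closing the underlying sockets, and the helper name `suppress_resource_warnings` is hypothetical. Python ignores `ResourceWarning` by default, so something in this environment has probably enabled it (for example a `-W` flag, `PYTHONWARNINGS`, or a library call).

```python
import warnings


def suppress_resource_warnings():
    """Ignore ResourceWarning in the current process.

    Call this early in the training script, before checkpoints are saved.
    It only affects the process that calls it.
    """
    warnings.filterwarnings("ignore", category=ResourceWarning)


if __name__ == "__main__":
    suppress_resource_warnings()
    # This warning is now filtered out and nothing is printed for it.
    warnings.warn("unclosed socket (demo)", ResourceWarning)
```

If the warnings come from a separate saver process, an in-process filter will not reach it. Setting the environment variable `PYTHONWARNINGS="ignore::ResourceWarning"` before launching is inherited by child Python processes and may be the better fit there.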
What's your version? Maybe you can try the master branch, which includes this fix: #1261
Thank you for your reply. I installed it with pip install dlrover, and the installed version is 0.3.8. Is there any difference between it and the master branch?
If you are using 0.3.8, this does not seem to be the same issue as the one I just linked.
- Are you using 'dlrover-run' to launch your training?
- What is the value of the 'ROLE_NAME' environment variable in the training process in your case?
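Editor's note: a quick way to answer the second question is to print the relevant environment variables from inside the training process. The sketch below uses only `os.environ`. The helper name `describe_launch_env` is hypothetical, and `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are included only because `torchrun` sets them, which helps confirm the snippet is running inside a launched worker.

```python
import os


def describe_launch_env(environ=os.environ):
    """Return the launcher-related variables, marking missing ones as unset."""
    keys = ("ROLE_NAME", "RANK", "LOCAL_RANK", "WORLD_SIZE")
    return {key: environ.get(key, "<unset>") for key in keys}


if __name__ == "__main__":
    for key, value in describe_launch_env().items():
        print(f"{key}={value}")
```

Run it at the top of the training script on each node and attach the output to the issue.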
1. I am not using 'dlrover-run'. I launch training with 'torchrun' and only use dlrover when saving checkpoints.
2. The 'ROLE_NAME' environment variable has not been set. What should it be set to?
So you are expecting to create a subprocess (the saver) to handle checkpointing during training?
We need more logging info about your context.
This is probably the same issue as #1361.