intelligent-machine-learning/dlrover

client.connect(path) error when saving checkpoint

Opened this issue · 7 comments

When using dlrover to save checkpoints, the following error always occurs:

[2024-11-15 12:30:37,876] [INFO] [engine.py:131:start_saver_process] Start a process to asynchronously save checkpoint.
[2024-11-15 12:30:37,879] [INFO] [engine.py:299:_notify_agent_to_create_saver] Notify agent to create a checkpoint saver using: {'module_path': 'dlrover.python.elastic_agent.torch.ckpt_saver', 'class_name': 'DeepSpeedCheckpointSaver', 'kwargs': {'checkpoint_dir': '/work/share/chenyd/finetune/ChatGLM2-6B/model/checkpoints_out/ALL/original/chatglm2-6b/checkpoint-15', 'storage_meta': ClassMeta(module_path='dlrover.python.common.storage', class_name='PosixDiskStorage', kwargs={}), 'local_shard_num': 8, 'global_shard_num': 16, 'save_timeout': 600}}.
[2024-11-15 12:30:37,879] [WARNING] [multi_process.py:91:_create_socket_client] Unexpected error when creating socket client by path: /tmp/ckpt_sock/1857279191730585602/sharedqueue_factory.sock, error: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py", line 89, in _create_socket_client
    client.connect(path)
FileNotFoundError: [Errno 2] No such file or directory
[2024-11-15 12:30:37,895] [INFO] [ckpt_saver.py:451:_factory] Start the checkpoint saver factory.
/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py:48: ResourceWarning: unclosed <socket.socket fd=116, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
  time.sleep(1)
ResourceWarning: Enable tracemalloc to get the object allocation traceback

The code used is as follows:

    checkpointer = DeepSpeedCheckpointer(model, output_dir)
    result = checkpointer.save_checkpoint(
        output_dir,
        tag=self.state.global_step,
        storage_type=StorageType.DISK,
    )

How to solve this problem? I really hope to receive a reply.

In addition, there are always warnings like this during the saving process. How can I eliminate them?

/usr/local/python3.9.10/lib/python3.9/site-packages/dlrover/python/common/multi_process.py:271: ResourceWarning: unclosed <socket.socket fd=127, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, laddr=/tmp/ckpt_sock/1857345181448912897/sharedlock_shm_lock_1.sock>
  connection, _ = self._server.accept()
ResourceWarning: Enable tracemalloc to get the object allocation traceback
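
For readers following along: the traceback shows an `AF_UNIX` stream socket calling `connect(path)` on a socket file that does not exist, which is what `[Errno 2]` means here. The sketch below reproduces that behaviour with the standard library only. It is not dlrover's code; `try_connect` and the temporary path are hypothetical names used for illustration. It also closes the client socket on failure, because an unclosed socket is what triggers a `ResourceWarning` when the object is garbage-collected.

```python
import os
import socket
import tempfile


def try_connect(path):
    """Try to connect a Unix-domain stream socket to `path`.

    Returns the connected socket, or None if no server socket is
    available at that path. The client is closed on failure so it
    does not linger as an unclosed socket.
    """
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
    except (FileNotFoundError, ConnectionRefusedError):
        client.close()
        return None
    return client


# Hypothetical socket path inside a fresh temp dir: nothing has created it yet.
sock_path = os.path.join(tempfile.mkdtemp(), "sharedqueue_factory.sock")

# No server has bound the path, so connect() fails with ENOENT.
result = try_connect(sock_path)

# Once a server binds and listens on the path, the same call succeeds.
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)
server.listen(1)
client = try_connect(sock_path)
connected = client is not None

# Close both ends explicitly to avoid ResourceWarning.
if client is not None:
    client.close()
server.close()
os.unlink(sock_path)
```

If the warnings themselves are the only concern, Python's `warnings.simplefilter("ignore", ResourceWarning)` silences them, but that hides the symptom rather than closing the sockets.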

What's your version? Maybe you can try the master branch, which has this fixed: #1261

Thank you for your reply. I used pip install dlrover and the installed version is 0.3.8. Is there any difference between it and the master branch?

If you are using 0.3.8, this does not seem to be the same issue as the one I linked.

  1. Are you using 'dlrover-run' to launch your training?
  2. What is the value of the 'ROLE_NAME' environment variable in your training process?

1. I am not using 'dlrover-run'. I launch with 'torchrun' and only use dlrover when saving checkpoints.
2. The 'ROLE_NAME' environment variable has not been set. What should it be set to?
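
To report the details asked for above, a small diagnostic like the following could be run inside the training process. It is only a sketch: `describe_launch_env` is a hypothetical helper, and the `/tmp/ckpt_sock` default is taken from the socket path in the log above, not from dlrover's documentation. It makes no claim about what `ROLE_NAME` should be set to.

```python
import os


def describe_launch_env(environ=None, sock_root="/tmp/ckpt_sock"):
    """Collect the launch details asked about in this thread.

    `environ` defaults to the current process environment; `sock_root`
    is the directory seen in the warning message (an assumption here).
    """
    if environ is None:
        environ = os.environ
    role = environ.get("ROLE_NAME")
    return {
        "ROLE_NAME": role,
        "role_name_set": role is not None,
        "sock_root_exists": os.path.isdir(sock_root),
    }


if __name__ == "__main__":
    # Print one line per item so the output can be pasted into the issue.
    for key, value in describe_launch_env().items():
        print(f"{key}: {value}")
```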

So you are expecting to create a subprocess (the saver) to handle checkpoint saving during training?

I need more logging info about your context.

Probably the same issue as #1361