alpa-projects/alpa

alpa.test_install error

vectercyg opened this issue · 3 comments

System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): CentOS 7
  • Python version: 3.9
  • CUDA version: 11.3
  • NCCL version: 2.16.5
  • cupy version: 10.6.0
  • GPU model and memory: Titan V
  • Alpa version: 1.00.dev0
  • TensorFlow version:
  • JAX version: 0.3.22
  • ray version: 2.3.0

Please describe the bug
After installing Alpa according to the documentation, I ran the installation test (alpa/test_install.py) and got the error below.
Note: I also tried switching Ray to version 1.13.0, but the error persisted.

======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/test_install.py", line 65, in <module>
    runner.run(suite())
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/runner.py", line 184, in run
    test(result)
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/suite.py", line 122, in run
    test(result)
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/case.py", line 651, in __call__
    return self.run(*args, **kwds)
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/case.py", line 592, in run
    self._callTestMethod(testMethod)
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/case.py", line 550, in _callTestMethod
    method()
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/test_install.py", line 49, in test_2_pipeline_parallel
    actual_output = p_train_step(state, batch)
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/api.py", line 121, in __call__
    self._decode_args_and_get_executable(*args))
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/api.py", line 191, in _decode_args_and_get_executable
    executable = _compile_parallel_executable(f, in_tree, out_tree_hashable,
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/jax/linear_util.py", line 309, in memoized_fun
    ans = call(fun, *args)
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/api.py", line 223, in _compile_parallel_executable
    return method.compile_executable(fun, in_tree, out_tree_thunk,
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/parallel_method.py", line 240, in compile_executable
    return compile_pipeshard_executable(
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/compile_executable.py", line 112, in compile_pipeshard_executable
    pipeshard_config = compile_pipeshard_executable_internal(
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/compile_executable.py", line 292, in compile_pipeshard_executable_internal
    pipeshard_config = emitter_cls(**emitter_kwargs).compile()
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/runtime_emitter.py", line 396, in compile
    self._compile_resharding_tasks()
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/runtime_emitter.py", line 335, in _compile_resharding_tasks
    var] = SymbolicReshardingTask(spec, cg, src_mesh,
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 199, in __init__
    self._compile()
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 224, in _compile
    self.put_all_tasks()
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 244, in put_all_tasks
    ray.get(task_dones)
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/ray/_private/worker.py", line 2382, in get
    raise value
jax._src.traceback_util.UnfilteredStackTrace: ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=48050, ip=192.168.1.197, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f89f2834610>)
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/device_mesh.py", line 124, in __init__
    self.distributed_client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: UNAVAILABLE: Socket closed

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/test_install.py", line 49, in test_2_pipeline_parallel
    actual_output = p_train_step(state, batch)
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/compile_executable.py", line 112, in compile_pipeshard_executable
    pipeshard_config = compile_pipeshard_executable_internal(
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/compile_executable.py", line 292, in compile_pipeshard_executable_internal
    pipeshard_config = emitter_cls(**emitter_kwargs).compile()
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/runtime_emitter.py", line 396, in compile
    self._compile_resharding_tasks()
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/runtime_emitter.py", line 335, in _compile_resharding_tasks
    var] = SymbolicReshardingTask(spec, cg, src_mesh,
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 199, in __init__
    self._compile()
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 224, in _compile
    self.put_all_tasks()
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 244, in put_all_tasks
    ray.get(task_dones)
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/ray/_private/worker.py", line 2382, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=48050, ip=192.168.1.197, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f89f2834610>)
  File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/device_mesh.py", line 124, in __init__
    self.distributed_client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: UNAVAILABLE: Socket closed

----------------------------------------------------------------------
Ran 2 tests in 50.295s

@vectercyg It seems that some range of your socket ports is closed, so Alpa/XLA cannot launch its clients.
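One way to check this locally is a small standard-library script like the sketch below. It only tests whether TCP listeners can be opened and reached on the loopback interface; it does not reproduce how Alpa or XLA choose their ports. The function names and the port range 10000-10010 are hypothetical and chosen for illustration, so substitute the range your setup actually uses.

```python
import socket
from contextlib import closing


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP listener can be opened on (host, port)."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        try:
            sock.bind((host, port))
            sock.listen(1)
            return True
        except OSError:
            # Port already in use, or binding is not permitted.
            return False


def unavailable_ports(start: int, stop: int, host: str = "127.0.0.1") -> list:
    """Return the ports in [start, stop) on which a listener cannot be opened."""
    return [port for port in range(start, stop) if not can_bind(port, host)]


def can_connect_loopback(host: str = "127.0.0.1", timeout: float = 2.0) -> bool:
    """Open a listener on an OS-chosen free port and try to connect to it."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as server:
        server.bind((host, 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        try:
            with closing(socket.create_connection((host, port), timeout=timeout)):
                return True
        except OSError:
            return False


if __name__ == "__main__":
    # Hypothetical range, for illustration only.
    print("Unavailable ports:", unavailable_ports(10000, 10011))
    print("Loopback connect works:", can_connect_loopback())
```

If the loopback connection fails or many ports are reported as unavailable, stale processes or firewall rules on the machine are worth checking before rerunning the test.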

Hello, thank you for your reply. I am running this test on a single machine with 2 GPUs. Does the port issue still apply in that case?

Hello, it does seem to have been the problem you described. The error disappeared after I restarted the machine.