alpa.test_install error
vectercyg opened this issue · 3 comments
vectercyg commented
System information and environment
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): CentOS 7
- Python version: 3.9
- CUDA version: 11.3
- NCCL version: 2.16.5
- cupy version: 10.6.0
- GPU model and memory: Titan V
- Alpa version: 1.00.dev0
- TensorFlow version:
- JAX version: 0.3.22
- ray version: 2.3.0
Please describe the bug
After installing Alpa according to the documentation, I ran the installation test and got the following error.
Note: I tried switching the ray version to 1.13.0, but that did not help.
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/test_install.py", line 65, in <module>
runner.run(suite())
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/runner.py", line 184, in run
test(result)
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/suite.py", line 122, in run
test(result)
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/case.py", line 651, in __call__
return self.run(*args, **kwds)
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/case.py", line 592, in run
self._callTestMethod(testMethod)
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/unittest/case.py", line 550, in _callTestMethod
method()
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/test_install.py", line 49, in test_2_pipeline_parallel
actual_output = p_train_step(state, batch)
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/api.py", line 121, in __call__
self._decode_args_and_get_executable(*args))
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/api.py", line 191, in _decode_args_and_get_executable
executable = _compile_parallel_executable(f, in_tree, out_tree_hashable,
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/jax/linear_util.py", line 309, in memoized_fun
ans = call(fun, *args)
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/api.py", line 223, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/parallel_method.py", line 240, in compile_executable
return compile_pipeshard_executable(
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/compile_executable.py", line 112, in compile_pipeshard_executable
pipeshard_config = compile_pipeshard_executable_internal(
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/compile_executable.py", line 292, in compile_pipeshard_executable_internal
pipeshard_config = emitter_cls(**emitter_kwargs).compile()
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/runtime_emitter.py", line 396, in compile
self._compile_resharding_tasks()
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/runtime_emitter.py", line 335, in _compile_resharding_tasks
var] = SymbolicReshardingTask(spec, cg, src_mesh,
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 199, in __init__
self._compile()
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 224, in _compile
self.put_all_tasks()
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 244, in put_all_tasks
ray.get(task_dones)
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/ray/_private/worker.py", line 2382, in get
raise value
jax._src.traceback_util.UnfilteredStackTrace: ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=48050, ip=192.168.1.197, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f89f2834610>)
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/device_mesh.py", line 124, in __init__
self.distributed_client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: UNAVAILABLE: Socket closed
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/test_install.py", line 49, in test_2_pipeline_parallel
actual_output = p_train_step(state, batch)
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/compile_executable.py", line 112, in compile_pipeshard_executable
pipeshard_config = compile_pipeshard_executable_internal(
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/compile_executable.py", line 292, in compile_pipeshard_executable_internal
pipeshard_config = emitter_cls(**emitter_kwargs).compile()
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/runtime_emitter.py", line 396, in compile
self._compile_resharding_tasks()
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/runtime_emitter.py", line 335, in _compile_resharding_tasks
var] = SymbolicReshardingTask(spec, cg, src_mesh,
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 199, in __init__
self._compile()
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 224, in _compile
self.put_all_tasks()
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/pipeline_parallel/cross_mesh_resharding.py", line 244, in put_all_tasks
ray.get(task_dones)
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/pc/anaconda3/envs/cyg_alpa/lib/python3.9/site-packages/ray/_private/worker.py", line 2382, in get
raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=48050, ip=192.168.1.197, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f89f2834610>)
File "/home/pc/aibot/cuiyonggan/projects/Pipeline_Experiments/alpa/alpa/device_mesh.py", line 124, in __init__
self.distributed_client.connect()
jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: UNAVAILABLE: Socket closed
----------------------------------------------------------------------
Ran 2 tests in 50.295s
zhisbug commented
@vectercyg It seems that some range of your socket ports is closed, so Alpa/XLA cannot launch its clients.
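One way to check this locally is a quick port probe. The following is a minimal sketch using only Python's standard `socket` module; it does not use any Alpa or XLA API. The helper names `can_bind` and `can_connect`, the host, and the port range are assumptions for illustration, since the thread does not say which ports Alpa/XLA tried to use.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
        return True


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical host and port range; adjust to the ports in question.
    host = "127.0.0.1"
    for port in range(10000, 10010):
        status = "bindable" if can_bind(host, port) else "in use or blocked"
        print(port, status)
```

`can_bind` tells you whether a new server could listen on a port, while `can_connect` tells you whether something is already accepting connections there. If ports that should be free are reported as in use, a stale process from an earlier run may still be holding them.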
vectercyg commented
@vectercyg It seems that some of your socket ports are closed, so Alpa/XLA cannot launch its clients.
Hello, thank you for your reply. I am running this test on a machine with 2 GPUs. Could the closed ports still be the cause in that setup?
vectercyg commented
Hello, it does seem to have been the problem you described. The error disappeared after I restarted the machine.