alpa-projects/alpa

AttributeError: 'DistributedArray' object has no attribute 'uuid'

chaokunyang opened this issue · 0 comments

Please describe the bug
Calling DistributedArray.block_until_ready, as one would on a JAX DeviceArray, makes Alpa raise an AttributeError:

2023-03-22 23:41:18,342	ERROR tensor_computing_v1.py:87 -- 'DistributedArray' object has no attribute 'uuid'
Traceback (most recent call last):
  File "/home/admin/ray-pack/tmp/job/05000080/package/tensor_computing_v1.py", line 85, in main
    alpa_compute()
  File "/home/admin/ray-pack/tmp/job/05000080/package/tensor_computing_v1.py", line 60, in alpa_compute
    results = timeit.repeat(
  File "/home/admin/micromamba/envs/alpa/lib/python3.8/timeit.py", line 238, in repeat
    return Timer(stmt, setup, timer, globals).repeat(repeat, number)
  File "/home/admin/micromamba/envs/alpa/lib/python3.8/timeit.py", line 205, in repeat
    t = self.timeit(number)
  File "/home/admin/micromamba/envs/alpa/lib/python3.8/timeit.py", line 177, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "/home/admin/ray-pack/tmp/job/05000080/package/tensor_computing_v1.py", line 56, in execute
    result.block_until_ready()
  File "/home/admin/ray-pack/tmp/job/05000080/pyenv/lib/python3.8/site-packages/alpa/device_mesh.py", line 1536, in block_until_ready
    self.device_mesh.block_until_ready_remote_buffers(self.uuid)
AttributeError: 'DistributedArray' object has no attribute 'uuid'
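
The last frame of the traceback shows block_until_ready reading self.uuid on an object that has no such attribute. The sketch below reproduces only that failure pattern in isolation. It is not Alpa's code: FakeMesh, FakeDistributedArray, and try_block are hypothetical stand-ins made up for illustration, and only the method name block_until_ready_remote_buffers is taken from the traceback.

```python
class FakeMesh:
    """Hypothetical stand-in for a device mesh (not Alpa's class)."""

    def __init__(self):
        # Records the ids this mesh was asked to wait on.
        self.waited = []

    def block_until_ready_remote_buffers(self, uuid):
        self.waited.append(uuid)


class FakeDistributedArray:
    """Hypothetical stand-in for the array type (not Alpa's class)."""

    def __init__(self, device_mesh):
        self.device_mesh = device_mesh
        # No `uuid` attribute is ever assigned on this instance.

    def block_until_ready(self):
        # Same shape as the failing line in the traceback: reads self.uuid.
        self.device_mesh.block_until_ready_remote_buffers(self.uuid)


def try_block(arr):
    """Return the AttributeError message if blocking fails, else None."""
    try:
        arr.block_until_ready()
    except AttributeError as exc:
        return str(exc)
    return None


mesh = FakeMesh()
message = try_block(FakeDistributedArray(mesh))
print(message)
```

Because the attribute lookup fails before the mesh method is called, the mesh never receives a wait request in this sketch. If Alpa behaves the same way, no synchronization happens at all when the error is raised.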

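Please describe the expected behavior
DistributedArray.block_until_ready should wait for the computation to finish and return without raising, as it does on a JAX DeviceArray.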

System information and environment

  • OS Platform and Distribution: Linux REPL7 docker
  • Python version: 3.8.16
  • CUDA version: 11.2
  • NCCL version: nccl-2.14.3.1
  • cupy version: 11.6.0
  • GPU model and memory: A10
  • Alpa version: master
  • TensorFlow version: not installed
  • JAX version: jaxlib-0.3.22.cuda112.cudnn810-cp38-cp38

To Reproduce

Code snippet to reproduce the problem

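import alpa
import jax.numpy as jnp

def matrix_compute(t):
    # Pairwise cosine similarity between the rows of t.
    t_transposed = jnp.transpose(t)
    dot_matrix = jnp.dot(t, t_transposed)
    # Row norms of t, once as a column vector and once as a row vector.
    v1_row_norm = jnp.linalg.norm(t, axis=1).reshape(-1, 1)
    v2_col_norm = jnp.linalg.norm(t_transposed, axis=0).reshape(1, -1)
    # The outer product of the norms is the denominator for every pair of rows.
    norm_matrix = jnp.dot(v1_row_norm, v2_col_norm)
    res = dot_matrix / norm_matrix
    res = jnp.where(jnp.isneginf(res), 0, res)
    return res

# vector_matrix is the 2-D input array; its definition is not included in this report.
matrix_compute_jit = alpa.parallelize(matrix_compute)
result = matrix_compute_jit(vector_matrix)
result.block_until_ready()  # raises the AttributeError shown in the traceback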
