Hsword/Hetu

Question about the MoE examples

Qianshaowei opened this issue · 0 comments

Hi, I'm unable to run the MoE example (test_moe_top.py).
The error message is as follows:

2024-04-26 15:13:06,594 - __main__ - INFO - Training MoE Examples on HETU
libibverbs: Warning: couldn't load driver '/usr/local/infiniband/lib/libibverbs/libmlx4': /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/infiniband/lib/libibverbs/libmlx4-rdmav2.so)
device_id:  0
Traceback (most recent call last):
  File "test_moe_top.py", line 81, in <module>
    comm_mode=args.comm_mode)
  File "/data/MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 463, in __init__
    train_name=train_name, val_name=val_name, **kargs)
  File "/data/MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 418, in __init__
    topo_sort_with_hook(self.my_eval_nodes, self)
  File "/data//MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 1499, in topo_sort_with_hook
    topo_sort_dfs_with_hook(node, visited, config)
  File "/data/MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 1506, in topo_sort_dfs_with_hook
    node.backward_hook(config)
  File "/data/MoE/Hetu-main/python/hetu/optimizer.py", line 174, in backward_hook
    cur_node, config.param_allreduce_group.get(cur_param, config.nccl_comm))
AttributeError: 'HetuConfig' object has no attribute 'param_allreduce_group'
Exception ignored in: <function Executor.__del__ at 0x7f63122fdef0>
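
For reference, the last frame of the traceback reduces to an attribute lookup on a config object that never had that attribute assigned. The snippet below is a minimal sketch of that pattern only, not Hetu's actual code: `StubConfig`, `lookup_comm_strict`, and `lookup_comm_defensive` are hypothetical names made up for illustration, and the string `"default_comm"` stands in for a real communicator. It shows how the same expression raises `AttributeError` and how a `getattr` fallback would avoid it, assuming that falling back to the default communicator is acceptable.

```python
class StubConfig:
    """Hypothetical stand-in for a config object; NOT Hetu's HetuConfig."""

    def __init__(self, nccl_comm):
        # Only nccl_comm is set. param_allreduce_group is never assigned,
        # which mirrors the state the traceback suggests.
        self.nccl_comm = nccl_comm


def lookup_comm_strict(config, param):
    # Same shape as the expression in the traceback: raises AttributeError
    # when the config has no param_allreduce_group attribute.
    return config.param_allreduce_group.get(param, config.nccl_comm)


def lookup_comm_defensive(config, param):
    # Falls back to an empty mapping when the attribute is missing,
    # so the default communicator is returned instead of raising.
    groups = getattr(config, "param_allreduce_group", {})
    return groups.get(param, config.nccl_comm)


config = StubConfig(nccl_comm="default_comm")

try:
    lookup_comm_strict(config, "w1")
    strict_error = None
except AttributeError as exc:
    strict_error = str(exc)

defensive_result = lookup_comm_defensive(config, "w1")

print(strict_error)
print(defensive_result)
```

Whether a fallback like this is the right fix depends on where `param_allreduce_group` is supposed to be set for the chosen `comm_mode`, which the traceback does not show. A mismatch between the example script and the installed library version would produce the same symptom.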