Questions about MoE examples.
Qianshaowei opened this issue · 0 comments
Hi, I'm unable to run the MoE example (test_moe_top.py).
The error message is as follows:
2024-04-26 15:13:06,594 - __main__ - INFO - Training MoE Examples on HETU
libibverbs: Warning: couldn't load driver '/usr/local/infiniband/lib/libibverbs/libmlx4': /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/infiniband/lib/libibverbs/libmlx4-rdmav2.so)
device_id: 0
Traceback (most recent call last):
  File "test_moe_top.py", line 81, in <module>
    comm_mode=args.comm_mode)
  File "/data/MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 463, in __init__
    train_name=train_name, val_name=val_name, **kargs)
  File "/data/MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 418, in __init__
    topo_sort_with_hook(self.my_eval_nodes, self)
  File "/data//MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 1499, in topo_sort_with_hook
    topo_sort_dfs_with_hook(node, visited, config)
  File "/data/MoE/Hetu-main/python/hetu/gpu_ops/executor.py", line 1506, in topo_sort_dfs_with_hook
    node.backward_hook(config)
  File "/data/MoE/Hetu-main/python/hetu/optimizer.py", line 174, in backward_hook
    cur_node, config.param_allreduce_group.get(cur_param, config.nccl_comm))
AttributeError: 'HetuConfig' object has no attribute 'param_allreduce_group'
Exception ignored in: <function Executor.__del__ at 0x7f63122fdef0>
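
For reference, here is a minimal sketch of how this kind of AttributeError arises and how the lookup could be guarded. It does not use Hetu at all: `FakeConfig`, `lookup_comm`, and `lookup_comm_safe` are hypothetical names made up for illustration, and only the expression `config.param_allreduce_group.get(cur_param, config.nccl_comm)` is taken from the traceback. It assumes the attribute is simply never set on the config object in this run, which may or may not be the actual root cause.

```python
class FakeConfig:
    """Hypothetical stand-in for HetuConfig; not the real class."""

    def __init__(self, nccl_comm=None):
        self.nccl_comm = nccl_comm
        # param_allreduce_group is deliberately never assigned here,
        # to reproduce the shape of the reported failure.


def lookup_comm(config, param):
    # Same shape as the failing expression in the traceback.
    return config.param_allreduce_group.get(param, config.nccl_comm)


def lookup_comm_safe(config, param):
    # Defensive variant: treat a missing attribute as an empty mapping,
    # so the lookup falls back to the default communicator.
    groups = getattr(config, "param_allreduce_group", {})
    return groups.get(param, config.nccl_comm)


config = FakeConfig(nccl_comm="default_comm")

error_message = None
try:
    lookup_comm(config, "w1")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

print(lookup_comm_safe(config, "w1"))
```

Whether such a fallback is appropriate inside Hetu itself is an open question; the attribute may be expected to be initialized elsewhere for the chosen `comm_mode`.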