Hsword/Hetu

AssertionError: assert params_size == len(grads)

Fragile-azalea opened this issue · 1 comment

Describe the bug
The program terminates unexpectedly with an AssertionError because params_size is not equal to len(grads). I would appreciate the maintainers' help, thanks.

To Reproduce
Steps to reproduce the behavior:

  1. cd example/moe
  2. NCCL_DEBUG=DEBUG mpirun --mca btl '^openib' -np 1 python test_moe_top.py --top=1 --num_local_experts=2 --batch_size=16
  3. (alternative to step 2; produces the same log) NCCL_DEBUG=DEBUG mpirun --mca btl '^openib' -np 2 python test_moe_top.py --top=1 --num_local_experts=2 --batch_size=16

Logs

$ NCCL_DEBUG=DEBUG mpirun --mca btl '^openib' -np 1 python test_moe_top.py --top=1 --num_local_experts=2 --batch_size=16
2022-11-29 10:08:01,460 - __main__ - INFO - Training MoE Examples on HETU
device_id:  0
2022-11-29 10:08:03,679 - __main__ - INFO - Step 0
Traceback (most recent call last):
  File "test_moe_top.py", line 86, in <module>
    loss_val, predict_y, y_val, _  = executor.run(
  File "/home/xinglinpan/Hetu/python/hetu/gpu_ops/executor.py", line 446, in run
    return self.subexecutor[name].run(eval_node_list, feed_dict, convert_to_numpy_ret_vals, **kwargs)
  File "/home/xinglinpan/Hetu/python/hetu/gpu_ops/executor.py", line 972, in run
    self.compute(self.computing_nodes,
  File "/home/xinglinpan/Hetu/python/hetu/gpu_ops/executor.py", line 1048, in compute
    node.compute(input_vals, node_val, cur_stream)
  File "/home/xinglinpan/Hetu/python/hetu/optimizer.py", line 116, in compute
    self.optimizer.update(input_vals, stream_handle)
  File "/home/xinglinpan/Hetu/python/hetu/optimizer.py", line 188, in update
    assert params_size == len(grads)
AssertionError
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[1691,1],0]
  Exit code:    1
--------------------------------------------------------------------------

When the assertion fails, params_size is 6 and len(grads) is 2.
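
To narrow down a mismatch like this, it can help to list which parameters have no gradient instead of only comparing counts. The following is a minimal, framework-independent sketch and is not Hetu's API: the helper `find_params_without_grads` and the parameter names are hypothetical, chosen only to mirror the 6-versus-2 situation above.

```python
from typing import Dict, List, Optional, Sequence


def find_params_without_grads(
    param_names: Sequence[str],
    grads_by_name: Dict[str, Optional[object]],
) -> List[str]:
    """Return the names of parameters that have no gradient entry.

    Hypothetical helper: both arguments are plain Python containers,
    not Hetu objects.
    """
    return [name for name in param_names if grads_by_name.get(name) is None]


# Hypothetical names: 6 parameters, but gradients for only 2 of them.
params = ["gate_w", "expert0_w1", "expert0_w2",
          "expert1_w1", "expert1_w2", "out_w"]
grads = {"gate_w": 0.1, "out_w": 0.2}

missing = find_params_without_grads(params, grads)
print(f"{len(params)} params, {len(params) - len(missing)} grads, "
      f"missing: {missing}")
```

Reporting the names this way would show whether the parameters without gradients all belong to one part of the model (for example, the expert layers), which a bare `assert params_size == len(grads)` does not reveal.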

Platform

  • Device: GeForce RTX 2080Ti * 4
  • OS: Linux gpu9 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • CUDA version: 10.2
  • NCCL version: 2.10.3
  • PyTorch version: 1.9.1
  • Python Version: 3.8

Thanks for your patience. We have fixed this bug; please try #66.