AssertionError: assert params_size == len(grads)
Fragile-azalea opened this issue · 1 comment
Fragile-azalea commented
Describe the bug
The program terminates unexpectedly with an AssertionError in the optimizer's update step, because params_size is not equal to len(grads). I would appreciate the author's help, thanks.
To Reproduce
Steps to reproduce the behavior:
- cd example/moe
- NCCL_DEBUG=DEBUG mpirun --mca btl '^openib' -np 1 python test_moe_top.py --top=1 --num_local_experts=2 --batch_size=16
- (alternative to step 2, which produces the same log) NCCL_DEBUG=DEBUG mpirun --mca btl '^openib' -np 2 python test_moe_top.py --top=1 --num_local_experts=2 --batch_size=16
Logs
$ NCCL_DEBUG=DEBUG mpirun --mca btl '^openib' -np 1 python test_moe_top.py --top=1 --num_local_experts=2 --batch_size=16
2022-11-29 10:08:01,460 - __main__ - INFO - Training MoE Examples on HETU
device_id: 0
2022-11-29 10:08:03,679 - __main__ - INFO - Step 0
Traceback (most recent call last):
  File "test_moe_top.py", line 86, in <module>
    loss_val, predict_y, y_val, _ = executor.run(
  File "/home/xinglinpan/Hetu/python/hetu/gpu_ops/executor.py", line 446, in run
    return self.subexecutor[name].run(eval_node_list, feed_dict, convert_to_numpy_ret_vals, **kwargs)
  File "/home/xinglinpan/Hetu/python/hetu/gpu_ops/executor.py", line 972, in run
    self.compute(self.computing_nodes,
  File "/home/xinglinpan/Hetu/python/hetu/gpu_ops/executor.py", line 1048, in compute
    node.compute(input_vals, node_val, cur_stream)
  File "/home/xinglinpan/Hetu/python/hetu/optimizer.py", line 116, in compute
    self.optimizer.update(input_vals, stream_handle)
  File "/home/xinglinpan/Hetu/python/hetu/optimizer.py", line 188, in update
    assert params_size == len(grads)
AssertionError
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[1691,1],0]
Exit code: 1
--------------------------------------------------------------------------
At the point where the assertion fails, params_size is 6 and len(grads) is 2.
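The bare `assert` in the traceback does not report the two values, so they had to be printed separately. Below is a minimal sketch of a check that puts both counts and the parameter names into the message. It is not Hetu's code: `describe_mismatch` and the `param_i` names are hypothetical, and the only figures taken from this report are the counts 6 and 2.

```python
def describe_mismatch(param_names, grads):
    """Return a readable message if the counts differ, else None.

    param_names: names of the parameters the optimizer tracks (hypothetical).
    grads: gradient values actually passed to the update step.
    """
    params_size = len(param_names)
    if params_size == len(grads):
        return None
    return (
        f"optimizer expected {params_size} gradients "
        f"(one per parameter) but received {len(grads)}; "
        f"parameters: {', '.join(param_names)}"
    )


# Hypothetical names standing in for the 6 parameters from this report.
names = [f"param_{i}" for i in range(6)]
# Placeholders standing in for the 2 gradients that were received.
grads = [object(), object()]

message = describe_mismatch(names, grads)
print(message)
```

A message of this kind would show directly which parameters the optimizer tracks, which may help narrow down where the missing gradients should have come from.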
Platform
- Device: 4 × GeForce RTX 2080 Ti
- OS: Linux gpu9 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- CUDA version: 10.2
- NCCL version: 2.10.3
- PyTorch version: 1.9.1
- Python version: 3.8