RuntimeError: Mismatched data types: One rank had type int64, but another rank had type uint8.
wqf321 opened this issue · 2 comments
wqf321 commented
Hi, I ran into a problem when running the command "horovodrun -np 2 python pretrain.py --config config/pretrain-tv-16gpu.json --output_dir ./pre_train_ckpt/ckpt/". Could you please help me?
0%| | 500/100000 [06:57<22:23:43, 1.23it/s][1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ - -------------------------------------------
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ - Step 500:
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ - mlm_tv_all: 3384 examples trained at 8 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ - mfm-nce_tv_all: 3192 examples trained at 7 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ - fom_tv_all: 1968 examples trained at 4 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ - vsm_tv_all: 3456 examples trained at 8 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ - ===========================================
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ - Step 500: start running validation
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ - validate on mlm_tv_all task
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ - start running MLM validation...
[1,0]<stderr>:02/24/2022 15:17:55 - INFO - __main__ - validation finished in 2 seconds, acc: 2.41
[1,0]<stderr>:02/24/2022 15:17:55 - INFO - __main__ - validate on mfm-nce_tv_all task
[1,0]<stderr>:02/24/2022 15:17:55 - INFO - __main__ - start running MFM-NCE validation...
[1,0]<stderr>:02/24/2022 15:17:58 - INFO - __main__ - validation finished in 2 seconds, loss: 15.16, acc: 1.99 (average 350 negatives)
[1,0]<stderr>:02/24/2022 15:17:58 - INFO - __main__ - validate on fom_tv_all task
[1,0]<stderr>:02/24/2022 15:17:58 - INFO - __main__ - start running FOM validation...
[1,0]<stderr>:02/24/2022 15:18:01 - INFO - __main__ - validation finished in 2 seconds, score: 1.92
[1,0]<stderr>:02/24/2022 15:18:01 - INFO - __main__ - validate on vsm_tv_all task
[1,0]<stderr>:02/24/2022 15:18:01 - INFO - __main__ - start running VSM validation...
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "pretrain.py", line 621, in <module>
[1,1]<stderr>: main(args)
[1,1]<stderr>: File "pretrain.py", line 372, in main
[1,1]<stderr>: validate(model, val_dataloaders, opts)
[1,1]<stderr>: File "pretrain.py", line 403, in validate
[1,1]<stderr>: val_log = validate_vsm(model, loader, opts)
[1,1]<stderr>: File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
[1,1]<stderr>: return func(*args, **kwargs)
[1,1]<stderr>: File "pretrain.py", line 436, in validate_vsm
[1,1]<stderr>: val_loss_st_ed = sum(all_gather_list(val_loss_st_ed))
[1,1]<stderr>: File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/utils/distributed.py", line 190, in all_gather_list
[1,1]<stderr>: out_buffer = hvd.allgather(in_buffer[:enc_byte+enc_size])
[1,1]<stderr>: File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 287, in allgather
[1,1]<stderr>: return HorovodAllgather.apply(tensor, name)
[1,1]<stderr>: File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 250, in forward
[1,1]<stderr>: return synchronize(handle)
[1,1]<stderr>: File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 443, in synchronize
[1,1]<stderr>: mpi_lib.horovod_torch_wait_and_clear(handle)
[1,1]<stderr>:RuntimeError: Mismatched data types: One rank had type int64, but another rank had type uint8.
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "pretrain.py", line 621, in <module>
[1,0]<stderr>: main(args)
[1,0]<stderr>: File "pretrain.py", line 372, in main
[1,0]<stderr>: validate(model, val_dataloaders, opts)
[1,0]<stderr>: File "pretrain.py", line 403, in validate
[1,0]<stderr>: val_log = validate_vsm(model, loader, opts)
[1,0]<stderr>: File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
[1,0]<stderr>: return func(*args, **kwargs)
[1,0]<stderr>: File "pretrain.py", line 427, in validate_vsm
[1,0]<stderr>: model(batch, 'vsm', compute_loss=True)
[1,0]<stderr>: File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 545, in __call__
[1,0]<stderr>: result = self.forward(*input, **kwargs)
[1,0]<stderr>: File "/opt/conda/lib/python3.6/site-packages/apex/amp/_initialize.py", line 194, in new_fwd
[1,0]<stderr>: **applier(kwargs, input_caster))
[1,0]<stderr>: File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 84, in forward
[1,0]<stderr>: batch['c_attn_masks'])
[1,0]<stderr>: File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 400, in get_video_level_scores
[1,0]<stderr>: modularized_query = vsm_allgather(modularized_query).contiguous()
[1,0]<stderr>: File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 452, in vsm_allgather
[1,0]<stderr>: return VsmAllgather.apply(tensor, None)
[1,0]<stderr>: File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 437, in forward
[1,0]<stderr>: torch.tensor([ctx.dim], device=tensor.device)
[1,0]<stderr>: File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 287, in allgather
[1,0]<stderr>: return HorovodAllgather.apply(tensor, name)
[1,0]<stderr>: File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 250, in forward
[1,0]<stderr>: return synchronize(handle)
[1,0]<stderr>: File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 443, in synchronize
[1,0]<stderr>: mpi_lib.horovod_torch_wait_and_clear(handle)
[1,0]<stderr>:RuntimeError: Mismatched data types: One rank had type int64, but another rank had type uint8.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[28050,1],1]
Exit code: 1
--------------------------------------------------------------------------
linjieli222 commented
Hi there,
Thanks for your interest in our HERO project. We did not run into this issue during our experiments. I see you are using your own virtual environment, which may be where the discrepancy comes from.
At first glance, one tensor has a different data type across ranks (uint8 vs. int64). My suggestion is to find out exactly which tensor it is, starting from "model/pretrain.py, line 452", and cast it to the same data type on all ranks.
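For example, here is a minimal sketch (not code from this repo) of how one could log the dtype each rank sees and force a common integer dtype right before the gather. The helper name `debug_and_cast` and the choice of int64 as the target dtype are assumptions based on the traceback above:

```python
import torch
import horovod.torch as hvd

def debug_and_cast(tensor, name=None):
    # Hypothetical helper: log the dtype observed on this rank so the
    # mismatched rank can be identified from the combined stderr output.
    print(f"rank {hvd.rank()}: {name} dtype={tensor.dtype}", flush=True)
    # Force a single integer dtype across ranks before gathering
    # (the error reports int64 on one rank and uint8 on another).
    if not tensor.is_floating_point():
        tensor = tensor.to(torch.int64)
    return hvd.allgather(tensor.contiguous(), name=name)
```

You could temporarily wrap the allgather calls in `vsm_allgather` and `all_gather_list` with something like this to confirm which tensor diverges before applying a permanent cast.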
linjieli222 commented
Closed due to inactivity