KimMeen/Time-LLM

Error in outputs, batch_y = accelerator.gather_for_metrics((outputs, batch_y))


Hi, has anyone encountered this error? (I tried increasing batch_size to 64 and 128.)
Training runs, but when it reaches vali_loss, vali_mae_loss = vali(args, accelerator, model, vali_data, vali_loader, criterion, mae_metric), the line outputs, batch_y = accelerator.gather_for_metrics((outputs, batch_y)) raises the following error:

File ".local/lib/python3.11/site-packages/accelerate/accelerator.py", line 2242, in gather_for_metrics
data = self.gather(input_data)
^^^^^^^^^^^^^^^^^^^^^^^
File ".local/lib/python3.11/site-packages/accelerate/accelerator.py", line 2205, in gather
return gather(tensor)
^^^^^^^^^^^^^^
File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 378, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 439, in gather
return _gpu_gather(tensor)
^^^^^^^^^^^^^^^^^^^
File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 358, in _gpu_gather
return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 107, in recursively_apply
return honor_type(
^^^^^^^^^^^
File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 81, in honor_type
return type(obj)(generator)
^^^^^^^^^^^^^^^^^^^^
File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 110, in
recursively_apply(
File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
return func(data, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".local/lib/python3.11/site-packages/accelerate/utils/operations.py", line 355, in _gpu_gather_one
torch.distributed.all_gather(output_tensors, tensor)
File ".local/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/.local/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2615, in all_gather
work = default_pg.allgather([tensor_list], [tensor])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA calloc async 24 bytes
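
For context, the failing line follows roughly this pattern inside the validation loop; the sketch below uses illustrative names and is not the exact Time-LLM code. gather_for_metrics runs torch.distributed.all_gather over NCCL, which is where the DistBackendError above surfaces.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

def vali_step(model, batch_x, batch_y, criterion):
    # Each process computes predictions on its own shard of the validation set.
    with torch.no_grad():
        outputs = model(batch_x)
    # gather_for_metrics collects tensors from every process via
    # torch.distributed.all_gather (NCCL on GPU) -- the call that raises
    # the DistBackendError in the traceback above.
    outputs, batch_y = accelerator.gather_for_metrics((outputs, batch_y))
    return criterion(outputs, batch_y)
```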

Hi, this error is usually caused by an incorrect CUDA device environment or an inconsistent number of devices. Our default scripts require 8 A100 GPUs. Do you have that many CUDA devices in your environment?
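
If it helps to narrow it down, here is a minimal sanity check (assuming the 8-GPU default mentioned above; expected_devices is just an illustrative name) that compares the process count Accelerate launched with the CUDA devices visible to the script:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
expected_devices = 8  # the default Time-LLM scripts assume 8 GPUs (see above)

if accelerator.is_main_process:
    print(f"launched processes  : {accelerator.num_processes}")
    print(f"visible CUDA devices: {torch.cuda.device_count()}")
    if accelerator.num_processes != expected_devices:
        print("Process count differs from the expected device count; "
              "adjust --num_processes / the launch script or CUDA_VISIBLE_DEVICES.")
```

Exporting NCCL_DEBUG=INFO before accelerate launch, as the error message suggests, should also print the underlying CUDA error that NCCL hit.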