MCG-NJU/MeMOTR

Distributed operation

Tobelakers opened this issue · 7 comments

Hello, thank you for your excellent work.
While reproducing the code, I used the following command: python -m torch.distributed.run --nproc_per_node=8 main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model dab_deformable_detr.pth --use-distributed --data-root /home/sunzhaojie/MeMOTR/dataset/MOT17
The following error occurred while running the code:

Traceback (most recent call last):
File "/home/sunzhaojie/MeMOTR/main.py", line 120, in <module>
main(config=merged_config)
File "/home/sunzhaojie/MeMOTR/main.py", line 97, in main
torch.cuda.set_device(distributed_rank())
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(The same traceback is printed by each of the other failing ranks.)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3949687 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3949688 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 3949689) of binary: /home/sunzhaojie/.conda/envs/mot13/bin/python
Traceback (most recent call last):
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in
main()
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3949690)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 3949691)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 3949692)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 3949693)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 3949694)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3949689)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

How can I solve this problem? Looking forward to your reply!
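For context, "invalid device ordinal" generally means a worker process tried to select a GPU index that does not exist on the machine; this can happen when --nproc_per_node is larger than the number of visible GPUs. A minimal way to check the visible device count before launching (a generic PyTorch sketch, not part of MeMOTR):

import torch

# Number of GPUs visible to this process (respects CUDA_VISIBLE_DEVICES).
num_gpus = torch.cuda.device_count()
print(f"visible GPUs: {num_gpus}")
for i in range(num_gpus):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
# --nproc_per_node passed to torch.distributed.run should not exceed num_gpus.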

Hello, I have modified some code and changed the run command to:
python main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model memotr_mot17.pth --use-distributed --data-root /home/sunzhaojie/MeMOTR/dataset/
intending to run the code on a specified CUDA device without the distributed launcher. As a result, the following error occurred:

Traceback (most recent call last):
File "/home/sunzhaojie/MeMOTR/main.py", line 121, in
main(config=merged_config)
File "/home/sunzhaojie/MeMOTR/main.py", line 106, in main
submit(config=config)
File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 224, in submit
submitter.run()
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 73, in run
res = self.model(frame=frame, tracks=tracks)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/MeMOTR/models/memotr.py", line 133, in forward
outputs, init_reference, inter_references, inter_queries = self.transformer(
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/MeMOTR/models/deformable_transformer.py", line 225, in forward
memory = checkpoint(self.encoder, src_flatten, spatial_shapes, level_start_index,
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 251, in checkpoint
return _checkpoint_without_reentrant(
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 420, in _checkpoint_without_reentrant
output = function(*args, **kwargs)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/MeMOTR/models/deformable_encoder.py", line 59, in forward
output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/MeMOTR/models/deformable_encoder.py", line 124, in forward
src2 = self.self_attn(self.with_pos_embed(src, pos),
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/MeMOTR/models/ops/modules/ms_deform_attn.py", line 129, in forward
output = self.output_proj(output)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())
How can I solve this problem? Please reply, thank you!

I have not seen this error before. However, based on the error message you provided, it seems like there might be an issue with the CUDA memory or driver version on your system. May I ask for some details of your environment (such as the NVIDIA driver version, CUDA version, and PyTorch version)?

And one more thing: I have reviewed my code and found an error in the configuration related to model training. I have already fixed it on the latest commit.
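For reference, most of these details can be printed with a few lines of PyTorch (a minimal sketch; the NVIDIA driver version itself can be read from nvidia-smi):

import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)           # CUDA version PyTorch was built against
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))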

Thank you for your reply. Sorry, I haven't worked with distributed programs before.
My CUDA version is 11.7 and PyTorch is 1.13.1, so there should be no problem there.
I use two 3090 graphics cards, so nproc_per_node should be 2. I modified the run command to: python -m torch.distributed.run --nproc_per_node=2 main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model memotr_mot17.pth --use-distributed --data-root /home/sunzhaojie/MeMOTR/dataset/
A new error has occurred:
Traceback (most recent call last):
File "/home/sunzhaojie/MeMOTR/main.py", line 121, in
main(config=merged_config)
File "/home/sunzhaojie/MeMOTR/main.py", line 106, in main
submit(config=config)
File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 196, in submit
model = DDP(module=model, device_ids=[distributed_rank()], find_unused_parameters=False)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device b3000
Traceback (most recent call last):
File "/home/sunzhaojie/MeMOTR/main.py", line 121, in
main(config=merged_config)
File "/home/sunzhaojie/MeMOTR/main.py", line 106, in main
submit(config=config)
File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 196, in submit
model = DDP(module=model, device_ids=[distributed_rank()], find_unused_parameters=False)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 655, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device b3000
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4019226) of binary: /home/sunzhaojie/.conda/envs/mot13/bin/python
Traceback (most recent call last):
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in
main()
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2023-12-15_19:59:43
host : ubuntu-Precision-7920-Tower
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 4019227)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-12-15_19:59:43
host : ubuntu-Precision-7920-Tower
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 4019226)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you please tell me how to solve this? Thank you.

Can you try this script and see if any error comes out?

python main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model memotr_mot17.pth --data-root /home/sunzhaojie/MeMOTR/dataset/

Unlike your previous script, this one removes --use-distributed, so it runs without DDP.
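For reference, the "Duplicate GPU detected" error above appears when two ranks end up on the same CUDA device. With torch.distributed.run, the usual pattern is to bind each worker to its LOCAL_RANK before wrapping the model in DDP; a generic sketch under that assumption, not MeMOTR's actual code:

import os

import torch
import torch.distributed as dist

# torch.distributed.run / torchrun sets LOCAL_RANK for every worker it spawns.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)                  # each rank gets its own GPU
dist.init_process_group(backend="nccl")
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])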

Thank you very much for the command. It runs now, but the following error occurred during execution:

(mot13) sunzhaojie@ubuntu-Precision-7920-Tower:~/MeMOTR$ python main.py --mode submit --config-path /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model memotr_mot17.pth --data-root /home/sunzhaojie/MeMOTR/dataset/

Configs: {'ACCUMULATION_STEPS': 1, 'ACTIVATION': 'ReLU', 'AUX_LOSS': True, 'AUX_LOSS_WEIGHT': [1.0, 1.0, 1.0, 1.0, 1.0], 'AVAILABLE_GPUS': '0,1', 'BACKBONE': 'resnet50', 'BATCH_SIZE': 1, 'CHECKPOINT_LEVEL': 2, 'CLIP_MAX_NORM': 0.1, 'COCO_SIZE': True, 'CONFIG_PATH': '/home/sunzhaojie/MeMOTR/outputs/memotr_mot17/train/config.yaml', 'DATASET': 'MOT17', 'DATA_PATH': None, 'DATA_ROOT': '/home/sunzhaojie/MeMOTR/dataset/', 'DET_SCORE_THRESH': 0.5, 'DEVICE': 'cuda:1', 'DROPOUT': 0.0, 'EPOCHS': 130, 'EVAL_DATA_SPLIT': 'val', 'EVAL_DIR': None, 'EVAL_MODE': 'specific', 'EVAL_MODEL': None, 'EVAL_PORT': None, 'EVAL_THREADS': 1, 'EXTRA_TRACK_ATTN': False, 'FFN_DIM': 2048, 'FP_INSERT_RATE': 0.0, 'GIT_VERSION': None, 'HIDDEN_DIM': 256, 'LONG_MEMORY_LAMBDA': 0.01, 'LOSS_WEIGHT_FOCAL': 2, 'LOSS_WEIGHT_GIOU': 2, 'LOSS_WEIGHT_L1': 5, 'LR': 0.0002, 'LR_BACKBONE': 2e-05, 'LR_DROP_MILESTONES': [120], 'LR_DROP_RATE': 0.1, 'LR_POINTS': 2e-05, 'LR_SCHEDULER': 'MultiStep', 'MATCH_COST_BBOX': 5, 'MATCH_COST_CLASS': 2, 'MATCH_COST_GIOU': 2, 'MERGE_DET_TRACK_LAYER': 1, 'MISS_TOLERANCE': 15, 'MODE': 'submit', 'MOTION_LAMBDA': 0.5, 'MOTION_MAX_LENGTH': 5, 'MOTION_MIN_LENGTH': 3, 'MOTSYNTH_RATE': None, 'MULTI_CHECKPOINT': False, 'NUM_DEC_LAYERS': 6, 'NUM_DEC_POINTS': 4, 'NUM_DET_QUERIES': 300, 'NUM_ENC_LAYERS': 6, 'NUM_ENC_POINTS': 4, 'NUM_FEATURE_LEVELS': 4, 'NUM_HEADS': 8, 'NUM_WORKERS': 4, 'ONLY_TRAIN_QUERY_UPDATER_AFTER': 130, 'OUTPUTS_DIR': '/home/sunzhaojie/MeMOTR/outputs/MOT17', 'OVERFLOW_BBOX': True, 'PRETRAINED_MODEL': 'dab_deformable_detr.pth', 'RESULT_SCORE_THRESH': 0.5, 'RESUME': None, 'RESUME_SCHEDULER': True, 'RETURN_INTER_DEC': True, 'REVERSE_CLIP': 0.0, 'SAMPLE_INTERVALS': [10], 'SAMPLE_LENGTHS': [2, 3, 4, 5], 'SAMPLE_MODES': ['random_interval'], 'SAMPLE_MOT17_JOIN': 0, 'SAMPLE_STEPS': [60, 100], 'SEED': 42, 'SUBMIT_DATA_SPLIT': 'test', 'SUBMIT_DIR': '/home/sunzhaojie/MeMOTR/outputs/memotr_mot17/', 'SUBMIT_MODEL': 'memotr_mot17.pth', 'TP_DROP_RATE': 0.0, 'TRACK_SCORE_THRESH': 0.5, 'UPDATE_THRESH': 0.5, 'USE_CHECKPOINT': False, 'USE_CROWDHUMAN': None, 'USE_DAB': True, 'USE_DISTRIBUTED': False, 'USE_MOTION': False, 'USE_MOTSYNTH': None, 'VISUALIZE': False, 'WEIGHT_DECAY': 0.0001}
Submit seq: MOT17-12-SDP: 0%| | 0/900 [00:00<?, ?it/s]/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1670525541990/work/aten/src/ATen/native/TensorShape.cpp:3190.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Submit seq: MOT17-12-SDP: 9%|█████▎ | 85/900 [00:16<02:37, 5.16it/s]
Traceback (most recent call last):
File "/home/sunzhaojie/MeMOTR/main.py", line 121, in
main(config=merged_config)
File "/home/sunzhaojie/MeMOTR/main.py", line 106, in main
submit(config=config)
File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 223, in submit
submitter.run()
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/sunzhaojie/MeMOTR/submit_engine.py", line 73, in run
res = self.model(frame=frame, tracks=tracks)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/MeMOTR/models/memotr.py", line 133, in forward
outputs, init_reference, inter_references, inter_queries = self.transformer(
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/MeMOTR/models/deformable_transformer.py", line 225, in forward
memory = checkpoint(self.encoder, src_flatten, spatial_shapes, level_start_index,
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 251, in checkpoint
return _checkpoint_without_reentrant(
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 420, in _checkpoint_without_reentrant
output = function(*args, **kwargs)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/MeMOTR/models/deformable_encoder.py", line 59, in forward
output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/MeMOTR/models/deformable_encoder.py", line 124, in forward
src2 = self.self_attn(self.with_pos_embed(src, pos),
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/MeMOTR/models/ops/modules/ms_deform_attn.py", line 129, in forward
output = self.output_proj(output)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())

Is this error caused by my graphics card running out of memory? Or do I need to modify something? Thank you for taking the time to answer my questions!

For inference, our model usually needs about 2 GB of CUDA memory. Is any other program running on the same GPU? You can use a tool such as gpustat -i to monitor CUDA memory usage.

BTW, do you have a memory (RAM) overflow problem (you can use htop to monitor your RAM usage)?
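For reference, free and total CUDA memory can also be queried directly from PyTorch (a minimal sketch):

import torch

# Returns (free, total) memory in bytes for the given device.
free, total = torch.cuda.mem_get_info(0)
print(f"cuda:0 free: {free / 1024**3:.2f} GiB of {total / 1024**3:.2f} GiB")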

As I haven't received a reply for a while, I am closing this issue for now.
You can re-open it if you need to~