jbwang1997/OBBDetection

Distributed Multi-GPU Training did not decrease training time

Opened this issue · 1 comment

When I tried distributed training on 2 RTX A100 GPUs with a batch size of 4 images per GPU, the training time did not decrease.
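In case it helps with diagnosing this, a quick sanity check is to confirm that the job really launched one process per GPU, since an uninitialized process group silently behaves like a single-GPU run. A minimal sketch, assuming the run was started through torch.distributed (e.g. via the dist_train.sh launcher that mmdetection-based repos ship); the function name is only illustrative:

import torch
import torch.distributed as dist

def report_dist_status():
    # For a two-GPU launch, each process should print a distinct rank
    # out of a world size of 2.
    if dist.is_available() and dist.is_initialized():
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        print(f"rank {rank}/{world_size} on cuda:{torch.cuda.current_device()}")
    else:
        # Running single-process: no wall-clock speedup over one GPU is expected.
        print("torch.distributed is not initialized; training is not distributed")

Also note that with the same per-GPU batch size the time per iteration will not shrink; the gain should show up as fewer iterations per epoch, so comparing iterations per epoch between the one-GPU and two-GPU logs is another useful check.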

When I change the batch size to 8 images per GPU, I get this error:

Traceback (most recent call last):
  File "/opt/conda/envs/apdetection1/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/opt/conda/envs/apdetection1/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/envs/apdetection1/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
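For context on the traceback: Python 3.7's multiprocessing length-prefixes each pickled object it sends through a worker pipe with a signed 32-bit integer, so any single object over roughly 2 GiB raises exactly this struct.error (Python 3.8 lifted that limit). Doubling the per-GPU batch likely pushed a worker's pickled batch past that size. A hedged sketch of the first settings to try, using the mmdetection-style config fields that OBBDetection inherits from upstream (field names assumed; adjust to the actual config in use):

# Assumed mmdetection-style data settings; values are illustrative.
data = dict(
    # Per-GPU batch size: lowering it shrinks the pickled batch each DataLoader
    # worker sends back to the main process, keeping it under the 2 GiB limit.
    samples_per_gpu=4,
    # DataLoader worker processes per GPU; setting this to 0 bypasses the
    # multiprocessing pipe entirely and is a quick way to confirm the diagnosis.
    workers_per_gpu=2,
)

Upgrading the environment to Python >= 3.8 is the other straightforward option if the larger batch is needed.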

Hi. Have you managed to resolve this issue? I am currently experiencing the same problem where using multiple GPUs results in each GPU having the same memory usage as when using a single GPU. If you have any solutions or suggestions, could you please share them with me? Thank you very much!
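One point worth noting, offered as a general observation rather than a repo-specific answer: identical per-GPU memory usage is the expected behaviour for data-parallel training. Each GPU holds a full replica of the model plus its own samples_per_gpu batch, so memory per GPU does not drop as GPUs are added; what scales is the effective batch size and therefore how many images are processed per optimizer step. A rough sketch of the arithmetic (all numbers and names are hypothetical, not OBBDetection internals):

# Illustrative arithmetic only.
num_gpus = 2
samples_per_gpu = 4
effective_batch_size = num_gpus * samples_per_gpu    # 8 images per optimizer step

dataset_size = 80000                                 # hypothetical
iters_per_epoch = dataset_size // effective_batch_size
print(effective_batch_size, iters_per_epoch)         # fewer iterations per epoch than on 1 GPU

So the thing to compare between the single-GPU and multi-GPU runs is iterations per epoch (or wall-clock time per epoch), not memory per GPU; if that does not change either, the launch is probably not actually distributed (see the check earlier in this thread).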