pytorch/rl

[BUG] CUDA out-of-memory issue when creating a custom game environment with MultiaSyncDataCollector

Ironman9527 opened this issue · 4 comments

Describe the bug

Unless I set the environment variable os.environ["CUDA_VISIBLE_DEVICES"] = "", I encounter a CUDA out-of-memory error when creating a custom game environment with MultiaSyncDataCollector using num_workers = 8 and num_collectors = 8, even though all my devices are set to CPU.
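For clarity, the workaround mentioned above just hides the GPUs from the process (and any worker it spawns); it has to happen before torch initializes CUDA:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # must be set before torch touches CUDA

import torch
print(torch.cuda.is_available())  # False: no GPU is visible to this process or its children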

To Reproduce

# Imports needed for the snippet below; make_env is my own environment factory,
# defined elsewhere.
from torchrl.collectors import MultiaSyncDataCollector, SyncDataCollector
from torchrl.data.postprocs import MultiStep
from torchrl.envs.utils import ExplorationType

def get_collector(
    num_workers,
    num_collectors,
    actor_explore,
    frames_per_batch,
    total_frames,
    gamma,
    is_fork,
    device,
):
    # We can't use nested child processes with mp_start_method="fork"
    if is_fork:
        cls = SyncDataCollector
        env_arg = make_env(parallel=True, num_workers=num_workers)
    else:
        cls = MultiaSyncDataCollector
        env_arg = [
            make_env(parallel=True, num_workers=num_workers)
        ] * num_collectors
    data_collector = cls(
        env_arg,
        policy=actor_explore,
        frames_per_batch=frames_per_batch,
        total_frames=total_frames,
        # this is the default behaviour: the collector runs in ``"random"`` (or explorative) mode
        exploration_type=ExplorationType.RANDOM,
        # We set all the devices to be identical; heterogeneous devices
        # could be passed here instead.
        device=device, # cpu
        storing_device=device, # cpu
        split_trajs=False,
        postproc=MultiStep(gamma=gamma, n_steps=5),
    )
    return data_collector
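For reference, a minimal sketch (the helper name is illustrative, not part of torchrl) of how the is_fork flag above can be derived from the multiprocessing start method:

import multiprocessing as mp

# Hypothetical helper: True when the current start method is "fork", in which
# case the collector falls back to a single SyncDataCollector as above.
def start_method_is_fork() -> bool:
    return mp.get_start_method(allow_none=True) == "fork"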
Traceback (most recent call last):
  Process _ProcessNoWarn-6:5:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/dist-packages/torchrl/_utils.py", line 668, in run
    return mp.Process.run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torchrl/envs/batched_envs.py", line 1748, in _run_worker_pipe_shared_mem
    root_shared_tensordict = shared_tensordict.exclude("next")
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensordict/base.py", line 6007, in exclude
    result = self._exclude(*keys, inplace=inplace)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensordict/_td.py", line 2538, in _exclude
    result = TensorDict(
             ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensordict/_td.py", line 268, in __init__
    self._sync_all()
  File "/usr/local/lib/python3.11/dist-packages/tensordict/base.py", line 6824, in _sync_all
    torch.cuda.synchronize()
  File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 792, in synchronize
    return torch._C._cuda_synchronize()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Expected behavior

My machine has 80 logical CPUs and 8 GPUs (2080 Ti). I want to use as many environments as possible for training, but I hit GPU memory limits when starting many environments. With 64 environments, training proceeds normally: GPU memory is fully used but there are no errors, and CPU usage sits around 60%. Going beyond that number of environments results in a GPU OOM.
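A plausible explanation for this ceiling is that every worker process that calls torch.cuda.synchronize() initializes its own CUDA context on GPU 0, and dozens of contexts alone add up to several GB. A small, self-contained probe (assuming CUDA is available) to observe this in nvidia-smi while the workers sleep:

import time
import torch
import torch.multiprocessing as mp

def touch_cuda(rank: int) -> None:
    # Merely synchronizing forces CUDA context creation in this process.
    torch.cuda.synchronize()
    time.sleep(30)  # keep the context alive long enough to inspect nvidia-smi

if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [mp.Process(target=touch_cuda, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()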

System info

My environment is an Ubuntu Docker image with Python 3.11, Torch version 2.3.1+cu121, and TorchRL 0.4.

When I set _has_cuda to False, training proceeds normally with GPU memory usage around 1 GB. However, I am unsure whether this modification is correct.

Ultimately, I decided to modify this line of code to avoid the issue. I will proceed with training to see if any problems arise.
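For anyone hitting the same thing before a fix lands, a less invasive alternative to editing the installed package is to hide the GPUs (as shown above) or to no-op the private sync method shown in the traceback; a rough sketch of the latter, assuming TensorDictBase still exposes _sync_all:

from tensordict import TensorDictBase

# Hack, not a recommendation: disable the CUDA sync that the traceback points at,
# without editing the installed tensordict package.
TensorDictBase._sync_all = lambda self: None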

Interesting that synchronize creates an OOM, but I think this call is indeed unnecessary (in this specific case). I can patch that
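For illustration only (not the actual patch), such a guard could skip the synchronize whenever no leaf tensor lives on a CUDA device; a minimal sketch using only public tensordict calls:

import torch
from tensordict import TensorDict

def maybe_sync(td: TensorDict) -> None:
    # Skip the device-wide synchronize when no leaf tensor is on a CUDA device;
    # a CPU-only tensordict never needs it.
    has_cuda_leaf = any(
        isinstance(v, torch.Tensor) and v.device.type == "cuda"
        for v in td.values(include_nested=True, leaves_only=True)
    )
    if has_cuda_leaf:
        torch.cuda.synchronize()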