[BUG] CUDA out-of-memory issue when creating a custom game environment with MultiaSyncDataCollector
Ironman9527 opened this issue · 4 comments
Describe the bug
Unless I set the environment variable os.environ["CUDA_VISIBLE_DEVICES"] = "", I hit a CUDA out-of-memory error when creating a custom game environment with MultiaSyncDataCollector using num_workers = 8 and num_collectors = 8, even though all of my devices are set to CPU.
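For reference, a minimal sketch of the workaround mentioned above; the variable has to be set before anything initializes CUDA in the process:

    import os

    # Workaround from the description above: hide every GPU from this process
    # (and the collector worker processes it spawns) before CUDA is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    import torch

    print(torch.cuda.is_available())  # False -> collectors stay on CPU, no CUDA context is created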
To Reproduce
import torchrl
from torchrl.collectors import MultiaSyncDataCollector, SyncDataCollector
from torchrl.data.postprocs import MultiStep
from torchrl.envs.utils import ExplorationType

# `make_env` is the factory for the custom game environment, defined elsewhere
# in the training script.


def get_collector(
    num_workers,
    num_collectors,
    actor_explore,
    frames_per_batch,
    total_frames,
    gamma,
    is_fork,
    device,
):
    # We can't use nested child processes with mp_start_method="fork"
    if is_fork:
        cls = SyncDataCollector
        env_arg = make_env(parallel=True, num_workers=num_workers)
    else:
        cls = MultiaSyncDataCollector
        env_arg = [
            make_env(parallel=True, num_workers=num_workers)
        ] * num_collectors
    data_collector = cls(
        env_arg,
        policy=actor_explore,
        frames_per_batch=frames_per_batch,
        total_frames=total_frames,
        # this is the default behaviour: the collector runs in "random" (explorative) mode
        exploration_type=ExplorationType.RANDOM,
        # We set all the devices to be identical. Below is an example of
        # heterogeneous devices
        device=device,  # cpu
        storing_device=device,  # cpu
        split_trajs=False,
        postproc=MultiStep(gamma=gamma, n_steps=5),
    )
    return data_collector
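A minimal sketch of how get_collector is invoked, assuming a CPU device and num_workers = num_collectors = 8 as above; actor_explore, make_env, and the frame counts are placeholders:

    import torch
    import torch.multiprocessing as mp

    # Placeholder values: frames_per_batch / total_frames / gamma are illustrative,
    # and `actor_explore` / `make_env` come from the rest of the training script.
    is_fork = mp.get_start_method(allow_none=True) == "fork"

    collector = get_collector(
        num_workers=8,
        num_collectors=8,
        actor_explore=actor_explore,
        frames_per_batch=512,
        total_frames=1_000_000,
        gamma=0.99,
        is_fork=is_fork,
        device=torch.device("cpu"),
    )
    for batch in collector:
        pass  # training step goes here
    collector.shutdown()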
Traceback (most recent call last):
Process _ProcessNoWarn-6:5:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/dist-packages/torchrl/_utils.py", line 668, in run
    return mp.Process.run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torchrl/envs/batched_envs.py", line 1748, in _run_worker_pipe_shared_mem
    root_shared_tensordict = shared_tensordict.exclude("next")
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensordict/base.py", line 6007, in exclude
    result = self._exclude(*keys, inplace=inplace)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensordict/_td.py", line 2538, in _exclude
    result = TensorDict(
             ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensordict/_td.py", line 268, in __init__
    self._sync_all()
  File "/usr/local/lib/python3.11/dist-packages/tensordict/base.py", line 6824, in _sync_all
    torch.cuda.synchronize()
  File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 792, in synchronize
    return torch._C._cuda_synchronize()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Expected behavior
My machine has 80 logical CPUs and 8 GPUs (2080 Ti). I want to run as many environments as possible for training, but I hit GPU memory limits when launching a large number of them. With 64 environments, training proceeds normally: GPU memory is fully used but there are no errors, and CPU usage sits around 60%. Going beyond that number of environments triggers the GPU OOM error.
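My guess at the mechanism (an assumption, not something verified): every worker process that touches the CUDA API, even just via torch.cuda.synchronize(), allocates its own CUDA context on GPU 0, so enough CPU-only workers can still exhaust an 11 GB 2080 Ti. A standalone sketch that makes this visible in nvidia-smi:

    # Sketch of the suspected cause: each spawned process calls the same
    # torch.cuda.synchronize() that fails in the traceback above, and GPU
    # memory grows with the number of processes even though no tensor is
    # ever placed on the GPU.
    import time
    import torch.multiprocessing as mp


    def touch_cuda(rank):
        import torch

        torch.cuda.synchronize()
        time.sleep(30)  # keep the process alive so its CUDA context is visible in nvidia-smi


    if __name__ == "__main__":
        mp.set_start_method("spawn")
        procs = [mp.Process(target=touch_cuda, args=(i,)) for i in range(8)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()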
System info
My environment is an Ubuntu Docker image with Python 3.11, Torch version 2.3.1+cu121, and TorchRL 0.4.
Interesting that synchronize creates an OOM, but I think this call is indeed unnecessary (in this specific case). I can patch that.
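A rough sketch of the kind of guard that would avoid this (not the actual patch; _maybe_sync_all is a hypothetical name): only synchronize when CUDA has already been initialized in the current process, so CPU-only collector workers never touch the CUDA API.

    import torch


    def _maybe_sync_all():
        # Hypothetical guard, not tensordict's real code: skip the global sync
        # when this process has never initialized CUDA, so CPU-only workers
        # don't allocate a CUDA context just to synchronize nothing.
        if torch.cuda.is_available() and torch.cuda.is_initialized():
            torch.cuda.synchronize()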