facebookresearch/d2go

Out of memory when using PyTorch >= 2

mathrb opened this issue · 2 comments

Hello,

EDIT: Issue is closed. See latest comment for more information.

I've used d2go for at least a year, based on torch 1.13.1+cu116, with detectron2 and d2go pinned at specific commits.

I now want to benefit from the latest updates, especially since I definitely want to try the work done on deterministic training.
I've updated both detectron2 and d2go to their latest versions.

The training fails with this error:
ImportError: cannot import name 'ModuleWrapPolicy' from 'torch.distributed.fsdp.wrap' (/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/wrap.py)

As far as I can tell, ModuleWrapPolicy was only part of the torch nightly builds prior to 2.0 (and then part of 2.x onward).
Since I can't find those wheels anymore (nightlies prior to 2.0), I tried a torch version >= 2, but this led to another issue: the exact same experiment that used to require around 8 GB of GPU RAM now requests more than 450 GB and therefore fails.
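
For anyone hitting the same ImportError, here is a quick way to check whether the installed torch actually exposes ModuleWrapPolicy (a minimal sketch, nothing d2go-specific; the messages are just illustrative):

    import torch

    # ModuleWrapPolicy only appeared in the torch 2.0 nightlies and ships with
    # torch >= 2.0; the latest d2go imports it, which is where the ImportError
    # above comes from.
    try:
        from torch.distributed.fsdp.wrap import ModuleWrapPolicy  # noqa: F401
        print(f"torch {torch.__version__}: ModuleWrapPolicy is available")
    except ImportError:
        print(f"torch {torch.__version__}: ModuleWrapPolicy is missing, "
              "so the latest d2go fails at import time")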

I wonder what the up-to-date way is to do a fresh install that works with the current versions of d2go and detectron2.
Any help is much appreciated.

PS: I also tried with conda, but no luck, as it fails with a cudart version mismatch.

I've been able to put together two working Dockerfiles that reproduce my issue: one installs d2go at an older commit and the other at the latest commit.
When training a balloon model, as explained in the beginner guide, I don't face any out-of-memory issue; memory usage is as follows:

  • old commit: ~2.4 GB GPU RAM
  • new commit: ~1.1 GB GPU RAM

With my own dataset, and the same configuration file used for the balloon dataset, I still experience an OOM with the latest version of d2go (a way to measure this is sketched right after the list):

  • old commit: ~13.3 GB GPU RAM
  • new commit: CUDA out of memory. Tried to allocate 31.56 GiB. GPU 0 has a total capacity of 23.68 GiB of which 22.76 GiB is free
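
In case it helps with reproduction, a minimal way to capture peak GPU memory around a training run (plain PyTorch APIs, nothing d2go-specific; where exactly you hook this into your training script is up to you):

    import torch

    # Reset the high-water mark right before training starts...
    torch.cuda.reset_peak_memory_stats()

    # ... run the training loop here ...

    # ...then read the peak afterwards (device 0 by default).
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"peak GPU memory allocated: {peak_gib:.1f} GiB")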

Since the dataset is the only thing that changes between these two experiments, I assume that some code or default configuration has changed and introduced this behavior. Here is a diff between the output logs of the old-commit run and the new-commit run; you'll be able to see the changes in default configuration between the two.

Here is the link to the diff: https://editor.mergely.com/GVZ7uPWL
Otherwise, both original log files are attached below:

no_issue_local.log
oom_local.log

The "old commits":

EDIT: Actually, a major change between the two tests is the introduction of PyTorch 2. The old commits worked perfectly fine with PyTorch 1.13.1.
Unlike detectron2, which still works fine with PyTorch 1.13.1, the latest d2go versions require PyTorch 2.
I wonder what change, in either d2go or PyTorch 2, makes training on my custom dataset impossible, since it now requires far more GPU RAM than previously (PyTorch 1.13.1 and earlier commits of d2go).

I finally found the core issue:
To enable determinism with the later versions of d2go and detectron2, I used:
torch.use_deterministic_algorithms(True)
In the recent releases of torchvision, the team worked on determinism: pytorch/vision#8168
As we can see in this comment: "Oh you know what, it's probably because of use deterministic algorithms. We added a deterministic implementation but it is very memory hungry."
Another person reported this issue here:
pytorch/pytorch#120240
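
For anyone else landing here, the simplest way to confirm this on your own setup is an A/B run where the only change is the deterministic flag (a sketch; note that PyTorch requires CUBLAS_WORKSPACE_CONFIG to be set whenever deterministic algorithms are enabled on CUDA >= 10.2):

    import os
    import torch

    # Required by PyTorch for deterministic cuBLAS operations on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    # Run the exact same experiment twice, flipping only this flag, and compare
    # the peak memory reported at the end of each run.
    USE_DETERMINISTIC = True  # set to False for the comparison run
    torch.use_deterministic_algorithms(USE_DETERMINISTIC)

    # ... launch the training here ...

    print(f"deterministic={USE_DETERMINISTIC}, peak memory: "
          f"{torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")

Flipping USE_DETERMINISTIC to False should bring the memory usage back to the old levels if the deterministic implementation is indeed the culprit.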