geoelements/gns

RuntimeError: CUDA out of memory

jpvantassel opened this issue · 1 comment

Summary

gns runs out of GPU memory after ~1300 training steps when training on the DeepMind Sand dataset.

Steps to Reproduce

download Sand dataset and stage on scratch
launch job on rtx node
start venv on Frontera
run training

Logs

2022-01-04 16:31:09.932142: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-04 16:31:22.916150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13561 MB memory:  -> device: 0, name: Quadro RTX 5000, pci bus id: 0000:02:00.0, compute capability: 7.5
2022-01-04 16:31:23.007940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13563 MB memory:  -> device: 1, name: Quadro RTX 5000, pci bus id: 0000:03:00.0, compute capability: 7.5
2022-01-04 16:31:23.009356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 13563 MB memory:  -> device: 2, name: Quadro RTX 5000, pci bus id: 0000:82:00.0, compute capability: 7.5
2022-01-04 16:31:23.011015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 13563 MB memory:  -> device: 3, name: Quadro RTX 5000, pci bus id: 0000:83:00.0, compute capability: 7.5
context:  {'particle_type': <tf.Tensor 'Reshape_1:0' shape=(None,) dtype=int64>, 'key': <tf.Tensor 'ParseSingleSequenceExample/ParseSequenceExample/ParseSequenceExampleV2:3' shape=() dtype=int64>}
features:  {'position': <tf.Tensor 'Reshape:0' shape=(321, None, 2) dtype=float32>}
device = cuda
2022-01-04 16:31:34.652143: W tensorflow/core/framework/dataset.cc:744] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Training step: 0/20000000. Loss: 8.577290534973145.
...
Training step: 315/20000000. Loss: 0.31448671221733093.
2022-01-04 16:32:04.644613: W tensorflow/core/framework/dataset.cc:744] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Training step: 316/20000000. Loss: 0.6382215619087219.
...
Training step: 655/20000000. Loss: 0.23831556737422943.
2022-01-04 16:32:34.660503: W tensorflow/core/framework/dataset.cc:744] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Training step: 656/20000000. Loss: 0.18804389238357544.
...
Training step: 1011/20000000. Loss: 0.42434051632881165.
2022-01-04 16:33:04.707757: W tensorflow/core/framework/dataset.cc:744] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Training step: 1012/20000000. Loss: 0.19644755125045776.
...
Training step: 1359/20000000. Loss: 0.25821176171302795.
2022-01-04 16:33:34.750882: W tensorflow/core/framework/dataset.cc:744] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Training step: 1360/20000000. Loss: 0.6185128092765808.
...
Training step: 1380/20000000. Loss: 0.2339397817850113.
Traceback (most recent call last):
  File "/opt/apps/intel19/python3/3.9.2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/apps/intel19/python3/3.9.2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/train.py", line 460, in <module>
    app.run(main)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/train.py", line 454, in main
    train(simulator)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/train.py", line 355, in train
    pred_acc, target_acc = simulator.predict_accelerations(
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/learned_simulator.py", line 282, in predict_accelerations
    predicted_normalized_acceleration = self._encode_process_decode(
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/graph_network.py", line 404, in forward
    x, edge_features = self._processor(x, edge_index, edge_features)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/graph_network.py", line 292, in forward
    x, edge_features = gnn(x, edge_index, edge_features)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/graph_network.py", line 173, in forward
    x, edge_features = self.propagate(
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch_geometric/nn/conv/message_passing.py", line 309, in propagate
    out = self.message(**msg_kwargs)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/graph_network.py", line 198, in message
    edge_features = self.edge_fn(edge_features)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/functional.py", line 1299, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 15.75 GiB total capacity; 1002.27 MiB already allocated; 15.31 MiB free; 1018.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
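
The numbers in the error message hint at where the memory went: PyTorch has reserved only about 1 GiB of a 15.75 GiB card, yet almost nothing is free. A rough check of how much of GPU 0 is held outside PyTorch's allocator is sketched below. The helper names (`to_mib`, `memory_held_elsewhere`) are made up for illustration, and the result also includes CUDA context overhead, so treat it as an estimate only.

```python
import re

# Relevant part of the error message from the traceback above.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 15.75 GiB total "
    "capacity; 1002.27 MiB already allocated; 15.31 MiB free; 1018.00 MiB "
    "reserved in total by PyTorch)"
)

MIB_PER_UNIT = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a value given in MiB or GiB to MiB."""
    return float(value) * MIB_PER_UNIT[unit]


def memory_held_elsewhere(message):
    """Estimate MiB that is neither free nor reserved by PyTorch."""

    def field(label):
        # Matches e.g. "15.75 GiB total capacity" and returns the size in MiB.
        match = re.search(r"([\d.]+) (MiB|GiB) " + re.escape(label), message)
        return to_mib(match.group(1), match.group(2))

    total = field("total capacity")
    free = field("free")
    reserved = field("reserved in total by PyTorch")
    return total - free - reserved


held = memory_held_elsewhere(MESSAGE)
print(f"{held:.0f} MiB held outside PyTorch")
```

For this message the estimate comes to about 15095 MiB (roughly 14.7 GiB), which is in the same range as the 13561 MB that TensorFlow reports claiming on GPU 0 at startup.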

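Resolved by forcing the TensorFlow data loader onto the CPU. By default TensorFlow reserves nearly all the memory on every visible GPU (about 13.5 GB per device in the log above), which left PyTorch with only about 1 GiB on GPU 0.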
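
One way to keep TensorFlow off the GPUs is to hide them from it before the `tf.data` pipeline is built. This is a minimal sketch and not necessarily the change made in the repository. The function name `hide_gpus_from_tensorflow` is hypothetical, and the `ImportError` guard is only there so the snippet also runs where TensorFlow is not installed.

```python
def hide_gpus_from_tensorflow():
    """Restrict TensorFlow to the CPU so it does not reserve GPU memory.

    Must be called before TensorFlow initializes its devices, i.e. before
    any dataset or tensor is created. Returns True if TensorFlow was found
    and configured, False if it is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return False
    # An empty device list makes every GPU invisible to TensorFlow only;
    # PyTorch still sees and uses the GPUs as before.
    tf.config.set_visible_devices([], "GPU")
    return True


# Call this once at the top of the training script (assumed placement),
# before the input pipeline is constructed.
tensorflow_on_cpu = hide_gpus_from_tensorflow()
```

Unlike setting `CUDA_VISIBLE_DEVICES`, which would hide the GPUs from PyTorch as well, this only affects TensorFlow, so training still runs on `cuda`.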