RuntimeError: CUDA out of memory
jpvantassel opened this issue · 1 comment
jpvantassel commented
Summary
gns runs out of GPU memory after ~1380 training steps when training on the DeepMind Sand dataset.
Steps to Reproduce
1. Download the Sand dataset and stage it on scratch.
2. Launch a job on an rtx node.
3. Activate the venv on Frontera.
4. Run training.
Logs
2022-01-04 16:31:09.932142: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-04 16:31:22.916150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13561 MB memory: -> device: 0, name: Quadro RTX 5000, pci bus id: 0000:02:00.0, compute capability: 7.5
2022-01-04 16:31:23.007940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13563 MB memory: -> device: 1, name: Quadro RTX 5000, pci bus id: 0000:03:00.0, compute capability: 7.5
2022-01-04 16:31:23.009356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 13563 MB memory: -> device: 2, name: Quadro RTX 5000, pci bus id: 0000:82:00.0, compute capability: 7.5
2022-01-04 16:31:23.011015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 13563 MB memory: -> device: 3, name: Quadro RTX 5000, pci bus id: 0000:83:00.0, compute capability: 7.5
context: {'particle_type': <tf.Tensor 'Reshape_1:0' shape=(None,) dtype=int64>, 'key': <tf.Tensor 'ParseSingleSequenceExample/ParseSequenceExample/ParseSequenceExampleV2:3' shape=() dtype=int64>}
features: {'position': <tf.Tensor 'Reshape:0' shape=(321, None, 2) dtype=float32>}
device = cuda
2022-01-04 16:31:34.652143: W tensorflow/core/framework/dataset.cc:744] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Training step: 0/20000000. Loss: 8.577290534973145.
...
Training step: 315/20000000. Loss: 0.31448671221733093.
2022-01-04 16:32:04.644613: W tensorflow/core/framework/dataset.cc:744] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Training step: 316/20000000. Loss: 0.6382215619087219.
...
Training step: 655/20000000. Loss: 0.23831556737422943.
2022-01-04 16:32:34.660503: W tensorflow/core/framework/dataset.cc:744] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Training step: 656/20000000. Loss: 0.18804389238357544.
...
Training step: 1011/20000000. Loss: 0.42434051632881165.
2022-01-04 16:33:04.707757: W tensorflow/core/framework/dataset.cc:744] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Training step: 1012/20000000. Loss: 0.19644755125045776.
...
Training step: 1359/20000000. Loss: 0.25821176171302795.
2022-01-04 16:33:34.750882: W tensorflow/core/framework/dataset.cc:744] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
Training step: 1360/20000000. Loss: 0.6185128092765808.
...
Training step: 1380/20000000. Loss: 0.2339397817850113.
Traceback (most recent call last):
File "/opt/apps/intel19/python3/3.9.2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/apps/intel19/python3/3.9.2/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/train.py", line 460, in <module>
app.run(main)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/train.py", line 454, in main
train(simulator)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/train.py", line 355, in train
pred_acc, target_acc = simulator.predict_accelerations(
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/learned_simulator.py", line 282, in predict_accelerations
predicted_normalized_acceleration = self._encode_process_decode(
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/graph_network.py", line 404, in forward
x, edge_features = self._processor(x, edge_index, edge_features)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/graph_network.py", line 292, in forward
x, edge_features = gnn(x, edge_index, edge_features)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/graph_network.py", line 173, in forward
x, edge_features = self.propagate(
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch_geometric/nn/conv/message_passing.py", line 309, in propagate
out = self.message(**msg_kwargs)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/gns/graph_network.py", line 198, in message
edge_features = self.edge_fn(edge_features)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/modules/activation.py", line 98, in forward
return F.relu(input, inplace=self.inplace)
File "/work2/04709/vantaj94/frontera/tacc/projects/cognitasium/gns/venv/lib/python3.9/site-packages/torch/nn/functional.py", line 1299, in relu
result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 15.75 GiB total capacity; 1002.27 MiB already allocated; 15.31 MiB free; 1018.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
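The exception itself suggests `max_split_size_mb`, but that hint targets allocator fragmentation (reserved >> allocated), which does not match this log (1018 MiB reserved vs. 1002 MiB allocated). For reference only, the allocator option is passed through an environment variable that must be set before the first CUDA allocation; the value 128 below is an arbitrary example, not a recommendation:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read once, at the first CUDA allocation,
# so it must be set before any tensor is moved to the GPU
# (i.e. before model.to("cuda") or similar).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```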
jpvantassel commented
Resolved by forcing the TensorFlow data loader onto the CPU.
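For context: the startup log shows TensorFlow claiming ~13.5 GiB on each GPU (its default is to reserve nearly all device memory), which is consistent with PyTorch finding only 15.31 MiB free despite having reserved just ~1 GiB itself. A minimal sketch of the workaround, assuming the tf.data input pipeline is built after this call (the actual change in gns/train.py may differ):

```python
import tensorflow as tf

# Hide every GPU from TensorFlow before the tf.data input pipeline is
# built, so TensorFlow stays on the CPU and does not pre-allocate GPU
# memory. This only affects TensorFlow's device placement; PyTorch opens
# its own CUDA context and still sees all GPUs.
tf.config.set_visible_devices([], "GPU")
```

Note that `set_visible_devices` must run before TensorFlow initializes its GPUs (i.e. before any op touches the device), or it raises a RuntimeError.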