nengo/nengo-dl

Issue running with GPU on Google Colaboratory

thomastiotto opened this issue · 4 comments

I'm trying to run a custom NengoDL model on Google Colaboratory, but after around 15 seconds of simulation it crashes with the resource-exhausted error shown below. I'm using nengo-dl 3.2.1.dev0 and TensorFlow 2.3.0.
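For context, the failing call is the sim.run() inside a loop like the one below. This is a heavily simplified sketch rather than my actual script: the ensemble sizes and sim_time are placeholders, and nengo.PES() stands in for my custom mPES rule just so the snippet is self-contained.

import nengo
import nengo_dl

# Heavily simplified sketch of experiments/test_builder_mPES.py.
# In the real script the connection uses my custom mPES() learning rule;
# nengo.PES() stands in here so the snippet runs on its own.
with nengo.Network() as model:
    pre = nengo.Ensemble(100, dimensions=3)
    post = nengo.Ensemble(100, dimensions=3)
    conn = nengo.Connection(pre, post, learning_rule_type=nengo.PES())

sim_time = 30  # placeholder value
simulation_discretisation = 1  # run the simulation in this many chunks

with nengo_dl.Simulator(model) as sim:
    for step in range(simulation_discretisation):
        print(f"Running discretised step {step + 1} of {simulation_discretisation}")
        sim.run(sim_time / simulation_discretisation)

The full console output and traceback: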

Using run optimisation
Simulating with mPES()
Backend is nengo_dl
Build finished in 0:00:01
Optimization finished in 0:00:00
| # Constructing graph | 0:00:002020-08-12 12:04:58.480715: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
Construction finished in 0:00:02

Running discretised step 1 of 1
| Simulating | 0:00:03WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_v1.py:2070: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_v1.py:2070: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
| Simulating | 0:00:03
| Simulating # | 0:00:182020-08-12 12:05:19.037688: W tensorflow/core/common_runtime/bfc_allocator.cc:431] Allocator (GPU_0_bfc) ran out of memory trying to allocate 7.47MiB (rounded to 7836928)requested by op TensorGraph/while/iteration_0/SimmPESBuilder/cond/cond/Where_1
Current allocation summary follows.
2020-08-12 12:05:19.043930: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ****************************************************************************************************
2020-08-12 12:05:19.043984: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at where_op.cc:327 : Resource exhausted: OOM when allocating tensor with shape[244904,4] and type int64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Traceback (most recent call last):
File "experiments/test_builder_mPES.py", line 120, in
sim.run( sim_time / simulation_discretisation )
File "/content/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/nengo-dl/nengo_dl/simulator.py", line 1106, in run
self.run_steps(steps, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/nengo/utils/magic.py", line 181, in call
return self.wrapper(self.wrapped, self.instance, args, kwargs)
File "/content/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/nengo-dl/nengo_dl/simulator.py", line 74, in require_open
return wrapped(*args, **kwargs)
File "/content/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/nengo-dl/nengo_dl/simulator.py", line 1157, in run_steps
data, n_steps=actual_steps, stateful=stateful
File "/usr/local/lib/python3.6/dist-packages/nengo/utils/magic.py", line 181, in call
return self.wrapper(self.wrapped, self.instance, args, kwargs)
File "/content/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/nengo-dl/nengo_dl/simulator.py", line 74, in require_open
return wrapped(*args, **kwargs)
File "/content/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/nengo-dl/nengo_dl/simulator.py", line 736, in predict_on_batch
"predict_on_batch", x=x, n_steps=n_steps, stateful=self.stateful, **kwargs
File "/usr/local/lib/python3.6/dist-packages/nengo/utils/magic.py", line 181, in call
return self.wrapper(self.wrapped, self.instance, args, kwargs)
File "/content/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/nengo-dl/nengo_dl/simulator.py", line 58, in with_self
output = wrapped(*args, **kwargs)
File "/content/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/Learning-to-approximate-functions-using-niobium-doped-strontium-titanate-memristors/nengo-dl/nengo_dl/simulator.py", line 1044, in _call_keras
outputs = getattr(self.keras_model, func_type)(**func_args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_v1.py", line 1214, in predict_on_batch
outputs = self.predict_function(inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 3825, in call
run_metadata=self.run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1472, in call
run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[244904,4] and type int64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node TensorGraph/while/iteration_0/SimmPESBuilder/cond/cond/Where_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[TensorGraph/Identity/_71]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[244904,4] and type int64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node TensorGraph/while/iteration_0/SimmPESBuilder/cond/cond/Where_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
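If it helps, my understanding of the hint about report_tensor_allocations_upon_oom is that it refers to TF1-style run options, roughly the snippet below in plain TensorFlow; I don't know how (or whether) that can be passed through the NengoDL simulator, so this is just for reference:

import tensorflow as tf

# Plain-TensorFlow illustration of the hint in the error message; this is
# not wired into NengoDL, just what the RunOptions flag looks like.
tf.compat.v1.disable_eager_execution()

run_options = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)

x = tf.compat.v1.placeholder(tf.float32, shape=[None, 4])
y = tf.reduce_sum(x)

with tf.compat.v1.Session() as sess:
    # With options=run_options, TF lists the live tensors if this run OOMs.
    sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]}, options=run_options)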

Looks like the Colab instance just doesn't have enough memory to run the model. The 15 seconds of initial simulation time is likely just the one-time optimization that happens the first time you execute a TensorFlow graph; as soon as it starts actually running the model, it runs out of memory.

So it's definitely an issue with Colab not allocating me enough memory on the GPU?

Hard to say for sure; I'm just going off of this error message:

(1) Resource exhausted: OOM when allocating tensor with shape[244904,4] and type int64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

If you want to double-check, you could run on a local machine (where I presume this runs without error) and check how much memory it consumes, then see whether that is larger than the amount of GPU memory available to you when running on Colab.
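For example, something like this (just one way to do it, untested on your setup) would print how much GPU memory the Colab runtime actually gives you, which you can compare against the peak usage you see locally:

import subprocess

# Ask the NVIDIA driver for the GPU's total/used/free memory; this works on
# Colab GPU runtimes and on most local machines with nvidia-smi installed.
print(
    subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,memory.used,memory.free",
            "--format=csv",
        ]
    ).decode()
)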

Closing since this seems like a Colab/environment issue, but let us know if that is not the case and we can reopen!