Long simulations exhaust GPU resources
atait opened this issue · 9 comments
Probe data is transferred from GPU to host only at the end of a simulation's run_steps
. The device has to allocate space for it. For example, a Probe on 1M neurons for 1k timesteps would require 1M x 8k x float32 = 32GB, a full GPU's worth.
This can be solved by changing this
sim.run_steps(8000)
to
for _ in range(80):
sim.run_steps(100)
The first one gives an error:
(0) Resource exhausted: OOM when allocating tensor with shape[8000,1,1000000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
I think you can reproduce that error by taking any stateful network and increasing the simulation steps. It would be helpful to catch that and give the user some more insight.
The second one takes a long time because control is returning to the python interpreter. You always have to return at some point to append the probe data; however, currently, I don't see an option to avoid doing all the checks and standardizations and progress bar launching within sim.run_steps
, sim.predict_on_batch
, and sim._call_keras
on each iteration.
Am I missing something that allows you to efficiently break up the simulation steps? I thought the unroll_simulation
argument was it, but that seems to be different. Is there an option in keras to break up the steps, accumulate the data, and return the full data at the end? I'm a novice with keras.
If there is no option to do that, then one idea would be a new method that does those checks once, then has a loop containing only this line
nengo-dl/nengo_dl/simulator.py
Line 1050 in 9fb7854
this block
nengo-dl/nengo_dl/simulator.py
Line 1176 in 9fb7854
and something like
self._n_steps += steps_per_iter
Nengo OCL has an example of that loop here. It can run full speed basically until virtual memory is exhausted
https://github.com/nengo-labs/nengo-ocl/blob/5be4f5416ea4b2564fca1f5ec75bf28bfcc03829/nengo_ocl/simulator.py#L620
Probe data is transferred from GPU to host only at the end of a simulation's run_steps. The device has to allocate space for it. For example, a Probe on 1M neurons for 1k timesteps would require 1M x 8k x float32 = 32GB, a full GPU's worth.
This is kind of an unavoidable aspect of how TensorFlow/Keras are set up. It is designed for situations where you pass a bunch of data to the device, run some computations, and then fetch the results back to the host. The solution you describe (running the simulation in a loop) is the recommended solution, allowing you to customize the simulation length based on your performance/memory needs.
The second one takes a long time because control is returning to the python interpreter. You always have to return at some point to append the probe data; however, currently, I don't see an option to avoid doing all the checks and standardizations and progress bar launching within sim.run_steps, sim.predict_on_batch, and sim._call_keras on each iteration.
The overhead of repeatedly launching simulations is primarily within the getattr(self.keras_model, func_type)(**func_args)
call (i.e. it is a property of the underlying TensorFlow/Keras engine), it isn't the other code in e.g. _call_keras
that is slowing things down. This relates to the same point above, that TensorFlow/Keras (and therefore by extension NengoDL) are optimized for executing large computational blocks; it is not very efficient for running a lot of small computational blocks. If you have seen some non-trivial overhead in the _call_keras
function let us know and we can look into optimizing it.
One thing to keep in mind is that if you don't need the Probe output from every timestep, you can use the Probe.sample_every
parameter to subsample the output. In theory this would reduce the memory usage, but due to the current implementation constraints you don't actually see that memory saving until the end of the run_steps
block (as discussed here https://forum.nengo.ai/t/nengodl-high-memory-usage/1264/5?u=drasmuss). Do you need the probe output at every timestep, or would sample_every
work for you? If so we could prioritize adding the on-device subsampling, since we have a couple reported use cases.
Oh I should also mention that some of the suggestions here https://www.nengo.ai/nengo-dl/tips#speed-improvements should help with running the simulation in a loop (in particular if you can fit an unrolled model in memory then unroll_simulation
and use_loop=False
should reduce your overhead; inference_only=True
is always a good idea if you aren't training the model).
You can also use the keep_history
option in nengo_dl.configure_settings
to have probes just store the last output (you can further configure this for individual probes; see the docs). If you combine this with an integrative filter (e.g. a long lowpass filter or the perfect integrator nengo.LinearFilter([1], [1, 0])
), then you can measure e.g. neuron firing rates over the whole simulation without having to store the history.
One thing to keep in mind is that if you don't need the Probe output from every timestep, you can use the Probe.sample_every parameter to subsample the output.
An on-device version of sample_every
would give a new and useful knob in the tradeoff, but it will not solve this issue. Say a user can give up 90% of the data, then their simulation could run 10x longer without looping. If a user can't give up any data, and they want to run 1000x longer, then they still have to use the loop approach.
You can also use the
keep_history
option innengo_dl.configure_settings
to have probes just store the last output
That might be the way to go to keep device memory at minimum. It's kind of the opposite of on-device sample_every
. Are you suggesting the user do a loop like this?
nengo_dl.configure_settings(keep_history=False)
for _ in range(n_steps / like_sample_every):
sim.run_steps(like_sample_every)
The overhead of repeatedly launching simulations is primarily within the
getattr(self.keras_model, func_type)(**func_args)
call (i.e. it is a property of the underlying TensorFlow/Keras engine),
I did some profiling on this. Most of the overhead occurs in keras (like you said), unless the number of steps is 16 or less, in which case Nengo DL becomes the bottleneck. The change is in keras. Here's the test:
Experiment
This command runs 500k neurons ten times, then picks the methods of interest out of the profile report.
python -m cProfile -s tottime benchmark_wattsstrogatz.py dl 500_000,500_000,500_000,500_000,500_000,500_000,500_000,500_000,500_000,500_000 | egrep "(just_the_keras_call|_call_keras|predict_on_batch|run_steps|cumtime)"
I put that "getattr" line into its own method (just_the_keras_call
), so it would be seen by the profiler.
Output with one iteration of 1000 steps
ncalls tottime percall cumtime percall filename:lineno(function)
20 0.337 0.017 74.348 3.717 simulator.py:915(just_the_keras_call)
20 0.000 0.000 74.009 3.700 training.py:1767(predict_on_batch)
20 0.000 0.000 74.385 3.719 simulator.py:1117(run_steps)
20 0.000 0.000 74.356 3.718 simulator.py:918(_call_keras)
20 0.000 0.000 74.357 3.718 simulator.py:689(predict_on_batch)
Output with 100 iterations of 10 steps
ncalls tottime percall cumtime percall filename:lineno(function)
1010 0.104 0.000 94.549 0.094 simulator.py:915(just_the_keras_call)
1010 0.053 0.000 95.578 0.095 simulator.py:918(_call_keras)
1010 0.053 0.000 98.510 0.098 simulator.py:1117(run_steps)
1010 0.031 0.000 93.873 0.093 training.py:1767(predict_on_batch)
1010 0.011 0.000 95.715 0.095 simulator.py:689(predict_on_batch)
Some more data points:
iterations 1 10 25 50 100 500 1000
steps per iteration 1000 100 40 20 10 2 1
time spent in nengo DL <.3% 1.7% 3.5% 5.1% 59% 61% 60%
When using the loop approach, 60% of that time is spent in Nengo DL, outside keras, compared to <0.3% when not looping.
Here's a zoom in on that threshold, using a different number of total time steps for divisibility:
iterations 62 62
steps per iteration 17 16
time spent in nengo DL 5.8% 58%
16 steps
ncalls tottime percall cumtime percall filename:lineno(function)
630 0.048 0.000 89.716 0.142 simulator.py:915(just_the_keras_call)
630 0.024 0.000 90.174 0.143 simulator.py:918(_call_keras)
630 0.023 0.000 91.552 0.145 simulator.py:1117(run_steps)
630 0.016 0.000 89.404 0.142 training.py:1767(predict_on_batch)
630 0.004 0.000 90.232 0.143 simulator.py:689(predict_on_batch)
17 steps
ncalls tottime percall cumtime percall filename:lineno(function)
630 0.648 0.001 101.920 0.162 simulator.py:915(just_the_keras_call)
630 0.013 0.000 102.886 0.163 simulator.py:1117(run_steps)
630 0.013 0.000 102.160 0.162 simulator.py:918(_call_keras)
630 0.012 0.000 101.133 0.161 training.py:1767(predict_on_batch)
630 0.002 0.000 102.191 0.162 simulator.py:689(predict_on_batch)
Keras is significantly faster for 16 compared to 17 steps. Perhaps there is a cost to coordinate multiple registers or something like that. When keras is at that 16-step sweet spot, Nengo DL accounts for a majority of the overhead. It has the effect of dulling the threshold unless you are specifically looking for it. Also, I think it motivates development of a version of run_steps
that does not have all the overhead.
I think this is just an artifact of the way you are printing the profiling data. I'm guessing you've got something like
def just_the_keras_call():
getattr(self.keras_model, func_type)(**func_args)
when you look at the tottime
of just_the_keras_call
in that case, you're only measuring the time inside that function not including the time inside the inner keras function (i.e., basically just the overhead of calling just_the_keras_call
). So that number isn't actually telling you how much time you're spending inside the Keras engine.
If you look at the cumtime
you can get a better picture, as that's telling you how much time is spent inside each function including subcalls. So in your case with e.g. 16 steps, you're spending 91.552
seconds in run_steps
, and 89.716
seconds of that is inside just_the_keras_call
. That means that you're spending 1.836s (i.e. 2%) in NengoDL.
Yes, I think you are right. I was thinking of keras as this opaque call, which can be lumped in with the calling function by cprofile. Of course, that's not the case with keras because it has some non-compiled python before getting into primitive and/or compiled calls. cprofile records those, but they're missed by my grep. Thanks for pointing that out.
There is still this odd and statistically significant threshold between 16 and 17 steps, but we're talking about much smaller 10% instead of 60% differences. What I'm getting from this discussion of overhead is that Nengo DL has a lot of interpreted overhead, but it's not a big deal because keras' is even worse. Is that fair?
I think the 16 vs 17 steps thing is just an environmental quirk on your machine (or at least, it doesn't show up on my machine). There is also a lot less overhead in general on my machine. Here is what it looks like for me with
16 steps
ncalls tottime percall cumtime percall filename:lineno(function)
248/62 0.000 0.000 5.194 0.084 ...\nengo\nengo\utils\magic.py:174(__call__)
124/62 0.000 0.000 5.194 0.084 ...\nengo-dl\nengo_dl\simulator.py:65(require_open)
62 0.002 0.000 5.193 0.084 ...\nengo-dl\nengo_dl\simulator.py:1118(run_steps)
62 0.000 0.000 5.182 0.084 ...\nengo-dl\nengo_dl\simulator.py:689(predict_on_batch)
124/62 0.002 0.000 5.181 0.084 ...\nengo-dl\nengo_dl\simulator.py:49(with_self)
62 0.001 0.000 5.179 0.084 ...\nengo-dl\nengo_dl\simulator.py:915(_call_keras)
62 0.172 0.003 5.154 0.083 ...\nengo-dl\nengo_dl\simulator.py:1050(just_the_keras_call)
vs 17 steps
ncalls tottime percall cumtime percall filename:lineno(function)
232/58 0.000 0.000 5.127 0.088 ...\nengo\nengo\utils\magic.py:174(__call__)
116/58 0.000 0.000 5.127 0.088 ...\nengo-dl\nengo_dl\simulator.py:65(require_open)
58 0.001 0.000 5.126 0.088 ...\nengo-dl\nengo_dl\simulator.py:1118(run_steps)
58 0.000 0.000 5.116 0.088 ...\nengo-dl\nengo_dl\simulator.py:689(predict_on_batch)
116/58 0.001 0.000 5.115 0.088 ...\nengo-dl\nengo_dl\simulator.py:49(with_self)
58 0.001 0.000 5.113 0.088 ...\nengo-dl\nengo_dl\simulator.py:915(_call_keras)
58 0.171 0.003 5.090 0.088 ...\nengo-dl\nengo_dl\simulator.py:1050(just_the_keras_call)
Note that that is ~0.7% overhead in both cases.
Here is how I am timing things (just doing it in-code without any grepping, for one 500000 neuron run) in case that makes a difference in your results
with sim_class(model, **sim_kwargs) as sim:
t_sim = time.time()
# -- warmup
sim.run(0.01)
t_warm = time.time()
# -- long-term timing
with cProfile.Profile() as pr:
# sim.run(simtime)
steps_per_iteration = 16
for _ in range(n_steps // steps_per_iteration):
sim.run_steps(steps_per_iteration)
t_run = time.time()
pstats.Stats(pr).sort_stats("cumtime").print_stats(20)
So based on my results anyway, I don't think I'd say it's the case that NengoDL has a lot of interpreted overhead, but I'd be interested why your environment seems to have more (although still <2% overhead in the worst case).
Thank you for doing your own tests! Sorry for the delayed response. I agree, based on the cumtime measurement, that Nengo DL is not introducing significant additional overhead. I also learned how to use NVIDIA visual profiler. Not sure how it compares to the TF profiler, but it gives some pretty pictures that correspond to the results above.
This is one iteration of the for
loop. Keras python overhead is greater than Nengo DL's. There's also something going on in the TF kernel where data is being transferred to host every timestep. I have inference_only=True
, so I don't know what it could be. Do you have an idea? Do you have a CUDA-level profiler that reproduces it?
Suppose you did eliminate that kernel dead time, then the Keras python overhead associated with looping would be significant, although it amortizes with the number of steps per loop.
I just found this interesting. I think we can conclude the original issue though. Perhaps a simple try/except block would be helpful for future Nengo DL users. Thanks again.
Just an update, we do plan to add some documentation to help users diagnose/fix memory issues to help resolve this, it'll just be a little while before we have a chance to get to that I suspect.