carla-simulator/leaderboard

Memory leak issue when loading a new world

aaronh65 opened this issue · 15 comments

I'm trying to create a custom RL environment for CARLA using leaderboard_evaluator.py as a template, but I'm running into some issues when trying to reset the custom environment (after an episode is done). The functions that load the world/scenario and clean up after a scenario ends closely match what's done in leaderboard_evaluator.py (e.g. the load_and_wait_for_world and cleanup functions), but there's a memory leak somewhere that happens every time I reset the environment.

Using a memory profiler shows that each time the environment resets, the carla.Client takes up more and more memory. This eventually leads to an out-of-memory error that kills the process. Is there a cleanup method I'm missing, or some common pitfall when resetting environments that I should address to stop this from happening?

I can provide code if needed, but I wanted to first check whether this was a known issue.

Quick update: I used some memory profiler tools, and something I've seen is that every time I load a new world (in my environment's reset method) using self.client.load_world(config.town), where config is a ScenarioConfiguration, memory usage jumps by 50-100 MB and doesn't go back down by the same amount when I call my cleanup method. I tried looking for any hanging references to the world, but I can't seem to find anything.
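For reference, here's a stripped-down sketch of the reset/cleanup pattern I mean (the class and method names are just illustrative, and the psutil line is only there to show where the jump happens):

```python
import os

import carla
import psutil


class CarlaEnv:
    """Illustrative sketch of the custom RL environment, not the actual code."""

    def __init__(self, host='localhost', port=2000):
        self.client = carla.Client(host, port)
        self.client.set_timeout(60.0)
        self.world = None

    def reset(self, config):
        self._cleanup()
        # This is the call where RSS jumps by 50-100 MB and never comes back down
        self.world = self.client.load_world(config.town)

        settings = self.world.get_settings()
        settings.synchronous_mode = True
        settings.fixed_delta_seconds = 0.05
        self.world.apply_settings(settings)
        self.world.tick()

        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
        print(f'RSS after load_world: {rss_mb:.0f} MB')

    def _cleanup(self):
        # Mirrors the cleanup in leaderboard_evaluator.py: destroy spawned
        # actors and drop references to the old world
        if self.world is not None:
            for actor in self.world.get_actors().filter('vehicle.*'):
                actor.destroy()
            self.world = None
```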

I decided to run the memory profiler tool through leaderboard_evaluator.py, and it looks like the memory used increases with each successive scenario that's run, specifically within the _load_and_wait_for_world method. As shown in the graph below, each call to this method adds at least 50 MB of memory that doesn't seem to be reclaimed after the scenario's conclusion.

[Graph: leaderboard_memory_leak, showing memory usage growing with each successive scenario]

Great detective work! I have been trying to figure out why I keep running out of GPU memory, and this seems to be the problem. I get through about 20 route scenarios before I get the "out of memory" error... has anyone discovered a way to fix this (even if it's a bit hacky)? I need to run 100+ route scenarios, and with the current issue I can't get anywhere near that.

If you don't have to run the 100 route scenarios all at once, perhaps you could do something hacky with bash scripting. I think I'd try to figure out the approximate number of times you can load a new world before you get an OOM error, and set up your code to run for that number of route scenarios (around 20 in your case). Then, you could use some bash scripting to run the code multiple times to get through all the route scenarios you need, e.g. for 100 route scenarios, loop 5 times running 20 route scenarios per loop.

The memory that gets used up by loading new worlds seems relatively consistent and is freed once the process finishes, so I think looping it with a bash script should work.
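If it's easier, the same idea can also be written as a small Python wrapper instead of a bash loop. A rough sketch, where the route-file names and the --routes flag are placeholders for however you normally launch leaderboard_evaluator.py:

```python
import subprocess
import sys

# Placeholder: split your 100+ routes into chunks of ~20 routes each,
# e.g. one route XML file per chunk.
ROUTE_CHUNKS = [f'routes_chunk_{i}.xml' for i in range(5)]

for chunk in ROUTE_CHUNKS:
    # Each chunk runs in a fresh process, so whatever memory load_world()
    # leaks is released when that process exits.
    cmd = [sys.executable, 'leaderboard_evaluator.py', '--routes', chunk]
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print(f'chunk {chunk} exited with code {result.returncode}')
```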

Hey @aaronh65. Thanks for the information. We've also detected the issue and are trying to solve it. I'll report back when we have some answers to this issue.

There is another source of memory leakage: checkpoint memory is not freed after the completion of each route, and the checkpoint is reloaded for the next route. Check whether you override the destroy() method in your agent file.
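For a PyTorch-based agent, a minimal sketch of what an overridden destroy() could look like (assuming the checkpoint is held in something like self.net; adapt it to your own agent):

```python
import gc

import torch
from leaderboard.autoagents.autonomous_agent import AutonomousAgent


class MyModelAgent(AutonomousAgent):
    # setup(), sensors(), run_step() omitted for brevity

    def destroy(self):
        # Drop the reference to the checkpointed model so its GPU memory
        # can actually be released between routes.
        self.net = None
        gc.collect()
        torch.cuda.empty_cache()
```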

Are there any updates on this? I have been encountering the issue with memory growing on the self.client.load_world(config.town) call.

I didn't see anyone override the destroy() method, e.g.:

  1. transfuser/blob/main/leaderboard/team_code/auto_pilot.py
  2. leaderboard/autoagents/autonomous_agent

Even the destroy() function in the autonomous agent base file just passes.

So, regarding the destroy() method, do you mean that it should explicitly destroy some things?

I also ran into this problem, though not as often. In my script's terminal:

terminate called after throwing an instance of 'clmdep_msgpack::v1::type_error'
  what():  std::bad_cast

Or in the CARLA terminal:

terminating with uncaught exception of type clmdep_msgpack::v1::type_error: std::bad_castterminating with uncaught exception of type clmdep_msgpack::v1::type_error: std::bad_cast

Signal 6 caught.
Signal 6 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
CommonUnixCrashHandler: Signal=6
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=120416 LargeMemoryPoolOffset=251552 
Engine crash handling finished; re-raising signal 6 for the default handler. Good bye.
Aborted

auto_pilot.py does not load any model checkpoint onto the GPU, so it does not need a destroy() method. If you are running a model on the GPU, then you need to explicitly clear the checkpoint memory after every route in your code, e.g. in the destroy() method.

You'll encounter this problem only when running a large number of routes (fewer than 10 shouldn't be an issue). If you want to verify this, you can try running an evaluation with, say, 50 routes and monitor GPU memory usage after every route. Also, we noticed this issue in CARLA 0.9.10 and an earlier version of the leaderboard framework, so I don't know if it has been resolved in newer versions.
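If you want a quick way to log GPU memory between routes, a rough sketch using pynvml (which reports total device usage, not just what your own process has allocated):

```python
import pynvml


def log_gpu_memory(tag=''):
    """Print total memory in use on GPU 0, e.g. call this after each route."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f'[{tag}] GPU memory used: {info.used / 1e6:.0f} / {info.total / 1e6:.0f} MB')
    pynvml.nvmlShutdown()
```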

Thanks for pointing that out. I think it's not only GPU memory that grows but also CPU memory, and judging by the commits in this repo, I don't think this problem has been solved... I tried using Python's garbage collector to reclaim unused memory, but it didn't seem to help.
I tried a route file with 100-120 routes, and it's just like the issue author said: CPU memory grows every time a new world is loaded...

@glopezdiest Any updates on this ?

Not really. We are working on a new Leaderboard 2.0, and while this issue is part of our plan to improve the leaderboard, we haven't had a chance to look into it yet.

Hi, I'm trying to run a Reinforcement Learning experiment using the DynamicObjectCrossing example scenario on CARLA 0.9.13. I've used the psutil Python module as a memory profiler, and I can see RAM usage go up by a constant amount every time self.client.load_world(town) is called. I've tried to del self.world before the new world is loaded, but the problem still persists. Has anyone found a solution to this?
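For reference, this is roughly how I'm measuring it (simplified sketch; the town name and connection details are just examples):

```python
import os

import carla
import psutil

process = psutil.Process(os.getpid())
client = carla.Client('localhost', 2000)
client.set_timeout(60.0)

for episode in range(10):
    world = client.load_world('Town01')
    rss_mb = process.memory_info().rss / 1e6
    print(f'episode {episode}: RSS = {rss_mb:.0f} MB')  # grows every iteration
    del world  # dropping the reference does not bring the memory back down
```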