NotAnyMike/HRL

Check memory issues with model after several million iterations


Things we know:

  1. The leak is happening in each environment, which narrows down the candidates
    • It could be the workers creating the new processes
    • It could be something else in the car_racing env (car dynamics, world2d)
  2. There seems to be no problem with leaky objects in the env itself (checked using reset)
  • Add a summary of the resources used (a monitoring sketch follows this list)
    • Memory consumption
    • Difference in memory consumption between steps
    • Total difference with respect to the first step taken
  • Try gc.collect()
  • Check leaky objects
  • Check general object counts
  • Check the worker: there is a weird behaviour where, if I collect the leaky objects, memory consumption becomes very high; without that, the leak still happens but at the usual level
    • The number of leaky objects does not increase, but the number of lists does, significantly
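
A minimal sketch of the monitoring described above, assuming psutil is available for the RSS readings; the helper names (`log_memory`, `count_objects`) are hypothetical, not code from this repo:

```python
import gc
import os
from collections import Counter

import psutil

_process = psutil.Process(os.getpid())
_baseline_rss = None
_last_rss = None

def log_memory(step, every=1000, collect=False):
    """Print RSS, its delta since the previous report, and the total
    growth since the first report (all in MB)."""
    global _baseline_rss, _last_rss
    if step % every != 0:
        return
    if collect:
        gc.collect()                         # force a collection before measuring
    rss = _process.memory_info().rss / 1e6   # resident set size in MB
    if _baseline_rss is None:
        _baseline_rss = _last_rss = rss
    print(f"step {step}: rss={rss:.1f} MB, "
          f"delta={rss - _last_rss:+.1f} MB, "
          f"total={rss - _baseline_rss:+.1f} MB")
    _last_rss = rss

def count_objects(top=10):
    """Count live objects by type, e.g. to spot the growing number of lists."""
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    for name, n in counts.most_common(top):
        print(f"{name:20s} {n}")
```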

Final solution found: until a more stable, low-level fix is available, I will use the approach stated in the last comment

  • Raise an issue in OpenAI's gym repo
  • Implement my own solution
  • Test

Ideas

  • Print sys.getsizeof(env) every x steps
  • Add sys.gettotalrefcount() (a small sketch of these two follows this list)
  • Check meliae
  • Check Guppy-PE
  • Check http://www.valgrind.org/
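
A small sketch of the first two ideas; note that sys.getsizeof is shallow (it does not follow references, so growth inside the env's attributes will not show up) and sys.gettotalrefcount() only exists in debug builds of CPython:

```python
import sys

def size_report(env, step, every=10_000):
    """Hypothetical helper: report sizes every `every` steps."""
    if step % every != 0:
        return
    # shallow size only; does not include objects referenced by env
    print(f"step {step}: sys.getsizeof(env) = {sys.getsizeof(env)} bytes")
    if hasattr(sys, "gettotalrefcount"):     # debug builds of CPython only
        print(f"total refcount = {sys.gettotalrefcount()}")

# Guppy (guppy3 on Python 3) can additionally dump a full heap breakdown:
# from guppy import hpy
# print(hpy().heap())
```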

The memory leak has to do with the fixtureDef objects created for the tiles and added to the Box2D world. Even properly deleting them from the world, or deleting the world itself, does not solve the problem. The deletion resolves to a C++ class, so the problem appears to be at the C++ level and there is not much that can be done from Python.
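
A simplified sketch of the suspected pattern (not the exact car_racing code): a new fixtureDef is built for every tile and attached to the world, and neither DestroyBody nor dropping the world releases the memory observed in the leak.

```python
import Box2D
from Box2D.b2 import fixtureDef, polygonShape

world = Box2D.b2World((0, 0))

def create_tile(vertices):
    # a brand-new fixtureDef is constructed for every tile, on every reset
    fd = fixtureDef(shape=polygonShape(vertices=vertices))
    return world.CreateStaticBody(fixtures=fd)

tile = create_tile([(0, 0), (1, 0), (1, -1), (0, -1)])

# Neither of these appears to release the underlying C++ allocation:
world.DestroyBody(tile)
del world
```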

The chosen solution takes advantage of the multiple processes already running: a worker process can be killed and then restarted without affecting training, and that is the approach I am going with.
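
A rough sketch of that workaround, assuming SubprocVecEnv-style workers where each process owns one environment; the function names are hypothetical:

```python
import multiprocessing as mp

def worker(remote, make_env):
    env = make_env()
    while True:
        cmd, data = remote.recv()
        if cmd == "step":
            remote.send(env.step(data))
        elif cmd == "reset":
            remote.send(env.reset())
        elif cmd == "close":
            remote.close()
            break

def spawn_worker(make_env):
    parent, child = mp.Pipe()
    proc = mp.Process(target=worker, args=(child, make_env), daemon=True)
    proc.start()
    return parent, proc

def recycle_worker(parent, proc, make_env):
    """Kill a leaking worker and start a fresh one; the learner only
    sees one extra reset, so training is otherwise unaffected."""
    parent.close()
    proc.terminate()
    proc.join()
    return spawn_worker(make_env)
```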

A better solution, with minimal memory footprint, is to recycle the fixtureDef object. This was tested in the original car_racing env from OpenAI (see openai/gym#1509) and also in this env; the results are shown below (black is the solution).

[figure: memory consumption comparison; black is the fixtureDef-recycling solution]
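
For reference, a sketch of the recycling idea (paraphrased, not verbatim from openai/gym#1509): one fixtureDef is created up front and its vertices are mutated for every tile, instead of constructing a new fixtureDef per tile.

```python
import Box2D
from Box2D.b2 import fixtureDef, polygonShape

world = Box2D.b2World((0, 0))
# one reusable fixture definition, created once
fd_tile = fixtureDef(shape=polygonShape(vertices=[(0, 0), (1, 0), (1, -1), (0, -1)]))

def create_tile(vertices):
    fd_tile.shape.vertices = vertices          # mutate the shared fixtureDef
    return world.CreateStaticBody(fixtures=fd_tile)
```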

The last remaining test is during training; once that test passes, this issue should be closed.