System resource issues
avolny opened this issue · 1 comment
Hi @xhujoy,
thanks for creating and sharing this code; I immensely appreciate it!
I have some questions though. On what system have you been running the simulations?
I run it on a desktop with the following configuration: an i5-6500 (4 cores), 8 GB RAM, and a GTX 1060 6 GB,
because so far I don't have access to decent servers to run it on.
However, for running the original 16 threads my setup seems largely insufficient.
- The first clear problem is that I don't have a 16-core CPU, so when running more than 3-4 parallel threads the simulation is really slow. Is there some way to make running more threads on a quad-core more feasible?
- The second issue is RAM: the program seems to use a lot of it, and as the simulation runs it just keeps taking more and more. Is this expected behavior, or are there resources that could be freed during runtime but are not?
- The third issue is GPU VRAM: when the simulation starts, TF allocates 4 GB of memory on the GPU, but as the simulation progresses TF allocates more and more VRAM until it eventually crashes the whole run by consuming all GPU memory. Once again, is this behavior inevitable, or is it possible to turn something off so that it only uses the memory allocated at start-up? (See the short sketch right after this list.)
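For concreteness, this is roughly what I had in mind. It is not taken from this repo, and the numbers (e.g. the 0.6 memory fraction for my 6 GB card) are just placeholders; I'm assuming the standard TF 1.x single-Session setup.

```python
import tensorflow as tf

# Cap TF's GPU memory usage instead of letting it grow until the card is full.
gpu_options = tf.GPUOptions(
    allow_growth=True,                     # allocate lazily instead of grabbing everything up front
    per_process_gpu_memory_fraction=0.6)   # hard cap at ~60% of the 6 GB card
config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=config)

# ... build the model here ...

# A footprint that keeps growing during training is often a sign that new ops are
# being added to the graph inside the training loop; finalizing the graph turns
# that into an immediate error instead of a silent RAM/VRAM leak.
sess.graph.finalize()
```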
All of these issues come down to the limited system resources of my PC and the fact that the A3C algorithm
requires running many instances in parallel. Is it hopeless, or is there some way to make A3C work on a lower-tier machine? It seems to me that in my case the algorithm is suboptimal by design,
because it relies on many parallel threads to approximate drawing i.i.d. samples during training (if I understood the idea correctly). That means running the simulation with 3 or 4 parallel threads should give much worse results than running with 16, because the samples will be a lot more correlated.
Once again, am I doomed to either getting access to a better machine or switching to deep Q-learning, or is there some way to use the A3C algorithm efficiently even on a lesser machine?
Thanks for your answer in advance!
- The original paper uses 64 threads. I tested this code with 64 threads on my computer and it was slow as well, so I use 16 threads; you can try running the code with fewer threads.
- PPO is a more effective DRL algorithm and requires fewer system resources, but you may need to write some code and test the algorithm yourself (a minimal sketch of the clipped PPO loss is below).
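For reference, the heart of PPO is the clipped surrogate objective. The snippet below is only a minimal TF 1.x sketch with a toy feed-forward network and made-up sizes, not the network or interface from this repo; you would still need to collect rollouts with the old policy and feed the log-probabilities, advantages, and returns yourself.

```python
import tensorflow as tf

# Rollout data collected with the *old* (behavior) policy.
obs        = tf.placeholder(tf.float32, [None, 16], name="obs")        # toy observation size
actions    = tf.placeholder(tf.int32,   [None],     name="actions")
logp_old   = tf.placeholder(tf.float32, [None],     name="logp_old")   # log pi_old(a_t | s_t)
advantages = tf.placeholder(tf.float32, [None],     name="advantages")
returns    = tf.placeholder(tf.float32, [None],     name="returns")

# Tiny stand-in policy/value network (4 toy discrete actions).
hidden     = tf.layers.dense(obs, 64, activation=tf.nn.relu)
logits     = tf.layers.dense(hidden, 4)
value_pred = tf.squeeze(tf.layers.dense(hidden, 1), axis=1)

# log pi_theta(a_t | s_t) for the actions that were actually taken.
logp_all = tf.nn.log_softmax(logits)
logp_new = tf.reduce_sum(tf.one_hot(actions, 4) * logp_all, axis=1)

# PPO clipped surrogate objective (maximized, hence the minus sign on the loss).
clip_eps = 0.2
ratio    = tf.exp(logp_new - logp_old)       # r_t = pi_theta / pi_old
surr1    = ratio * advantages
surr2    = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
policy_loss = -tf.reduce_mean(tf.minimum(surr1, surr2))

# Value loss plus a small entropy bonus, as in the PPO paper.
value_loss = 0.5 * tf.reduce_mean(tf.square(returns - value_pred))
entropy    = -tf.reduce_mean(tf.reduce_sum(tf.nn.softmax(logits) * logp_all, axis=1))

loss     = policy_loss + 0.5 * value_loss - 0.01 * entropy
train_op = tf.train.AdamOptimizer(3e-4).minimize(loss)
```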