microsoft/AutonomousDrivingCookbook

MemoryError when training DistributedRL after about an hour

JazzTao opened this issue · 1 comment

Problem description

When I train DistributedRL:
https://github.com/Microsoft/AutonomousDrivingCookbook/blob/master/DistributedRL/LaunchLocalTrainingJob.ipynb

it works at first. After about an hour, I get the error below.
(PS: To get "train.bat" to run the first time, I changed "threshold=np.nan" to "threshold=sys.maxsize" at line 609 of "distributed_agent.py". I don't know whether that matters.)
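
For readers unfamiliar with this setting, here is a minimal sketch of what such a change does, assuming the edited line is a call to NumPy's `np.set_printoptions` (the issue does not quote the line itself, so this is a guess). In that function, `threshold` only controls when printed arrays are abbreviated with "...", so it affects text output rather than how much data is held in memory.

```python
import sys
import numpy as np

# Assumption: line 609 of distributed_agent.py is a call like this one.
# The threshold is the element count above which NumPy summarizes
# printed arrays with "..." instead of showing every element.
np.set_printoptions(threshold=sys.maxsize)

a = np.arange(2000)          # larger than NumPy's default threshold of 1000
text = np.array2string(a)    # uses the global print options set above

# With the threshold raised, the string contains every element.
print(len(text.split()))
```

If the line really is a print option, it would be unlikely to cause the MemoryError by itself, although printing very large arrays in full can be slow.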

My English is not very good, so I am not sure I have explained this clearly.

Problem details

Start time: 2019-04-15 07:23:33.036246, end time: 2019-04-15 07:23:45.755073
Percent random actions: 0.10204081632653061
Num total actions: 98
Generating 98 minibatches...
Sampling Experiences.
Publishing AirSim Epoch.
Publishing epoch data and getting latest model from parameter server...
Traceback (most recent call last):
  File "distributed_agent.py", line 643, in <module>
    agent.start()
  File "distributed_agent.py", line 80, in start
    self.__run_function()
  File "distributed_agent.py", line 175, in __run_function
    self.__publish_batch_and_update_model(sampled_experiences, frame_count)
  File "distributed_agent.py", line 401, in __publish_batch_and_update_model
    gradients = self.__model.get_gradient_update_from_batches(batches)
  File "E:\File\Train_Airsim\AD_Cookbook_AirSim\python36_DRL\Share\scripts_downpour\app\rl_model.py", line 135, in get_gradient_update_from_batches
    post_states = np.array(batches['post_states'])
MemoryError
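
The traceback ends in `np.array(batches['post_states'])`, which has to allocate one contiguous block for the whole batch. As a rough diagnostic, the sketch below estimates how many bytes such an array would need for a given shape and dtype. The helper name `estimate_array_bytes` and the shape used here are hypothetical; the real shape of `post_states` is not given in the issue.

```python
import numpy as np

def estimate_array_bytes(shape, dtype):
    """Return the number of bytes a dense array of this shape and dtype needs."""
    n_elements = int(np.prod(shape, dtype=np.int64))
    return n_elements * np.dtype(dtype).itemsize

# Hypothetical shape for illustration only:
# 98 minibatches x 32 samples x one 59x255x3 image per sample.
hypothetical_shape = (98, 32, 59, 255, 3)

for dtype in (np.float64, np.float32, np.uint8):
    mib = estimate_array_bytes(hypothetical_shape, dtype) / 2**20
    print(f"{np.dtype(dtype).name}: {mib:.1f} MiB")
```

Comparing an estimate like this with the free RAM on the machine (8 GB total here) may help show whether the batch itself is too large or whether memory is instead growing over the hour of training.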

Experiment/Environment details

  • Tutorial used: DistributedRL
  • Environment used: neighborhood
  • Versions of artifacts used (if applicable): TensorFlow 1.13.1; Python 3.6.2; Keras 2.1.2; NumPy 1.16.2
  • Hard disk state: C: 9.48 GB available; E: 25.4 GB available (DistributedRL workspace)
  • Hardware: GPU GTX 960M (4 GB); memory 8 GB; CPU i5-6300HQ

What is your solution? I ran into the same issue on the newest version. Instead of changing "threshold=np.nan" to "threshold=sys.maxsize", I changed it to "threshold=np.inf" so that the script would run without raising an error.

Thank you!