microsoft/AutonomousDrivingCookbook

DistributedRL agent performance nowhere close to gif

wonjoonSeol opened this issue · 4 comments

Problem description

Improving reward function for DistributedRL

Problem details

RL agent

Is the gif file in the readme the result of running the base tutorial code you provided without any modifications? My local training results (> 5 days on a GTX 980) from running the tutorial code as-is are nowhere close to this performance.

  • The agent zig-zags too much. There is no goal position for the agent to reach, and the reward doesn't necessarily punish zig-zagging as long as the agent stays on the track lines defined in rewards.txt.
  • The agent always crashes in the shadows. I intend to "modify the brightness of random patches of the input image to simulate additional shadows", as suggested in another issue (see the sketch below).
  • Running sample_model.json also doesn't produce similar results.

But I'm just wondering why the model in the gif performs so well without that augmentation implemented?
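For concreteness, this is the kind of augmentation I have in mind: darkening a few random rectangular patches of each frame before it is fed to the network. It is only a rough sketch; the function name, patch sizes, and darkening factors are my own choices, not anything from the tutorial code:

import numpy as np

def add_random_shadows(image, num_patches=3, min_factor=0.4, max_factor=0.8):
    # Darken a few random rectangular patches to simulate extra shadows.
    # image: HxWxC uint8 array as captured from the simulator camera.
    # Returns a new uint8 array; the input image is left unchanged.
    shadowed = image.astype(np.float32)
    h, w = image.shape[:2]
    for _ in range(num_patches):
        # Pick a patch covering between roughly 1/8 and 1/2 of each dimension.
        ph = np.random.randint(h // 8, h // 2)
        pw = np.random.randint(w // 8, w // 2)
        y = np.random.randint(0, h - ph)
        x = np.random.randint(0, w - pw)
        # Scale the patch's brightness down by a random factor.
        shadowed[y:y + ph, x:x + pw] *= np.random.uniform(min_factor, max_factor)
    return np.clip(shadowed, 0, 255).astype(np.uint8)

The idea would be to apply this to some fraction of training frames so the network sees darkened road surfaces often enough to stop treating them as obstacles.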

Experiment/Environment details

  • Tutorial used: DistributedRL
  • Environment used: Neighborhood

As per #85, I have updated the code for the latest AirSim binary and am currently re-training there. I will see if this makes any difference.

I'm not sure why your model is not performing well. The trained model in the gif did come from the provided code, although it was trained on a cluster using the distributed method.

Running RunModel with sample_model.json, without loading any weights, shows performance similar to the gif with some resets - it still crashes every now and then.

But when I try to train on top of that model by loading sample_model.json and training further on the local machine, performance actually gets a lot worse. I am not loading any weights, just the model; I have commented out the part that loads pretrained weights for the conv layers, since the checkpoint outputs a single json only.

Furthermore, from time to time I get this error message and the training halts:

Getting Pose
Waiting for momentum to die
Resetting
Running car for a few seconds...
Model predicts 0
Traceback (most recent call last):
  File "distributed_agent.py", line 649, in <module>
    agent.start()
  File "distributed_agent.py", line 84, in start
    self.__run_function()
  File "distributed_agent.py", line 164, in __run_function
    experiences, frame_count = self.__run_airsim_epoch(False)
  File "distributed_agent.py", line 323, in __run_airsim_epoch
    state_buffer = self.__append_to_ring_buffer(self.__get_image(), state_buffer, state_buffer_len)
  File "distributed_agent.py", line 465, in __get_image
    image_rgba = image1d.reshape(image_response.height, image_response.width, 4)
ValueError: cannot reshape array of size 1 into shape (0,0,4)

Any idea on this? Why does the image array sometimes have a different size?

Yes, the sample_model.json isn't perfect, and will sometimes crash.

Further training won't work. You'll end up overfitting. I noticed while training that if we let the model run for too long, it would start to perform worse. Unfortunately, I don't have a great way of detecting the overfitting other than stopping once it starts to perform decently.

Regarding the error you are getting - it looks like the exe is occasionally not returning any data. I've tried to repro this locally but can't get it to happen. A simple fix would be to bail out if we receive an image of size zero.
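Something along these lines should work - a minimal sketch only, where the image request mirrors what __get_image in distributed_agent.py already does, and the size check plus retry loop are the new part. The helper name and max_retries are just illustrative, and the import will need to match whichever AirSim client version you're running:

import time
import numpy as np
from AirSimClient import ImageRequest, AirSimImageType  # adjust to your client version

def get_image_with_retry(car_client, max_retries=10):
    # Request a scene image, retrying whenever the exe hands back an empty frame,
    # instead of letting the reshape crash with "cannot reshape array of size 1".
    for _ in range(max_retries):
        response = car_client.simGetImages(
            [ImageRequest(0, AirSimImageType.Scene, False, False)])[0]
        if response.height > 0 and response.width > 0:
            # np.frombuffer also works here in place of np.fromstring.
            image1d = np.fromstring(response.image_data_uint8, dtype=np.uint8)
            return image1d.reshape(response.height, response.width, 4)
        time.sleep(0.05)  # empty frame - wait briefly and ask again
    raise RuntimeError('Simulator returned {} empty images in a row'.format(max_retries))

Dropping a check like this into __get_image (or skipping the frame in __append_to_ring_buffer when it comes back empty) should keep the agent from halting.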

That's very interesting. Thank you.