microsoft/AirSim-NeurIPS2019-Drone-Racing

Airsim crashes during training for reinforcement learning


I am using reinforcement learning to train my model, but the AirSim engine crashes after a few hundred episodes. Before crashing, it prints the following error multiple times:

error while optimizing with nlopt: This likely means the optimization aborted early.
error while optimizing with nlopt: This likely means the optimization aborted early.
error while optimizing with nlopt: This likely means the optimization aborted early.
error while optimizing with nlopt: This likely means the optimization aborted early.
error while optimizing with nlopt: This likely means the optimization aborted early.

and finally I have this error:

Signal 11 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
CommonUnixCrashHandler: Signal=11
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=86336 LargeMemoryPoolOffset=217472 
terminating with uncaught exception of type std::__1::bad_weak_ptr: bad_weak_ptr
Signal 6 caught.
Failed to find symbol file, expected location:
"/home/kaveh/AirSim/AirSim_Training/AirSimExe/Binaries/Linux/AirSimExe.sym"
terminating with uncaught exception of type std::__1::bad_weak_ptr: bad_weak_ptr
Signal 6 caught.
Malloc Size=44187 LargeMemoryPoolOffset=261675 
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.
Segmentation fault (core dumped)

Are you using moveOnSpline as your "action", i.e., is your policy outputting target waypoints and/or velocities that are then sent to the moveOnSpline APIs?
When this error happens, can you try logging the waypoints being sent to moveOnSpline and the drone odometry (from getMultirotorState()) so we can reproduce it?
Or is it that you are using moveOnSpline for taking off?
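A minimal sketch of the kind of logging requested above, assuming an AirSim-style client object that exposes getMultirotorState(). The helper name `log_step`, the JSON-lines format, and the default vehicle name `drone_1` are illustrative assumptions, not part of the AirSim API.

```python
import json
import time


def log_step(client, waypoints, log_file, vehicle_name="drone_1"):
    """Append one JSON line with the commanded waypoints and the drone position.

    `client` is assumed to be an AirSim-style client; `vehicle_name` is an
    assumed default and should match the name used in your settings.
    """
    state = client.getMultirotorState(vehicle_name=vehicle_name)
    pos = state.kinematics_estimated.position
    record = {
        "wall_time": time.time(),
        # Waypoints are whatever (x, y, z) tuples were about to be sent.
        "waypoints": [[float(c) for c in wp] for wp in waypoints],
        "position": [pos.x_val, pos.y_val, pos.z_val],
    }
    log_file.write(json.dumps(record) + "\n")
    return record
```

Calling this right before every motion command leaves a file whose last few lines show what was sent just before the crash.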

No, I am using moveByVelocityAsync. I only need to move the drone a little at each step based on the output of my policy function, so there is no need to use moveOnSpline. I send a move action to the drone every 100 milliseconds. I also tried different time steps but always got the same error.
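For reference, the stepping scheme described above might look roughly like the sketch below. It assumes an AirSim-style client whose moveByVelocityAsync takes (vx, vy, vz, duration, vehicle_name=...); the helper name `apply_action` and the default vehicle name `drone_1` are hypothetical.

```python
def apply_action(client, velocity, step_seconds=0.1, vehicle_name="drone_1"):
    """Send one velocity command lasting `step_seconds` (100 ms by default).

    `velocity` is the (vx, vy, vz) output of the policy. The returned object
    is whatever the client returns, so the caller decides whether to wait on it.
    """
    vx, vy, vz = (float(v) for v in velocity)
    return client.moveByVelocityAsync(
        vx, vy, vz, step_seconds, vehicle_name=vehicle_name
    )
```

Returning the future instead of joining it inside the helper keeps the choice between blocking and fixed-rate stepping in the training loop.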

Hmm, so this might be caused by drone_2 then, as that error message is associated with moveOnSpline.
How are you resetting the episode?
Try using this reset function: #94 (comment)

I tried it, but there was no difference. The only way to prevent this error is to load the environment again, but that makes my training very slow, and it is impossible to train a reinforcement learning model at that speed.

@kavehkamali hmm, so to repro this, I'll call the dummy_reset function in a loop by modifying this script: https://github.com/microsoft/AirSim-NeurIPS2019-Drone-Racing/blob/master/tests/test_reset.py. If you have a better way to help us repro this, let me know.
Also, you're on Linux and using the qualification binaries, I assume?
In general, @yannbouteiller did you ever face this after the reset fixes?

Alright, you just need to sleep a bit after the call to simResetRace() and before the call to simStartRace(). See the updated gist here: https://gist.github.com/madratman/e617b53ec20c5f38a7d10633ba3a42c9
When reset is called, the drones lie on top of each other at the world origin. Then, when simResetRace() is called, the drone meshes are teleported to the center of the cages. Due to gravity, the meshes fall for a fraction of a second before settling at the bottom.
If you call simStartRace before they settle down, I think the spline fitter perhaps gets something weird as the current position (I need to look a bit more to see why exactly this happens), but with a sleep of 0.5 seconds (could be less) between reset and simStartRace, the sim does not crash.
You can see the diff in the gist from the previous comment to this comment here https://gist.github.com/madratman/e617b53ec20c5f38a7d10633ba3a42c9/revisions

I updated the gist https://gist.github.com/madratman/e617b53ec20c5f38a7d10633ba3a42c9 one more time and increased the sleep between simResetRace and simStartRace to 1.0 s. There's no sleep needed between reset and simResetRace, so I removed it. See the diff here: https://gist.github.com/madratman/e617b53ec20c5f38a7d10633ba3a42c9/revisions

With 0.5 s of sleep, I did see the crash happen after some time, but 1 second is proving to be stable for more than half an hour.
Made a little screencast https://www.youtube.com/watch?v=UuCm8Sp3P_U&
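The ordering described in the comments above can be sketched as follows. This is not the contents of the gist; it is a minimal illustration assuming a client that exposes reset(), simResetRace(), and simStartRace(tier). The function name `reset_race`, the `settle_seconds` parameter, and the injectable `sleep` argument are hypothetical, and any re-enabling of API control or re-arming after reset is left out.

```python
import time


def reset_race(client, tier=1, settle_seconds=1.0, sleep=time.sleep):
    """Reset the sim and restart the race, waiting for the drones to settle.

    No pause is placed between reset() and simResetRace(); the only wait is
    between simResetRace() and simStartRace(), as described in the thread.
    """
    client.reset()
    client.simResetRace()
    # Give the teleported drone meshes time to fall and come to rest.
    sleep(settle_seconds)
    client.simStartRace(tier)
```

Passing `sleep` as an argument makes it easy to experiment with shorter waits or to stub the delay out in tests.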

Thank you for the update, I will try this.

Is it working for you now?

Oh is this where those crashes come from?

Sleeping for 1.0 s is super costly, though.

(Edit: oh, we are apparently not talking about the same crashes)

I am running the simulation at a clock speed of 20, so sleeping for 1 second is too much for me.
For now, I have removed the competitor drone completely from the API, and it works fine when I turn off the graphics in tier 1. Later, I will fine-tune for two drones.

@madratman Also, the ip of the competitor drone is hard coded in the API. I had to change the API to be able to set the ip for the competitor drone.

I've updated the airsim linux v0.3 training binaries, the pythonclient to 1.1.1, and the gist.
Now the drones are reset close to the floor, and the reset time is reduced. You can probably go lower than the current 0.1*2 sleeps, but I haven't tried it.
I found that when reset is called before drone_2 has finished taking off (i.e., the .join() of the takeoff call for drone_2 hasn't returned), the simulator crashes.
So, I am now sleeping in the pythonclient 1.1.1 instead.

There seems to be one rare edge case, which seems to happen when reset is called from another thread at the same time when drone_2 is finishing its takeoff and starting the fly_through_all_gates..() call. At that point, I think the sim is freezing.
I saw the sim freeze when reset was called at a race time of 3.997 seconds.

yep, yann, we were talking about different things. I just tagged you to check if you had seen this

Thanks!

Hi @madratman
After switching to AirSim Drone Racing Lab, I still face the same problem as @kavehkamali! When I'm training an agent for RL, the AirSim Unreal Engine crashes unpredictably.
This time, I had already trained for 5588 steps over 979 episodes. The Unreal Engine crashed while the quadrotor was taking an action, not during the reset between two episodes. The details are shown below.

image

@kavehkamali May I ask you how to remove the competitor drone in the tier 1 environment, and how to turn off the graphics in tier 1?
Looking forward to your reply. Thanks.

@changpowei I had a similar issue where the AirSim Unreal Engine kept crashing in the middle of an episode. In my case, I was making frequent updates to moveOnSplineAsync(), and it would randomly crash. I tried the fix suggested here: #94 (def dummy_reset). I placed the reset function at the end of each episode, and now my code runs without crashing AirSim. I just ran it for about 1.5 hrs with no issues.
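One way to structure "reset at the end of each episode" is sketched below. The names `run_episodes`, `run_episode`, and `reset_fn` are hypothetical; `reset_fn` stands in for whatever reset routine you use (for example the dummy_reset from #94), and the try/finally is an added assumption so that the reset also runs when an episode raises.

```python
def run_episodes(num_episodes, run_episode, reset_fn):
    """Run episodes back to back, calling `reset_fn` after every one.

    `run_episode(i)` is assumed to return the episode's total reward.
    """
    returns = []
    for episode in range(num_episodes):
        try:
            returns.append(run_episode(episode))
        finally:
            # Reset even if the episode ended with an exception.
            reset_fn()
    return returns
```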