microsoft/AirSim-NeurIPS2019-Drone-Racing

AirSim randomly freezes

yannbouteiller opened this issue · 18 comments

(updated)

Hello again.

Sorry this never ends, but there is another recurrent bug with AirSim that makes the simulator freeze usually after maybe ~ 10000 - 20000 calls to the control API. I thought this was related to the bug with west when one drone crossed the last gate but apparently it is not.

The problem with this one is that the simulator doesn't crash, the image freezes, the window becomes unresponsive, but subsequent calls to the API seem to work until we call reset(), which hangs until timeout.

Even weirder, if we don't catch the TimeoutError from msgpckrpc (and so the python program exits with an error), then the window unfreezes.

Hmm, can you help us reproduce this?
Is this script enough - https://github.com/microsoft/AirSim-NeurIPS2019-Drone-Racing/blob/master/tests/test_reset.py?
I ran this for 1 hour (https://github.com/microsoft/AirSim-NeurIPS2019-Drone-Racing/blob/master/tests/test_reset.py#L84) last time.

Can you give us a minimal function which replicates the calls you have?

Also, I modded your reset function slightly. simPause(False) is called before simResetRace(). That might be source of this as well - not certain

https://github.com/madratman/AirSim-NeurIPS2019-Drone-Racing/blob/tests/one_drone_end_race_and_reset_called/baselines/repro_crash_one_drone_ends_race_and_reset_called.py#L7

def dummy_reset(baseline_racer, drone_names, tier):
    baseline_racer.airsim_client.simPause(True)
    baseline_racer.airsim_client.reset()
    for drone_name in drone_names:
        baseline_racer.airsim_client.enableApiControl(vehicle_name=drone_name)
        baseline_racer.airsim_client.arm(vehicle_name=drone_name)
    baseline_racer.airsim_client.simPause(False) # unpause sim to simresetrace works as it's supposed to
    baseline_racer.airsim_client.simResetRace()
    baseline_racer.airsim_client.simStartRace(tier=tier)

Also, I modded your reset function slightly. simPause(False) is called before simResetRace(). That might be source of this as well - not certain

No, I have also made this modification in my original code.

I'll try to reproduce this in a minimal example.

Well, this is super difficult to reproduce, I ran a testing algorithm that performed random steps all night long without any issue, it only seems to happen when I am using my training framework, after a somewhat constant number of time steps. I guess I'd better build a workaround for this.

What happens is that the process coresponding to the

AirSim/AirSimExe/Binaries/Linux/AirSimExe AirSimExe -windowed

command keeps eating 108% CPU on my machine, but the window and API become unresponsive and the GPU seems to stop being used. After a while, the pop up appears saying

"AirSimExe" is not responding. You may choose to wait a short while for it to continue of force the application to quit entirely.

Hmm, this sounds like a lot of fun.
Can you look at and share your last unreal engine binary log (not racelog). It should be in $ ls -ltr AirSim/AirSimExe/Saved/Logs. It may have something useful.

Does the CPU usage increase over time, or is it always constant?

I think we'll have to log the linux process stats - cpu, ram, gpu usage et cetera. In general, https://github.com/Syllo/nvtop is useful.

Let me see how we can log all this info. haven't done this before. I'll keep editing this comment as I find something useful/
It would be awesome if python's subprocess module has an api to do this.

Edits:

Well, I have some more info but not very useful I'm afraid. Actually the freeze doesn't happen on reset(), it happens before that: the window freezes but the simulator is responsive, until we call reset() or SimResetRace(), which hangs forever.

I'm using glances to monitor CPU/memory/Disk usage. The GPU (780m) I use on my laptop to crash-test my algorithms is too old for nvidia-smi, so nvtop doesn't work, I use CPU-X instead for monitoring GPU activity.

When the freeze happens, the GPU usage drops to 0, and the CPU usage of the simulator drops from ~ 300% to ~ 150 % when reset() or simResetTrack() are called after this (then the program hangs).

Unreal logs and an example output of psrecord (simulator's PID) are attached. The output of psrecord has weird scale for some reason but the shape of the plot corresponds to what I see in glances: CPU usage drops when hanging on reset().

logs.zip

So, as a workaround for this issue, when this happens I kill and restart the simulator, which apparently works well, I even use reset() again with no issue, until I make a call to the control API. Then I get the following error (f is the future object returned by the control API):

in f.join()
File "/home/yann/anaconda3/lib/python3.7/site-packages/msgpackrpc/future.py", line 22, in join
self._loop.start()
File "/home/yann/anaconda3/lib/python3.7/site-packages/msgpackrpc/loop.py", line 22, in start
self._ioloop.start()
File "/home/yann/anaconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 755, in start
raise RuntimeError("IOLoop is already running")
RuntimeError: IOLoop is already running

Ah, that seems like the previous Python process is zombieing out in the background. Hence, the tornado error saying ioloop is already running. If you can grab it with a script and kill it along with the unreal binary, that should do it.

Hi @madratman , the problem is, I don't kill the python process because it is responsible for holding the environment, so it keeps running. How can I reinitialize the API ?

Calling directly the stop() method of airsim_client.client._loop doesn't work. When I do this before restarting the simulator and creating a new MultirotorClient, reset() will work but subsenquent call to the control API will hold forever instead of throwing the exception.

This might feel icky, but I'd try generating a new multirotor client object. I am not immediately sure if this solves your problem - but you can test it by opening a new python process when the freeze happens, instantiate a multirotorclient object, and see if confirmconnection returns true and you're able to control the drone.
If that works, then you can create a utility function in your script to generate a new airsim client when this freeze happens.

I am looking at the tornado's API as well - close() seems worth trying. see https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop.close
note you should be on tornado v4.5.3
The api doc above might be for newer versions. I see close() in L261 of v4.5.3
edit: I guess it's L714, in the PollIOLoop derived class

I already have a function that instantiates a new MultirotorClient. The problem is that tornado doesn't really allow me to use it several times within the same process, because it keeps a global instance of its IOLoop.

hmm doing some quick digging for the global ioloop you just mentioned:

the following snippet sounds interesting from the tornado's ioloop constructor doc https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop

By default, a newly-constructed IOLoop becomes the thread’s current IOLoop, unless there already is a current IOLoop. This behavior can be controlled with the make_current argument to the IOLoop constructor: if make_current=True, the new IOLoop will always try to become current and it raises an error if there is already a current instance. If make_current=False, the new IOLoop will not try to become current.

https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop

msgpackrpc/loop.py has the wrapper class over tornado's api. which is calling make_current()

    def __init__(self, loop=None):
        self._ioloop = loop or ioloop.IOLoop()
        self._ioloop.make_current()
        self._periodic_callback = None

cc @sytelus any insights?

(btw when I create a new MultirotorClient(), confirmConnection() returns "Connected", but I am not able to control the drone because I get the "RuntimeError: IOLoop is already running" exception when I call join() )

Part of an answer for the workaround seems to be setting the timeout_value of MultirotorClient() to a small number (e.g. 10) and to catch the TimeoutError exception on calls to reset() to decide whether to restart the simulator.

Doing this almost works. After the simulator has been killed and reinstantiated, with a new MultirotorClient, reset() works, arm() returns True (enableApiControl always returns None so I don't know about it), BUT when I call the control API (moveByRollPitchYawrateZAsync().join() in this case), join() hangs until timeout (10s instead 0.1s which would be the value of duration) and no control is applied.

Everything else seems to work, I get the kinematics etc.

I have updated the description of the freeze. I noticed something weird: when the freeze happens in the simulator window, if I kill the python program that launched the simulator and that runs the API, then the window unfreezes.

Okay, I have finally succeeded at writing a minimal example script that reproduces the bug. Apparently, this happens when the simulator is launched from the python program with subprocess.Popen (which I need to do because my environment manages several parallel instances or AirSim):

test_randomcontroller.zip

For me the freeze happens after ~ 30 episodes

I think I found the problem, rather than:

p = subprocess.Popen(EXECUTABLE_PATH + ' ' + OPTIONS, stdout=subprocess.PIPE, shell=True, preexec_fn=os.setpgrp)

I should use:

p = subprocess.Popen(EXECUTABLE_PATH + ' ' + OPTIONS, stdout=None, shell=True, preexec_fn=os.setpgrp)

Seems like I got rid of the freeze like this. I suppose AirSim was printing a ton of garbage that was caught in this subprocess.PIPE which I don't really know what it is, and at some point there was an overflow? I still cannot succeed at restarting the simulator, but hopefully without this freeze I won't need to do this anymore.