Testing in Headless True vs False has a 40% reward difference when using the same trained policy. Reproducible script provided!

Overview

While building a custom robotic simulation tool on top of OIGE, we discovered that a policy tested with headless=False behaves differently than the same policy tested with headless=True. The issue is easy to reproduce on the standard OIGE tasks.
Testing the same trained policy with headless=True and headless=False gives a reward difference of roughly 40% on the Humanoid and Ant tasks (about 38% on Humanoid and 46% on Ant in the results below).

I am attaching a script that runs on the latest commit in main. It trains the Humanoid task with headless=True, tests it with headless=True and headless=False, and should produce the following results:

 == Humanoid Test; headless=True
av reward: 6852.170803435147 av steps: 989.174072265625

 == Humanoid Test; headless=False
av reward: 4273.75024558347 av steps: 984.9992679355784

gist to reproduce this: https://gist.github.com/Demetrio92/c986493cff3b4d791a42412179ec6264

This also happens with Ant. If training is done with headless=False (very slow, but possible), the test scores are entirely different again. See the extra outputs at the bottom of this post.
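As a quick check of the "40%" figure, the relative drop can be computed directly from the average rewards reported in this post. This is only a small arithmetic sketch; the helper name `relative_drop` is made up for illustration.

```python
# Average test rewards copied from the results reported in this post
# (policies trained with headless=True).
rewards = {
    "Humanoid": {"headless=True": 6852.170803435147, "headless=False": 4273.75024558347},
    "Ant": {"headless=True": 7147.375523806955, "headless=False": 3829.089754253626},
}


def relative_drop(baseline: float, other: float) -> float:
    """Fractional reward drop of `other` relative to `baseline` (hypothetical helper)."""
    return (baseline - other) / baseline


drops = {
    task: relative_drop(r["headless=True"], r["headless=False"])
    for task, r in rewards.items()
}

for task, drop in drops.items():
    print(f"{task}: {drop:.1%} lower reward with headless=False")
```

With these numbers the drop is about 37.6% for Humanoid and 46.4% for Ant, so "40%" is a rounded figure between the two.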

Root-Cause Analysis

We traced this behavior to the state of the internal to_render variable in omniisaacgymenvs/envs/vec_env_rlgames.py.
- The logic for when and how it is set is convoluted, but if it is overridden to always be False the results always match headless=True, and if it is overridden to always be True the results always match headless=False.
- The combination headless=True & to_render=True can be tested by enabling cameras via task.sim.enable_cameras=True.
- reproduce.sh from the gist, as well as the outputs at the bottom of this post, show that the results with headless=True & task.sim.enable_cameras=True are identical to those with headless=False.
Unfortunately, from there the issue goes deep into Isaac Sim code via self._world.step(render=to_render), so we stopped investigating at that point.
The issue has been tested on multiple machines with different hardware, using the latest drivers as well as the recommended version 525. Since everything runs in Docker, this probably should not matter much.
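The three configurations compared above can be summarized in a short sketch. It does not reproduce the actual to_render logic in vec_env_rlgames.py; `observed_render_flag` is a hypothetical simplification that only encodes what we observed. The `task=` and `test=True` overrides are assumptions about the training script's command line, while `headless` and `task.sim.enable_cameras` are the settings discussed in this post.

```python
def observed_render_flag(headless: bool, enable_cameras: bool) -> bool:
    """Simplified summary of the observations in this post, NOT the real OIGE
    logic: rendering appears to be active whenever a viewer is open
    (headless=False) or cameras are enabled."""
    return (not headless) or enable_cameras


def hydra_overrides(task: str, headless: bool, enable_cameras: bool) -> list[str]:
    """Build Hydra-style overrides for one test run (override names other than
    headless and task.sim.enable_cameras are assumptions)."""
    overrides = [f"task={task}", "test=True", f"headless={headless}"]
    if enable_cameras:
        overrides.append("task.sim.enable_cameras=True")
    return overrides


# The three configurations compared in this post: (headless, enable_cameras).
configs = [(True, False), (False, False), (True, True)]

for headless, cameras in configs:
    args = " ".join(hydra_overrides("Humanoid", headless, cameras))
    print(f"{args}  ->  rendering active: {observed_render_flag(headless, cameras)}")
```

Under this simplification, the two configurations with rendering active are exactly the two that produce matching, lower rewards.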

Resolution

It would be great if you could confirm the issue, or explain whether this behavior is expected and what the proper way to deal with it is.
Currently, visually inspecting a trained policy seems unreliable because the policy behaves differently when rendered. This is highly undesirable, since visual inspection is vital to debugging RL policies.

Extra Results

Humanoid trained with headless=True

 == Humanoid Test; headless=True
av reward: 6852.170803435147 av steps: 989.174072265625

 == Humanoid Test; headless=False
av reward: 4273.75024558347 av steps: 984.9992679355784

 == Humanoid Test; headless=True enable_cameras=True == 
av reward: 4273.75024558347 av steps: 984.9992679355784

Humanoid trained with headless=False (training takes 1.5h on RTX 3070)

 == Humanoid Test; headless=True
av reward: 4156.822625699561 av steps: 830.9899344569288

 == Humanoid Test; headless=False
av reward: 3556.779703811363 av steps: 966.6001461988304

 == Humanoid Test; headless=True enable_cameras=True == 
av reward: 3556.779703811363 av steps: 966.6001461988304

Ant trained with headless=True

 == Ant Test; headless=True
av reward: 7147.375523806955 av steps: 965.1955620580346

 == Ant Test; headless=False
av reward: 3829.089754253626 av steps: 996.640625

 == Ant Test; headless=True enable_cameras=True == 
av reward: 3829.089754253626 av steps: 996.640625
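For anyone comparing their own runs, the result blocks above can be parsed and compared automatically. The following is a minimal sketch that assumes the output keeps the `== <name>` / `av reward: ... av steps: ...` layout shown in this post; the function name `parse_results` and the regular expressions are illustrative only.

```python
import re

# Sample copied from the "Humanoid trained with headless=True" results above.
log = """\
 == Humanoid Test; headless=True
av reward: 6852.170803435147 av steps: 989.174072265625

 == Humanoid Test; headless=False
av reward: 4273.75024558347 av steps: 984.9992679355784

 == Humanoid Test; headless=True enable_cameras=True ==
av reward: 4273.75024558347 av steps: 984.9992679355784
"""

# Header lines start with "==" and may or may not end with a trailing "==".
HEADER = re.compile(r"^\s*==\s*(?P<name>.+?)\s*(?:==)?\s*$")
RESULT = re.compile(r"av reward:\s*(?P<reward>[\d.]+)\s+av steps:\s*(?P<steps>[\d.]+)")


def parse_results(text: str) -> dict:
    """Map each test header to its (average reward, average steps) pair."""
    results = {}
    name = None
    for line in text.splitlines():
        result = RESULT.search(line)
        if result and name is not None:
            results[name] = (float(result["reward"]), float(result["steps"]))
            continue
        header = HEADER.match(line)
        if header:
            name = header["name"]
    return results


results = parse_results(log)
rendered = results["Humanoid Test; headless=False"]
cameras = results["Humanoid Test; headless=True enable_cameras=True"]
print("headless=False matches enable_cameras=True:", rendered == cameras)
```

An exact float comparison is used deliberately here, because the reported values for headless=False and enable_cameras=True agree to the last printed digit.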

On request we can also provide complete training and testing logs.