Testing in Headless True vs False has a 40% reward difference when using *the same trained policy*. Reproducible script provided!
Overview
While building a custom robotic simulation tool on top of OIGE, we discovered that testing a policy with headless=False gives different results from testing it with headless=True. The issue can be easily reproduced even on standard OIGE tasks: testing the same trained policy with headless=True and with headless=False shows a reward difference of roughly 40% on the Humanoid and Ant tasks (about 38% and 46% respectively in the results below).
I am attaching a script that can be run on the latest commit in main. It trains the Humanoid task with headless=True, tests it with headless=True and headless=False, and should produce the following results:
== Humanoid Test; headless=True
av reward: 6852.170803435147 av steps: 989.174072265625
== Humanoid Test; headless=False
av reward: 4273.75024558347 av steps: 984.9992679355784
gist to reproduce this: https://gist.github.com/Demetrio92/c986493cff3b4d791a42412179ec6264
This also happens with Ant. If training is done with headless=False (very slow, but possible), the test scores are different again. See the extra outputs at the bottom of this post.
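For readers who do not want to open the gist, the three test configurations compared in this issue can be sketched as command-line overrides. This is only an illustration, not the contents of the gist. The entry point scripts/rlgames_train.py, the test=True and checkpoint=... overrides, and the checkpoint path are assumptions about a typical OIGE setup; only task, headless and task.sim.enable_cameras are taken from this report.

```python
from typing import List, Optional


def build_test_command(
    task: str,
    headless: bool,
    enable_cameras: Optional[bool] = None,
    checkpoint: Optional[str] = None,
) -> List[str]:
    """Build the argument list for one test run as key=value overrides."""
    args = [
        "scripts/rlgames_train.py",  # assumed entry point, not confirmed by the issue
        f"task={task}",
        "test=True",                 # assumed override that switches to evaluation
        f"headless={headless}",
    ]
    if checkpoint is not None:
        args.append(f"checkpoint={checkpoint}")
    if enable_cameras is not None:
        # Override named in the issue; forces rendering even when headless.
        args.append(f"task.sim.enable_cameras={enable_cameras}")
    return args


# Hypothetical checkpoint location; adjust to wherever training wrote the model.
CHECKPOINT = "runs/Humanoid/nn/Humanoid.pth"

# The three configurations compared in this report.
TEST_MATRIX = [
    {"headless": True},
    {"headless": False},
    {"headless": True, "enable_cameras": True},
]

if __name__ == "__main__":
    for config in TEST_MATRIX:
        cmd = build_test_command("Humanoid", checkpoint=CHECKPOINT, **config)
        print(" ".join(cmd))
```

Each printed line would be passed to the Isaac Sim Python interpreter. The same checkpoint is used for all three runs, so any difference in reward comes from the rendering settings alone.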
Root-Cause Analysis
- We were able to trace this behavior to the state of the internal to_render variable in omniisaacgymenvs/envs/vec_env_rlgames.py (link).
  - There is some convoluted logic for when and how it is set, but if it is overridden to always be False, the results always match headless=True, and if it is overridden to always be True, the results always match headless=False.
  - The combination headless=True & to_render=True can be tested by enabling cameras via task.sim.enable_cameras=True. reproduce.sh from the gist, as well as the outputs at the bottom of this post, show that the results with headless=True & task.sim.enable_cameras=True are exactly equivalent to headless=False.
- Unfortunately, from there the issue goes deep into isaac-sim code via self._world.step(render=to_render) (link), so we stopped investigating there.
- The issue has been tested on multiple machines with different hardware, using the latest drivers as well as the recommended version 525. Given that everything runs in Docker, this probably should not matter much.
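The override experiment described above can be illustrated with a toy stand-in. This is a sketch under stated assumptions, not the code in vec_env_rlgames.py. FakeWorld, ToyVecEnv, force_render and the simplified rule in _compute_to_render are all hypothetical, and were chosen only to match what we observed: rendering is on when the app is not headless or when cameras are enabled.

```python
from typing import List, Optional


class FakeWorld:
    """Stand-in for the simulator world; records the render flag of every step."""

    def __init__(self) -> None:
        self.render_flags: List[bool] = []

    def step(self, render: bool) -> None:
        self.render_flags.append(render)


class ToyVecEnv:
    """Hypothetical wrapper showing where a forced to_render value would be injected."""

    def __init__(
        self,
        world: FakeWorld,
        headless: bool,
        enable_cameras: bool = False,
        force_render: Optional[bool] = None,
    ) -> None:
        self._world = world
        self._headless = headless
        self._enable_cameras = enable_cameras
        self._force_render = force_render  # None means "use the normal logic"

    def _compute_to_render(self) -> bool:
        # Simplified assumption: the real logic is more involved.
        return (not self._headless) or self._enable_cameras

    def step(self) -> None:
        to_render = self._compute_to_render()
        if self._force_render is not None:
            # Diagnostic override: ignore the computed value entirely.
            to_render = self._force_render
        self._world.step(render=to_render)


if __name__ == "__main__":
    for headless, cameras, forced in [
        (True, False, None),
        (False, False, None),
        (True, True, None),
        (False, False, False),
    ]:
        env = ToyVecEnv(FakeWorld(), headless, cameras, forced)
        env.step()
        print(headless, cameras, forced, env._world.render_flags)
```

In the real environment the same idea amounts to hard-coding the value passed as render= in the step call. If the reward then follows that value rather than the headless setting, the difference comes from the rendering path and not from window creation.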
Resolution
It would be great if you could confirm the issue, or explain whether this behavior is expected and what the proper way to deal with it is.
Currently it seems that visually inspecting a trained policy is unreliable because the policy behaves differently when rendered. This is extremely undesirable, as visual inspection is vital to debugging RL policies.
Extra Results
- Humanoid trained with headless=True
== Humanoid Test; headless=True
av reward: 6852.170803435147 av steps: 989.174072265625
== Humanoid Test; headless=False
av reward: 4273.75024558347 av steps: 984.9992679355784
== Humanoid Test; headless=True enable_cameras=True ==
av reward: 4273.75024558347 av steps: 984.9992679355784
- Humanoid trained with headless=False (training takes 1.5h on RTX 3070)
== Humanoid Test; headless=True
av reward: 4156.822625699561 av steps: 830.9899344569288
== Humanoid Test; headless=False
av reward: 3556.779703811363 av steps: 966.6001461988304
== Humanoid Test; headless=True enable_cameras=True ==
av reward: 3556.779703811363 av steps: 966.6001461988304
- Ant trained with headless=True
== Ant Test; headless=True
av reward: 7147.375523806955 av steps: 965.1955620580346
== Ant Test; headless=False
av reward: 3829.089754253626 av steps: 996.640625
== Ant Test; headless=True enable_cameras=True ==
av reward: 3829.089754253626 av steps: 996.640625
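To make the size of the gap concrete, the relative reward drop can be computed directly from the output above. The following is a minimal sketch. The parsing assumes the two-line "== ... Test; ..." / "av reward: ... av steps: ..." format shown in this post, and the helper names parse_rewards and relative_drop are ours.

```python
import re
from typing import Dict, Tuple

# Results copied from this post (policies trained with headless=True).
LOG = """\
== Humanoid Test; headless=True
av reward: 6852.170803435147 av steps: 989.174072265625
== Humanoid Test; headless=False
av reward: 4273.75024558347 av steps: 984.9992679355784
== Ant Test; headless=True
av reward: 7147.375523806955 av steps: 965.1955620580346
== Ant Test; headless=False
av reward: 3829.089754253626 av steps: 996.640625
"""

HEADER = re.compile(r"^== (\w+) Test; (.+?)(?: ==)?$")
RESULT = re.compile(r"^av reward: ([\d.]+) av steps: ([\d.]+)$")


def parse_rewards(log: str) -> Dict[Tuple[str, str], float]:
    """Map (task, configuration) to the average reward reported under it."""
    rewards: Dict[Tuple[str, str], float] = {}
    key = None
    for raw in log.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            key = (header.group(1), header.group(2))
            continue
        result = RESULT.match(line)
        if result and key is not None:
            rewards[key] = float(result.group(1))
    return rewards


def relative_drop(baseline: float, other: float) -> float:
    """Fraction of the baseline reward that is lost in the other run."""
    return (baseline - other) / baseline


if __name__ == "__main__":
    rewards = parse_rewards(LOG)
    for task in ("Humanoid", "Ant"):
        drop = relative_drop(
            rewards[(task, "headless=True")], rewards[(task, "headless=False")]
        )
        print(f"{task}: {drop:.1%} lower reward with headless=False")
```

With the numbers above this gives a drop of about 37.6% for Humanoid and 46.4% for Ant, which is where the "roughly 40%" figure comes from.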
On request we can also provide complete training and testing logs.