thu-ml/tianshou

Batch: don't just set 0 when elements have None entries

MischaPanch opened this issue · 8 comments

This is essentially a bug due to the current implementation of __setitem__:

image

Setting 0 when something is off is wrong. This can manifest itself in things like info of reset, where certain values might be unknown (like actions or env_num). Then None is erroneously replaced by 0. There are multiple tests covering this erroneous behavior, e.g.,
line 112 in
image

One should instead at least turn this into NaN entries

As far as I can tell, we reach this point in two cases:

  1. Adding an item to the buffer where the new item has an inconsistent structure compared to the already existing structure and
  2. updating an existing item with an item that has an inconsistent structure compared to the already existing structure.

I think the main question here is if these operations should be allowed in the first place. In the RL context this probably only happens in case the info dict changes its structure between steps (as happens in MoveToRightEnv in the tests). One could argue, that the env is ill-defined in this case and instead of setting arbitrary default values, the env should be fixed?

Regarding your suggestion to set values to NaN instead, assigning np.nan will fail on arrays whose dtype is not float (as is the case in the above mentioned test).

@maxhuettenrauch unfortunately, in the RL context this is bound to happen because some things might not be known at reset that are known at step, and thus will contain None entries. The prime example is the action, which is not available at reset, but some entrances from the info might also be missing

This issue is strongly related to #1087

But when would you add obs directly after reset to the buffer (where a Batch object is created) and only after that retrieve an action, call step, and append the rest to this entry?

I thought it might happen in the collectors, but maybe I'm mistaken.

For sure I've seen this happen with info objects somewhere in collector tests - though there it is a bad implementation of the env.

Unfortunately Gymnasium doesn't force any interface on the info dicts which are the main drivers of this problem.We can stop supporting such cases, or ask the use to specify what should happen then explicitly.

I am all for restricting the number of supported operations to decrease complexity and probability of errors :)

Yes, definitely happening in the collector tests, due to said bad env design. I'm gonna check some examples with standard mujoco envs.