polixir/NeoRL

Action space difference between dataset and environment


Hi, our team is training our model with NeoRL, and we found that the action space differs between the dataset and the environment.

When executing the code below:

```python
import numpy as np

import neorl

# `args.traj_num` comes from this script's own argument parser.
env = neorl.make('Citylearn')
low_dataset, _ = env.get_dataset(
    data_type="low",
    train_num=args.traj_num,
    need_val=False,
    use_data_reward=True,
)
action_low = env.action_space.low
action_high = env.action_space.high
print('action_low', action_low)
print('action_high', action_high)
print('dataset action_low', np.min(low_dataset['action'], axis=0))
print('dataset action_high', np.max(low_dataset['action'], axis=0))
```

the output is below, and the action range is clearly different between the dataset and the environment, which confuses us.

```
action_low [-0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334
 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334 -0.33333334
 -0.33333334 -0.33333334]
action_high [0.33333334 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334
 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334 0.33333334
 0.33333334 0.33333334]
dataset action_low [-3.5973904 -4.031006  -3.167992  -3.1832075 -3.4287922 -3.9067357
 -3.4079363 -3.3709202 -3.1863866 -4.1262846 -3.6601577 -4.087899
 -3.8954997 -3.312598 ]
dataset action_high [3.4334774 3.8551078 3.4849963 3.7777936 3.6103873 3.9329555 3.7596557
 3.7149396 4.0387006 3.3615265 3.946596  4.272308  3.4278386 3.3716872]
```

---

Thanks for reporting this! In short, we will RE-COLLECT NEW DATASETS for CityLearn soon.

PS:
We checked the data-collection code and found that the difference comes only from the stochastic actions used during data collection. That is, the actions used in training all lie within the action space, and the mean of the Gaussian policy is bounded by the action space as well. However, when we sampled from the Gaussian policy while collecting data, the sampled action could fall outside the action space, and we simply executed that action in the environment and stored the raw sample.
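
For illustration, here is a minimal sketch of that collection-time behavior. It is hypothetical, not the actual NeoRL collection code; the squashing via `tanh` and the noise scale `std` are assumptions:

```python
import numpy as np

# Hypothetical sketch of the collection-time behavior, not the actual NeoRL code.
rng = np.random.default_rng(0)
act_low, act_high = -1.0 / 3.0, 1.0 / 3.0  # CityLearn's per-dimension bounds

# The Gaussian mean is kept inside the action space (here via tanh squashing)...
mean = act_high * np.tanh(rng.normal(size=14))
std = 0.5  # assumed exploration-noise scale

# ...but a raw sample from N(mean, std) can land outside [-1/3, 1/3];
# the raw sample was executed in the environment and stored in the dataset.
action = rng.normal(mean, std)
print("out of range:", np.any((action < act_low) | (action > act_high)))
```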

We also checked the current datasets and found that, except for CityLearn, only about 2-5% of the actions in the data fall outside the corresponding action space. For CityLearn, however, the ratio is about 30-60%. The action space in CityLearn varies across cities, while the action spaces of the other domains are fixed.
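
This ratio can be reproduced roughly as follows; a sketch, reusing `env` and `low_dataset` from the snippet above:

```python
import numpy as np

# Sketch: fraction of dataset transitions whose action leaves the
# environment's action space.
actions = low_dataset["action"]
outside = np.any(
    (actions < env.action_space.low) | (actions > env.action_space.high),
    axis=1,
)
print(f"out-of-range ratio: {outside.mean():.1%}")
```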

Nevertheless, all the domains accept actions outside the given action space. In CityLearn these actions may achieve larger rewards, whereas in MuJoCo they usually result in lower rewards. We therefore treat the action space as a recommendation rather than a hard constraint. So we decided to keep these actions where the ratio is small, and to collect a newer version of the datasets for CityLearn.

Hi, we have added CityLearn-v1 to NeoRL, where we simply clip each action dimension to [-2/3, 2/3]. Though this is a looser bound than the environment's [-1/3, 1/3], it is a one-line fix for the above issue. We have also re-collected the datasets with a deterministic policy.
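
The fix amounts to something like the following one-liner on the stored actions (a sketch of the clipping, not the exact NeoRL code):

```python
import numpy as np

# Sketch of the CityLearn-v1 fix: clip every action dimension to
# [-2/3, 2/3] before storing it in the dataset.
clipped_actions = np.clip(low_dataset["action"], -2.0 / 3.0, 2.0 / 3.0)
```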

Hope this will help!