PKU-Alignment/Safe-Policy-Optimization

question about env reset

lijie9527 opened this issue · 2 comments

obs, _ = env.reset()
obs = torch.as_tensor(obs, dtype=torch.float32, device=device)
ep_ret, ep_cost, ep_len = (
    np.zeros(args.num_envs),
    np.zeros(args.num_envs),
    np.zeros(args.num_envs),
)
# training loop
for epoch in range(epochs):
    rollout_start_time = time.time()
    # collect samples until we have enough to update
    for steps in range(local_steps_per_epoch):  

Why does your code call env.reset() only once at the beginning, rather than at the start of each epoch?

Gaiejj commented

That's because in safepo/common/env.py we wrap the environment with either SafetyAsyncVectorEnv or the auto-reset wrapper (SafeAutoResetWrapper) from Safety-Gymnasium.

def make_sa_mujoco_env(num_envs: int, env_id: str, seed: int | None = None):
    if num_envs > 1:
        # Some code here
        env = SafetyAsyncVectorEnv(env_fns)
    else:
        # Some code here
        env = SafeAutoResetWrapper(env)
        # Some code here
    return env, obs_space, act_space

If you use a single environment, the AutoReset wrapper resets it at the end of every episode. If you use a vectorized environment, SafetyAsyncVectorEnv resets each sub-environment independently when its episode ends.
Additionally, if your custom environment does not support auto reset, please add the reset manually at the algorithm level.
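
For a custom environment without auto reset, a manual reset in the rollout loop might look roughly like the sketch below. This is only an illustration, not code from SafePO: ToyEnv and collect are hypothetical names, and ToyEnv merely imitates the Safety-Gymnasium step signature (obs, reward, cost, terminated, truncated, info) so the example runs without any dependencies.

```python
class ToyEnv:
    """Hypothetical custom environment WITHOUT auto reset.

    Imitates the Safety-Gymnasium step signature:
    (obs, reward, cost, terminated, truncated, info).
    """

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.t = 0
        self.num_resets = 0

    def reset(self):
        self.t = 0
        self.num_resets += 1
        return 0.0, {}

    def step(self, action):
        if self.t >= self.max_steps:
            raise RuntimeError("episode is over; call reset() first")
        self.t += 1
        obs, reward, cost = float(self.t), 1.0, 0.0
        terminated = False
        truncated = self.t >= self.max_steps  # time limit reached
        return obs, reward, cost, terminated, truncated, {}


def collect(env, epochs: int, steps_per_epoch: int):
    """Roll out the env, resetting manually whenever an episode ends."""
    obs, _ = env.reset()  # single reset before the training loop
    ep_ret, ep_cost, ep_len = 0.0, 0.0, 0
    finished = []  # (return, cost, length) of each completed episode
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            action = 0  # placeholder for the policy's action
            obs, reward, cost, terminated, truncated, _ = env.step(action)
            ep_ret += reward
            ep_cost += cost
            ep_len += 1
            if terminated or truncated:
                finished.append((ep_ret, ep_cost, ep_len))
                # Manual reset: no wrapper does this for us here.
                obs, _ = env.reset()
                ep_ret, ep_cost, ep_len = 0.0, 0.0, 0
    return finished


if __name__ == "__main__":
    env = ToyEnv(max_steps=5)
    episodes = collect(env, epochs=2, steps_per_epoch=10)
    print(len(episodes), env.num_resets)
```

Note that the reset is tied to the end of an episode, not to the start of an epoch, so an episode can continue across an epoch boundary just as it does with the auto-reset wrappers.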

Thank you for the answer, I understand.