openai/procgen

How to implement reset() function for Procgen environments?

hfeniser opened this issue · 13 comments

In my use case, I investigate the behavior of agents under custom action policies (i.e. actions determined by me, not the agent). For each run of an action policy I have to create the environment from scratch, and this takes almost two-thirds of my total runtime.

I am assuming that if the environment's reset() function were implemented, this would be much faster. Is there a specific reason for not implementing it? If not, can you help me implement it?

I measured 177 microseconds for a step of coinrun and 20 milliseconds for re-creating the environment. What kind of speeds are you getting? How often are you resetting the environment?
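
For reference, a minimal timing sketch along those lines (assuming the procgen package is installed, and that the env simply continues into the next episode on its own when done, so no explicit reset is needed in the loop):

import time

import gym

# Time environment creation.
t0 = time.perf_counter()
env = gym.make("procgen:procgen-coinrun-v0")
env.reset()
print("creation: %.1f ms" % ((time.perf_counter() - t0) * 1000))

# Time the average step over a fixed number of random actions.
n = 1000
t0 = time.perf_counter()
for _ in range(n):
    env.step(env.action_space.sample())
print("per step: %.0f us" % ((time.perf_counter() - t0) / n * 1e6))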

After a few optimizations, my line profiler gives me the following result.
[Screenshot: line profiler output, 2020-05-05]

I think my results (shown in microseconds) agree with your measurements. Currently, creating the environment for every run takes one-third of my whole process.
The create_environment() function consists of the following two lines:

from procgen import ProcgenEnv
from baselines.common.vec_env import VecExtractDictObs
env = ProcgenEnv(num_envs=1, env_name=self.env_name, num_levels=self.num_levels, start_level=self.start_level, distribution_mode=self.distribution_mode)
env = VecExtractDictObs(env, "rgb")

And the run() function does nothing but play the game with a custom action policy for some number of steps; then the agent continues playing from that point.

This is a separate point, but I don't understand why I get spurious rewards if I don't create the environment from scratch for each custom action policy. I thought that setting the first observation to the level's initial observation would work.

I don't understand the spurious rewards thing, do you have a short example script that demonstrates the issue?

I figured out the reason for the spurious rewards while I was working on an example. Apparently, the environment keeps a step counter in the background and yields 0 reward when it reaches 500 in the maze game. If I don't create the environment from scratch for each run, it can give me 0 reward in some unexpected situations.
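
That counter is easy to observe directly. A quick sketch (it assumes action 4 is the no-op combo in procgen's action set; any fixed action that keeps the agent away from the goal would do):

import gym

env = gym.make("procgen:procgen-maze-v0", num_levels=1, start_level=137)
env.reset()
for t in range(600):
    obs, rew, done, info = env.step(4)  # assumed no-op action
    if done:
        # The built-in step counter ends the episode with 0 reward.
        print("episode ended at step", t + 1, "with reward", rew)
        break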

I resolved my issue in another way; currently I don't need the reset() function. But I believe it would be good to have it for future reference.

The spurious rewards thing sounds like a bug, is that a bug in procgen?

Below, I explain a scenario. I consider a maze level that can be generated with num_levels=1 and start_level=137. Its original initial state can be seen here.

I create a custom action policy as follows: [5, 5, 8, 7, 6, 6, 7, 7, 3, 7, 3, 5, 1]. This action policy brings the agent from the initial state to a particular state, and the agent then continues playing from there. The corresponding code for running the game is below:

def run(self, prefix):
  obs = initial_obs
  idx = 0
  while True:
    if idx < len(prefix):
      act = prefix[idx]  # follow the custom action policy first
    else:
      act = self.model.step(obs)  # then let the agent take over
    obs, rew, done, info = self.env.step(act)
    idx += 1
    if done: return

Here, initial_obs is obtained right after the environment is created, via env.reset().
But before running the game I call the following function:

def get_last_observation(self, prefix):
  obs = None
  for act in prefix:
    obs, _, _, _ = self.env.step(act)  # advance the same env along the prefix
  return obs.flatten()

I use the returned observation for a separate analysis. Then I run the game, but the agent starts in a position different from initial_obs, as seen here.
As one can see, the initial state in this case corresponds to the state reached by the custom action policy. But note that I make the assignment obs = initial_obs before running the game. Thus, this mismatched initial state is the cause of the spurious rewards.
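
In other words, get_last_observation() advances the one shared environment, so run() no longer starts from the level's true initial state. A sketch of the workaround being used here, assuming create_environment() is the helper quoted earlier, returns the wrapped env, and lives on the same object:

def reset_environment(self):
  # Rebuild the env from scratch so the level is back at its true
  # initial state, instead of wherever get_last_observation() left it.
  self.env = self.create_environment()
  return self.env.reset()

This pays the full ~20 ms construction cost on every run, which is exactly what a real reset() would avoid.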

I ran the following script, but the initial obs always appears to be the same, and so does the last obs:

import gym
import imageio


def get_last_observation(env, prefix):
  obs = None
  for act in prefix:
      obs, _, _, _ = env.step(act)
  return obs

env = gym.make("procgen:procgen-maze-v0", num_levels=1, start_level=137)
initial_obs = env.reset()
imageio.imsave("initial.png", initial_obs)
prefix = [5, 5, 8, 7, 6, 6, 7, 7, 3, 7, 3, 5, 1]
last_obs = get_last_observation(env, prefix)
imageio.imsave("last.png", last_obs)

env = gym.make("procgen:procgen-maze-v0", num_levels=1, start_level=137)
initial_obs = env.reset()
imageio.imsave("initial2.png", initial_obs)
last_obs = get_last_observation(env, prefix)
imageio.imsave("last2.png", last_obs)

Do you have a script that reproduces the issue?

I attached a script.
maze_test.txt

Actually, I think this is neither a bug nor an issue; it is just a demonstration of my need for env.reset(). Previously, for some silly reason, I thought that the assignment on the 25th line would bring the agent back to the original initial state.

Okay, we're unlikely to add a reset() method any time soon, though you are free to add that in your own fork of course. I'll leave this open until the next release, which will have slightly faster environment creation.

According to the check here, it apparently is possible to force the environment to reset by sending the action -1.
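
A minimal sketch of that trick (whether the Python wrappers pass -1 through to that C++ check without validating it against the action space is an assumption; verify against your installed version):

import gym

env = gym.make("procgen:procgen-maze-v0", num_levels=1, start_level=137)
initial_obs = env.reset()
for act in [5, 5, 8]:
    env.step(act)  # wander away from the initial state
obs, rew, done, info = env.step(-1)  # assumed forced reset via the linked check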

Good find. @kcobbe, is there any downside to resetting using that?

Closing this since it looks like there already is a way to reset the environment, and also environment creation should be faster now.