ikostrikov/pytorch-a2c-ppo-acktr-gail

leveraging parallel environments for faster sampling


It seems to me that your implementation is not really leveraging the parallel environments, but I'm not sure. Please correct me if I'm wrong.

My understanding is that one hyperparameter is the number of samples we gather before each agent update. Assuming we know the optimal value of this hyperparameter, using parallel environments should let us gather those samples faster.

If, for example, we need 4096 samples before each agent update, it seems to me that your implementation gathers 4096 * num_processes samples before each update, and I'm not sure that this necessarily boosts learning speed.
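To make the accounting concrete, here is a rough self-contained sketch of what I mean, with made-up numbers (not the repo's defaults):

    # Made-up numbers, just to illustrate the accounting I have in mind.
    num_steps = 4096      # steps collected per environment before an update
    num_processes = 8     # parallel environments

    # What I believe the current loop gathers per update:
    samples_per_update = num_steps * num_processes
    print(samples_per_update)              # 32768 transitions, not 4096

    # What I would have expected: the parallel envs split the same sample
    # budget, i.e. roughly num_steps / num_processes steps each.
    steps_per_env = num_steps // num_processes
    print(steps_per_env * num_processes)   # 4096 transitions in total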

    for step in range(args.num_steps):

One workaround could be, for example, changing the `for step in range(args.num_steps):` to a while loop that checks the total number of samples gathered across all environments.

Nevertheless, I would appreciate hearing your view on this :)

-------edit:
I tried changing the for loop to a while loop like the following:

    step = 0
    while step < args.num_steps:

        # Sample actions for all parallel environments
        with torch.no_grad():
            value, action, action_log_prob, recurrent_hidden_states = actor_critic.act(
                rollouts.obs[step], rollouts.recurrent_hidden_states[step],
                rollouts.masks[step])

        # Step every environment at once and log finished episodes
        obs, reward, done, infos = envs.step(action)
        for info in infos:
            if 'episode' in info.keys():
                episode_rewards.append(info['episode']['r'])

        # Reset masks for environments whose episode just ended
        masks = torch.FloatTensor([[0.0] if done_ else [1.0] for done_ in done])
        bad_masks = torch.FloatTensor([[0.0] if 'bad_transition' in info.keys() else [1.0] for info in infos])
        rollouts.insert(obs, recurrent_hidden_states, action, action_log_prob, value, reward, masks, bad_masks)

        # Count every transition gathered across the parallel environments
        step += 1 * args.num_processes

but the performance wasn't as good as if I had only used one environment.
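My guess (and I may well be misreading the code) is that the while loop now runs only about args.num_steps / args.num_processes times, while the rollout storage and the update still seem to expect args.num_steps fresh transitions per environment, so most of the buffer keeps stale data from the previous rollout. A tiny self-contained illustration of how few iterations the loop above actually performs (numbers made up, this only mimics the loop counting, not the training code):

    # Made-up numbers; only mimics the counting logic of the while loop above.
    num_steps = 4096
    num_processes = 8

    visited = []
    step = 0
    while step < num_steps:
        visited.append(step)          # the `step` indices the loop visits
        step += 1 * num_processes

    print(len(visited))               # 512 iterations instead of 4096,
                                      # i.e. only 512 insertions per environment

So each environment only contributes a fraction of the transitions the buffer is sized for, which might explain the worse performance, but I'd be happy to be corrected on this.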