DLR-RM/stable-baselines3

Performance Check (Discrete actions)

araffin opened this issue · 8 comments

The discrete action counterpart of #48

Associated PR: #110

  • A2C
  • PPO
  • DQN (I'm currently working on that in #28 and it looks good)

Test envs: Atari Games (Pong - easy, Breakout - medium, ...)

Initial results with PPO: it seems to mostly match the performance of SB PPO2, but with some glaring exceptions (see the training runs on six games with slightly different action spaces). It seems that at least a few games should be used for evaluation, because on some games the SB3 version reaches similar performance (e.g. MsPacman, Q*bert), but on others it does not reach the same numbers (e.g. Breakout, Enduro). I still have to double-check that the parameters were right, etc.

atari_ppo_sb.pdf
atari_ppo_sb3.pdf

Are you using the zoo? And if so, which wrapper?
You should be using the dqn branch for SB3 and the zoo.

No Zoo, based on this code. These are copied and modified wrappers from SB. The only thing that changes between the SB and SB3 runs is where the algorithm is imported from; the rest is handled by the other code (and is the same).
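For reference, a minimal SB3-only sketch of a comparable Atari setup using the wrappers bundled with SB3 instead of the copied ones (the game and hyperparameters below are placeholders, not the benchmark settings; depending on the SB3 version, make_atari_env lives in common.env_util or common.cmd_util):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Standard Atari preprocessing (noop reset, frame skip, resize, reward clipping) plus frame stacking
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=8, seed=0)
env = VecFrameStack(env, n_stack=4)

# Placeholder hyperparameters, not the ones used for the runs above
model = PPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=int(1e7))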

m-rph commented

Cross Posting:

Relevant: I am getting some rather weird performance from DQN; it seems to drop to 0 fps (this was with num_threads=1 and the old polyak update). When using an ensemble of 10 estimators I got much better performance, and I can't pinpoint the issue.

[screenshot: training curves]

In the policy, instead of having a single QNetwork, I have n_estimators identical QNetworks and their estimates are averaged.
Note: this was running on GPU and the environment was LunarLander.

n_estimators is a hyper-parameter for a custom version of DQN that uses an ensemble of n_estimators networks, identical (except for the weights) to the QNetwork of DQN.
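Roughly, the ensemble looks like this (a minimal sketch of the idea, not the actual code; QNetwork is a stand-in MLP here and EnsembleQNetwork is just an illustrative name):

import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """A stand-in fully-connected Q-network."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class EnsembleQNetwork(nn.Module):
    """n_estimators structurally identical QNetworks whose Q-value estimates are averaged."""

    def __init__(self, obs_dim: int, n_actions: int, n_estimators: int = 10):
        super().__init__()
        self.estimators = nn.ModuleList(
            QNetwork(obs_dim, n_actions) for _ in range(n_estimators)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Stack per-estimator Q-values (n_estimators, batch, n_actions) and average over the ensemble
        return torch.stack([q(obs) for q in self.estimators]).mean(dim=0)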

This is observed with the latest version of DQN.

Hello all,

I've been an avid SB1 user for over a year. An amazing framework with thorough documentation and an active support community.
New RL developments have propelled RL to new highs, for example async PPO, which can scale 3x and more on the same hardware. In my humble opinion, it may be a good time to start thinking seriously about async. I believe SB3 would greatly benefit from it, making it a strong, viable framework into the future!

@partiallytyped

I will work on DQN next. Could you share what envs/settings you used to get stuck like that with a "standard" setup?

@jarlva

This is on the suggestions list for v1.2, I believe. At the moment we are working on optimizing the performance of even the synchronous variants, and PyTorch is not making things easy with its tendency to use too many threads at the same time, etc. :)
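(As an aside, a common mitigation is to cap PyTorch's intra-op threads explicitly; this is plain PyTorch, not an SB3 setting:)

import torch

# Avoid CPU oversubscription when many envs/processes share the same machine
torch.set_num_threads(1)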

Completely understand, @Miffyli. Would it be helpful to review https://github.com/alex-petrenko/sample-factory?

m-rph commented

@Miffyli

The script that runs the DQN agent:

from stable_baselines3 import DQN
import argparse



if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr","--learning-rate", type=float, default=1e-4, dest="learning_rate")
    parser.add_argument("env", type=str)
    parser.add_argument("--policy", default="MlpPolicy")
    parser.add_argument("--policy-kwargs", type=eval, default={})
    parser.add_argument("--buffer-size", type=int, default=int(1e5))
    parser.add_argument("--learning-starts", type=int, default=5000)
    parser.add_argument("--batch-size", default=32, type=int)
    parser.add_argument("--tau", type=float, default=1.0)
    parser.add_argument("--gamma", default=0.99, type=float)
    parser.add_argument("--train-freq", type=int, default=4)
    parser.add_argument("--gradient-steps", type=int, default=-1)
    parser.add_argument("--n-episodes-rollout", type=int, default=-1)
    parser.add_argument("--target-update-interval", type=int, default=5000)
    parser.add_argument("--exploration-fraction", type=float, default=0.2)
    parser.add_argument("--exploration-initial-eps", type=float, default=1.0)
    parser.add_argument("--exploration-final-eps", type=float, default=0.05)
    # Separate parser for arguments that go to learn() rather than the DQN constructor
    learn = argparse.ArgumentParser()
    learn.add_argument("--n-timesteps", default=int(5e5), type=int, dest="total_timesteps")
    learn.add_argument("--eval-freq", type=int, default=10)
    learn.add_argument("--n-eval-episodes", type=int, default=5)
    # parse_known_args() splits argv: known args build the DQN, leftovers are parsed by the learn() parser
    agent_args, learn_args = parser.parse_known_args()
    learn_args = learn.parse_args(learn_args)
    
    agent = DQN(**agent_args.__dict__, verbose=2, create_eval_env=True, tensorboard_log=f"tb/dqn_{agent_args.env}")
    agent.learn(**learn_args.__dict__)

The command I call the above script with:

python dqn.py "LunarLander-v2" --n-timesteps=50000 --learning-rate 1e-4 --batch-size 128 --buffer-size 50000 --learning-starts 0 --gamma 0.99 --target-update-interval 1000 --train-freq 4 --gradient-steps -1 --exploration-fraction 0.12 --exploration-final-eps 0.05 --policy-kwargs "dict(net_arch=[256, 256])"

The hyperparameters (except the learning rate) are taken from the zoo.