DLR-RM/stable-baselines3

Roadmap to Stable-Baselines3 V1.0

araffin opened this issue · 46 comments

This issue is meant to be updated as the list of changes is not exhaustive

Dear all,

Stable-Baselines3 beta is now out 🎉! This issue is meant to reference what is implemented and what is missing before a first major version.

As mentioned in the README, before v1.0, breaking changes may occur. I would like to encourage contributors (especially the maintainers) to make comments on how to improve the library before v1.0 (and maybe make some internal changes).

I will try to review the features mentioned in hill-a/stable-baselines#576 (and hill-a/stable-baselines#733)
and I will create issues soon to reference what is missing.

What is implemented?

  • basic features (training/saving/loading/predict; see the usage sketch after this list)
  • basic set of algorithms (A2C/PPO/SAC/TD3)
  • basic pre-processing (Box and Discrete observation/action spaces are handled)
  • callback support
  • complete benchmark for the continuous action case
  • basic RL zoo for training/evaluating/plotting (https://github.com/DLR-RM/rl-baselines3-zoo)
  • consistent API
  • basic tests and most type hints
  • continuous integration (I'm in discussion with the organization admins for that)
  • handle more observation/action spaces #4 and #5 (thanks @rolandgvc)
  • tensorboard integration #9 (thanks @rolandgvc)
  • basic documentation and notebooks
  • automatic build of the documentation
  • Vanilla DQN #6 (thanks @Artemis-Skade)
  • Refactor off-policy critics to reduce code duplication #3 (see #78 )
  • DDPG #3
  • do a complete benchmark for the discrete case #49 (thanks @Miffyli !)
  • performance check for continuous actions #48 (results even better than the gSDE paper)
  • get/set parameters for the base class (#138 )
  • clean up type-hints in docs #10 (cumbersome to read)
  • documenting the migration between SB and SB3 #11
  • finish typing some methods #175
  • HER #8 (thanks @megan-klaiber)
  • finish updating and cleaning the docs #166 (help wanted)
  • finish updating the notebooks and the tutorial #7 (I will do that; only the HER notebook is missing)
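
To illustrate the consistent API and the basic train/save/load/predict features listed above, here is a minimal usage sketch (the environment id is just an example):

import gym
from stable_baselines3 import PPO

# Train, save, load and predict share the same API across algorithms
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
model.save("ppo_cartpole")

model = PPO.load("ppo_cartpole")
obs = env.reset()
action, _states = model.predict(obs, deterministic=True)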

What are the new features?

  • much cleaner base code (and no more warnings =D )
  • independent saving/loading/predict for policies
  • State-Dependent Exploration (SDE) for using RL directly on real robots (this is a unique feature; it was the starting point of SB3, and I published a paper on it: https://arxiv.org/abs/2005.05719)
  • proper evaluation (using a separate env) is included in the base class (using EvalCallback; see the sketch after this list)
  • all environments are VecEnv
  • better saving/loading (now can include the replay buffer and the optimizers)
  • any number of critics are allowed for SAC/TD3
  • custom actor/critic net arch for off-policy algos (#113 )
  • QR-DQN in SB3-Contrib
  • Truncated Quantile Critics (TQC) (see #83 ) in SB3-Contrib
  • @Miffyli suggested a "contrib" repo for experimental features (it is here)
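
A minimal sketch of the evaluation and gSDE features mentioned above (the hyperparameter values are placeholders, not recommendations):

import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback

train_env = gym.make("Pendulum-v0")
eval_env = gym.make("Pendulum-v0")

# Periodic evaluation on a separate env, keeping the best model so far
eval_callback = EvalCallback(eval_env, best_model_save_path="./logs/",
                             eval_freq=1000, n_eval_episodes=5)

# use_sde=True enables (generalized) State-Dependent Exploration
model = SAC("MlpPolicy", train_env, use_sde=True, verbose=1)
model.learn(total_timesteps=20_000, callback=eval_callback)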

What is missing?

  • syncing some files with Stable-Baselines to remain consistent (we may be good now, but this needs to be checked)
  • finish code review of existing code #17

Checklist for v1.0 release

  • Update Readme
  • Prepare blog post
  • Update doc: add links to the stable-baselines3 contrib
  • Update docker image to use newer Ubuntu version
  • Populate RL zoo

What is next? (for V1.1+)

Side note: should we change the default start_method to "fork"? (now that we don't have TF anymore)
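
For reference, the start method can already be chosen explicitly, so the question above is only about the default value. A minimal sketch, assuming the make_vec_env helper from stable_baselines3.common.env_util:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# Explicitly request the "fork" start method (POSIX only) instead of the default
vec_env = make_vec_env("CartPole-v1", n_envs=4, vec_env_cls=SubprocVecEnv,
                       vec_env_kwargs=dict(start_method="fork"))
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=10_000)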

m-rph commented

Maybe N-step returns for TD3 (and DDPG) and DQN (and friends)? If it's implemented in the experience replay, then it is likely plug and play for TD3 and DQN; an implementation for SAC probably requires extra effort.

Perhaps at a later time, e.g. v1.1+: Retrace, tree backup, Q(lambda), importance sampling for n-step returns?
If Retrace and friends are planned for later, then they should be taken into consideration when implementing n-step returns.
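
To make the n-step suggestion concrete, here is a rough, hypothetical sketch of computing an n-step bootstrapped return from a slice of stored transitions (names are illustrative, not SB3 API):

def n_step_return(rewards, dones, bootstrap_value, gamma=0.99, n=3):
    """n-step return for the first transition of the slice:
    R = r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1} + gamma^n * Q(s_{t+n}, a_{t+n}),
    truncated at episode end."""
    ret, discount = 0.0, 1.0
    for reward, done in zip(rewards[:n], dones[:n]):
        ret += discount * reward
        if done:  # stop accumulating and bootstrapping at episode termination
            return ret
        discount *= gamma
    return ret + discount * bootstrap_value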

@partiallytyped

Yup, that would be a v1.1 thing, but it is indeed planned. We should probably go over the original SB issues to gather all these suggestions at some point.

m-rph commented

Perhaps a discrete version of SAC for v1.1+?
https://arxiv.org/abs/1910.07207

Edit: I can implement this, and add types to the remaining methods after my finals (early June).
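
For context, the main change in the discrete SAC paper linked above is that the actor outputs a categorical distribution, so expectations over actions can be computed exactly instead of via the reparameterization trick. A rough, hypothetical sketch of the resulting soft state-value computation (not SB3 code):

import torch as th

def soft_state_value(q_values: th.Tensor, log_probs: th.Tensor, ent_coef: float) -> th.Tensor:
    """Discrete-SAC state value: V(s) = E_{a ~ pi}[Q(s, a) - alpha * log pi(a|s)].
    q_values and log_probs have shape (batch_size, n_actions)."""
    probs = log_probs.exp()
    return (probs * (q_values - ent_coef * log_probs)).sum(dim=1)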

I will start working on the additional observation/action spaces this weekend 👍

Will the stable baselines 3 repo/package replace the existing stable baselines one, or will all this eventually be merged into the normal stable baselines repo?

@justinkterry

There are no plans to merge/combine the two repositories. Stable-baselines will continue to exist, and continue to receive bug-fixes and the like for some time before it is archived.

@Miffyli Thank you. Will it remain as "pip3 install stable-baselines," or become something like "pip3 install stable-baselines3"?

@justinkterry

You can already install sb3 with pip3 install stable-baselines3. The original repo will stay as pip3 install stable-baselines.

Minor point but I wonder if we should rename BaseRLModel to BaseRLAlgorithm and BasePolicy to BaseModel, given that BasePolicy is more than just a policy?

Minor point but I wonder if we should rename BaseRLModel to BaseRLAlgorithm and BasePolicy to BaseModel, given that BasePolicy is more than just a policy?

Good point, BaseModel and BaseRLAlgorithm are definitely better names ;)

jdily commented

For visualization, maybe using something like Weights & Biases (https://www.wandb.com/) is an option?
That way there is no need for a TensorFlow dependency.
I can help add functions to do that.

For visualization, maybe using something like Weights & Biases (https://www.wandb.com/) is an option?

Correct me if I'm wrong, but W&B does not work offline, no? This is really important, as you don't want your results to be published when you are doing private work.

This could also be implemented either as a callback (cf. the docs) or as a new output for the logger, but it sounds more like a "contrib" module to me.
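
A rough sketch of the callback option, assuming the wandb package is installed (the callback name and what exactly gets logged are hypothetical):

import wandb
from stable_baselines3.common.callbacks import BaseCallback

class WandbCallback(BaseCallback):
    """Hypothetical callback forwarding training progress to Weights & Biases."""

    def _on_step(self) -> bool:
        # A real integration would forward the values recorded by the SB3 logger;
        # here we only log the number of timesteps as a placeholder.
        wandb.log({"num_timesteps": self.num_timesteps})
        return True

# Usage (after wandb.init(...)):
# model.learn(total_timesteps=10_000, callback=WandbCallback())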

m-rph commented

Perhaps an official shorthand for stable-baselines and stable-baselines3 e.g. sb and sb3?

import stable_baselines3 as sb3

Is it necessary to continue to provide the interface for vectorized environments inside this codebase?
They were contributed upstream back to Gym in this PR. After that PR was merged, packages such as PyBullet (pybullet_envs) started providing vectorized variants of their own environments using the interface from Gym, which should be the same as the one here (for now).

@ManifoldFR Somehow that has eluded my attention. Looks like a good suggestion! Less repeated code is better, as long as it fits the stable-baselines functions too.

@araffin thoughts (you have the most experience with the eval/wrap functions)? I imagine the hardest part would be updating all the wrappers that work on vectorized environments.

I happened onto it by chance because it's not documented anywhere in Gym's docs; the OpenAI people ported it from their own baselines repo with barely any notification of the change to end users.

I was aware of this (I wrote some comments at the time: https://github.com/openai/gym/pull/1513/files#r293899941) but I would argue against it for several reasons:

  • we rely on some specific features (set_attr, get_attr; see the usage sketch below)
  • the OpenAI version is undocumented and we don't know if they are going to break that feature (which is central to SB3) in a future release (I don't want to write a new monkey patch like hill-a/stable-baselines@678f803)
  • we can directly tweak that feature to fit our needs (and don't have to wait for a review and release by OpenAI)

So, in short, I would be in favor only if OpenAI's way of maintaining Gym were more reliable.

PS: thanks @ManifoldFR for bringing that up ;)
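
For reference, a minimal usage sketch of the VecEnv features mentioned above (get_attr/set_attr, plus env_method, which SB3 also provides):

import gym
from stable_baselines3.common.vec_env import DummyVecEnv

vec_env = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(2)])

max_steps = vec_env.get_attr("_max_episode_steps")  # read an attribute from each sub-env
vec_env.set_attr("reward_range", (-10, 10))         # set an attribute on each sub-env
vec_env.env_method("seed", 0)                       # call a method on each sub-env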

m-rph commented

How about vectorized/stacked action noise for TD3/SAC? I am referring to hill-a/stable-baselines#805 . This will be a useful stepping stone for multiprocessing in TD3/SAC.

The motivation behind this addition is that for OU or other noise processes the state needs to persist until the end of the trajectory, so we can't use a single process for multiple parallel episodes, as they have different lengths. Stacked/vectorized processes allow us to keep each process alive for as long as its particular episode runs.

Also, the code is done and has types; I'd just change the *lst in the reset function from a variadic argument to a single argument.

How about vectorized/stacked action noise for TD3/SAC?

Sounds good. If you have tests included, you can open the PR.
It won't be used for a while though, a bit like what we did for dict support inside VecEnv (it was not used until HER was re-implemented).
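
A minimal sketch of the vectorized noise discussed above, using the names that ended up in stable_baselines3.common.noise (exact API details may differ, so treat this as illustrative):

import numpy as np
from stable_baselines3.common.noise import OrnsteinUhlenbeckActionNoise, VectorizedActionNoise

n_envs, action_dim = 4, 2
base_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(action_dim), sigma=0.1 * np.ones(action_dim))

# One independent noise process per parallel environment
vec_noise = VectorizedActionNoise(base_noise, n_envs=n_envs)
noise = vec_noise()            # shape: (n_envs, action_dim)
vec_noise.reset(indices=[1])   # reset only the process of the env whose episode ended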

m-rph commented

For v1.1+
Network Randomization appears to be useful and simple enough to implement. I will implement it for a project I am working on (Obstacle Tower), so if you believe it would be useful to have, I can open an issue specifically for it and discuss it there.

Network Randomization appears to be useful and simple enough to implement.

I would rather favor what we did for pre-training (issue #27): create an example code/notebook and link to it in the documentation.

m-rph commented

What do you think about Noisy Linear layers for v1.0?

Noisy Linear layers

What is that?

m-rph commented

It's from this paper. The tl;dr is that they are linear layers that have parameter noise. This results in better exploration than e-greedy / an entropy bonus. The layers learn a mu and a log-std; then, on the forward pass, they sample the weights from the distribution.

The tl;dr is that they are linear layers that have parameter noise. This results in better exploration than e-greedy / an entropy bonus. The layers learn a mu and a log-std; then, on the forward pass, they sample the weights from the distribution.

I considered that more as an extension to DQN (so v1.1+), but yes, why not (I added it to the roadmap). v1.0 is mainly meant to cover the basic algorithms, fully tested and without extensions.

In fact, from what I remember, this is very close to exploration in parameter space.
For continuous actions, gSDE is already implemented (see https://arxiv.org/abs/2005.05719); however, such exploration is missing for discrete actions.
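
For reference, a rough, hypothetical PyTorch sketch of the idea (the actual NoisyNet paper uses factorized Gaussian noise and a sigma parameterization; this simplified version only shows the learned mean/log-std being sampled on the forward pass):

import torch as th
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Simplified noisy linear layer: weights are sampled from a learned Gaussian."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight_mu = nn.Parameter(th.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.weight_log_std = nn.Parameter(th.full((out_features, in_features), -3.0))
        self.bias_mu = nn.Parameter(th.zeros(out_features))
        self.bias_log_std = nn.Parameter(th.full((out_features,), -3.0))

    def forward(self, x: th.Tensor) -> th.Tensor:
        # Sample weights and biases from N(mu, std^2) on every forward pass
        weight = self.weight_mu + self.weight_log_std.exp() * th.randn_like(self.weight_mu)
        bias = self.bias_mu + self.bias_log_std.exp() * th.randn_like(self.bias_mu)
        return F.linear(x, weight, bias)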

m-rph commented

I agree on 1.1; it will be much easier to incorporate once the code is stable. I have it in internal code for PPO and I will be happy to make a PR when the time comes.

Do you think that algorithms like IMPALA or APEX are worth implementing?

Those could be planned for v1.2+ and onwards, as discussed here. The thing is that the current structure does not really fit well with the asynchronous nature of those algorithms.

Multi-agent support could be a very interesting step forward.
I don't know if it is worth developing brand-new code if the structure may not fit this paradigm (multi-agent in general, not one specific algorithm or another), but it could be appealing for the community, given that there are also few competitors.

Hey, so I've been managing the development of a library called PettingZoo since January. It's basically a multi-agent version of Gym, with a very Gym-like API. It has involved people from UMD, Technical University Berlin, MILA, UT Austin, Google Research, and DeepMind. We even made changes to the ALE to allow playing multi-player Atari games with RL for the first time.

https://github.com/PettingZoo-Team/PettingZoo

Right now it's believed to be mostly in fully working order and is soft-released, and we're aiming for a full release in the next month or so. This is something worth considering when planning multi-agent support in stable baselines.

Additionally, several of the people at UMD and I have done a lot of work with multi-agent RL, and we have been wanting to add support for it to SB3 once it was more production-ready. If there's interest in doing official multi-agent support with SB3, we should open a separate thread to discuss what that should look like.

Hey, so I've been managing the development of a library called PettingZoo since January. [...]

@justinkterry What you are doing is almost spam... You have already posted your project link in 3 different issues.
We appreciate the effort of making a MA version of SB3, but this will be deferred to a separate issue.
In order to keep this thread clean and focused on the 1.0 roadmap, I will delete your comment and mine; feel free to open a new issue to discuss MA support.

I moved the discussion of MA/distributed agents support to #69.
I deleted/hid some comments to re-focus on the 1.0 roadmap.

I have some good news: the performance of all algorithms matches that of Stable-Baselines (SB2), both in discrete and continuous environments 🎉!
See #48 and #49 (with some nice detective work from @Miffyli in #110).
I think I will add a note in the README to show that SB3 is now trustworthy ;)

I could even reproduce the results from the gSDE paper (in fact, they are even better after the bug fixes).

One interesting feature would be to supply different inputs to the policy and value networks when they are separate. This is the principle behind Asymmetric Actor Critic, which is used at OpenAI to make the critic omniscient and get a somewhat better advantage estimate during policy optimization.

One interesting feature would be to supply different inputs to the policy and value networks when they are separate. This is the principle behind Asymmetric Actor Critic, which is used at OpenAI to make the critic omniscient and get a somewhat better advantage estimate during policy optimization.

The scope of this method is unfortunately very narrow (it only applies when you want to do sim2real and have access to the simulator) compared to the amount of changes needed.

@partiallytyped I think you are already aware, but I am also mentioning it here. I found a source-code example for the original SAC-Discrete implementation paper (the one you found).

The author also publicised his code.
https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/master/agents/actor_critic_agents/SAC_Discrete.py

Hope these help,
Sean

I found the original example for the SAC-Discrete implementation plan.
Can the following paper be considered?
https://arxiv.org/abs/1910.07207

We already have an issue for that #157

@araffin I wasn't asking if we could implement it. Given that it has already been decided, as shown on the roadmap, and @partiallytyped volunteered here, I was giving him or the team a resource. #1 (comment)
Actually, I missed the footnote:
need to be discussed, benefit vs DQN+extensions?
I am posting my suggestion on that page now.

The first release candidate is out: https://github.com/DLR-RM/stable-baselines3/releases/tag/v1.0rc0
100+ trained RL models will be published soon: DLR-RM/rl-baselines3-zoo#69

Excellent work, guys! So is there a full example of A2C (or another algorithm) using a Dict observation space? I tried modifying test_dict_env.py but end up with the following error:

Traceback (most recent call last):
  File "/home/eziegenbalg/Documents/goldengoose_research/stable-baselines3/tests/test_dict_env.py", line 125, in <module>
    test_dict_spaces(A2C, True)
  File "/home/eziegenbalg/Documents/goldengoose_research/stable-baselines3/tests/test_dict_env.py", line 119, in test_dict_spaces
    model = model_class("MlpPolicy", env, gamma=0.5, seed=1, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/a2c/a2c.py", line 80, in __init__
    super(A2C, self).__init__(
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 76, in __init__
    super(OnPolicyAlgorithm, self).__init__(
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/base_class.py", line 156, in __init__
    env = self._wrap_env(env, self.verbose, monitor_wrapper)
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/base_class.py", line 209, in _wrap_env
    env = ObsDictWrapper(env)
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/vec_env/obs_dict_wrapper.py", line 45, in __init__
    dimensions = [venv.observation_space.spaces["observation"].n, venv.observation_space.spaces["desired_goal"].n]
KeyError: 'observation'

If you want a more practical example, see this comment, which has code using ViZDoom in it. Other than that, there is the documentation, which has examples of using and modifying things.

@Miffyli, thank you for your quick reply. Am I understanding this correctly: you are wrapping your Doom Dict env in a DummyVecEnv? Why not just pass the Doom env straight to PPO()?

That is used to gather rollout samples from multiple environments at the same time (which leads to more stable training with PPO). SB3 also wraps all environments, even if you only have a single one, into a VecEnv under the hood.

Ok first off... are you a bot? So there was a specific reason you wrapped it in DummyVecEnv and not DummyDictEnv? I'll take a look at the difference between those DummyEnv now. I will try out the effects of wrapping with and without DummyVecEnv/DummyDictEnv with A2C and report back here for future truth seekers.

Ok first off... are you a bot?

Nah, just happen to be around when you are sending messages :)

So there was a specific reason you wrapped it in DummyVecEnv and not DummyDictEnv?

Yes, I see the issue now. These two things are very different: DummyDictEnv is just a testing environment used to test the support for dict observations, while DummyVecEnv is a VecEnv implementation that does not really parallelize anything. There is no connection between the two. You should not use DummyDictEnv (it is only meant for testing); you should use DummyVecEnv (or SubprocVecEnv) when you give environments to SB3 code.
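
To summarize, a minimal sketch of the intended pattern (with a standard Gym env id standing in for the custom environment):

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Wrap the (single) environment in a DummyVecEnv; SB3 would otherwise do this internally
vec_env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=10_000)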