DLR-RM/stable-baselines3

Roadmap to Stable-Baselines3 V1.0

araffin opened this issue · 46 comments

This issue is meant to be updated as the list of changes is not exhaustive

Dear all,

Stable-Baselines3 beta is now out 🎉! This issue is meant to reference what is implemented and what is missing before a first major version.

As mentioned in the README, before v1.0, breaking changes may occur. I would like to encourage contributors (especially the maintainers) to make comments on how to improve the library before v1.0 (and maybe make some internal changes).

I will try to review the features mentioned in hill-a/stable-baselines#576 (and hill-a/stable-baselines#733)
and I will create issues soon to reference what is missing.

What is implemented?

  • basic features (training/saving/loading/predict; see the usage sketch after this list)
  • basic set of algorithms (A2C/PPO/SAC/TD3)
  • basic pre-processing (Box and Discrete observation/action spaces are handled)
  • callback support
  • complete benchmark for the continuous action case
  • basic RL zoo for training/evaluating/plotting (https://github.com/DLR-RM/rl-baselines3-zoo)
  • consistent API
  • basic tests and most type hints
  • continuous integration (I'm in discussion with the organization admins for that)
  • handle more observation/action spaces #4 and #5 (thanks @rolandgvc)
  • tensorboard integration #9 (thanks @rolandgvc)
  • basic documentation and notebooks
  • automatic build of the documentation
  • Vanilla DQN #6 (thanks @Artemis-Skade)
  • Refactor off-policy critics to reduce code duplication #3 (see #78 )
  • DDPG #3
  • do a complete benchmark for the discrete case #49 (thanks @Miffyli !)
  • performance check for continuous actions #48 (results even better than the gSDE paper)
  • get/set parameters for the base class (#138 )
  • clean up type-hints in docs #10 (cumbersome to read)
  • documenting the migration between SB and SB3 #11
  • finish typing some methods #175
  • HER #8 (thanks @megan-klaiber)
  • finish updating and cleaning the docs #166 (help wanted)
  • finish updating the notebooks and the tutorial #7 (I will do that; only the HER notebook is missing)
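
To illustrate the consistent API and the basic train/save/load/predict features listed above, here is a minimal usage sketch (the environment id is just an example):

import gym
from stable_baselines3 import PPO

# Train, save, load and predict share the same API across algorithms
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
model.save("ppo_cartpole")

model = PPO.load("ppo_cartpole")
obs = env.reset()
action, _states = model.predict(obs, deterministic=True)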

What are the new features?

  • much cleaner base code (and no more warnings =D )
  • independent saving/loading/predict for policies
  • State-Dependent Exploration (SDE) for using RL directly on real robots (this is a unique feature; it was the starting point of SB3, and I published a paper on it: https://arxiv.org/abs/2005.05719)
  • proper evaluation (using a separate env) is included in the base class (using EvalCallback; see the sketch after this list)
  • all environments are VecEnv
  • better saving/loading (now can include the replay buffer and the optimizers)
  • any number of critics are allowed for SAC/TD3
  • custom actor/critic net arch for off-policy algos (#113 )
  • QR-DQN in SB3-Contrib
  • Truncated Quantile Critics (TQC) (see #83 ) in SB3-Contrib
  • @Miffyli suggested a "contrib" repo for experimental features (it is here)
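
A minimal sketch of the evaluation and gSDE features mentioned above (the hyperparameter values are placeholders, not recommendations):

import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback

train_env = gym.make("Pendulum-v0")
eval_env = gym.make("Pendulum-v0")

# Periodic evaluation on a separate env, keeping the best model so far
eval_callback = EvalCallback(eval_env, best_model_save_path="./logs/",
                             eval_freq=1000, n_eval_episodes=5)

# use_sde=True enables (generalized) State-Dependent Exploration
model = SAC("MlpPolicy", train_env, use_sde=True, verbose=1)
model.learn(total_timesteps=20_000, callback=eval_callback)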

What is missing?

  • syncing some files with Stable-Baselines to remain consistent (we may be good now, but this needs to be checked)
  • finish code review of existing code #17

Checklist for v1.0 release

  • Update Readme
  • Prepare blog post
  • Update doc: add links to the stable-baselines3 contrib
  • Update docker image to use newer Ubuntu version
  • Populate RL zoo

What is next? (for V1.1+)

Side note: should we change the default start_method to "fork"? (now that we don't have TF anymore)
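
For reference, the start method can already be chosen explicitly, so the question above is only about the default value. A minimal sketch, assuming the make_vec_env helper from stable_baselines3.common.env_util:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# Explicitly request the "fork" start method (POSIX only) instead of the default
vec_env = make_vec_env("CartPole-v1", n_envs=4, vec_env_cls=SubprocVecEnv,
                       vec_env_kwargs=dict(start_method="fork"))
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=10_000)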

m-rph commented

Maybe N-step returns for TD3 (and DDPG) and DQN (and friends)? If it's implemented in the experience replay, then it is likely plug and play for TD3 and DQN; an implementation for SAC probably requires extra effort.

Perhaps at a later time, e.g. v1.1+: Retrace, tree backup, Q(lambda), importance sampling for n-step returns?
If Retrace and friends are planned for later, then they should be taken into consideration when implementing n-step returns.
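
To make the n-step suggestion concrete, here is a rough, hypothetical sketch of computing an n-step bootstrapped return from a slice of stored transitions (names are illustrative, not SB3 API):

def n_step_return(rewards, dones, bootstrap_value, gamma=0.99, n=3):
    """n-step return for the first transition of the slice:
    R = r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1} + gamma^n * Q(s_{t+n}, a_{t+n}),
    truncated at episode end."""
    ret, discount = 0.0, 1.0
    for reward, done in zip(rewards[:n], dones[:n]):
        ret += discount * reward
        if done:  # stop accumulating and bootstrapping at episode termination
            return ret
        discount *= gamma
    return ret + discount * bootstrap_value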

@partiallytyped

Yup, that would be a v1.1 thing, but it is indeed planned. We should probably go over the original SB issues to gather all these suggestions at some point.

m-rph commented

Perhaps a discrete version of SAC for v1.1+?
https://arxiv.org/abs/1910.07207

Edit: I can implement this, and add types to the remaining methods after my finals (early June).
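
For context, the main change in the discrete SAC paper linked above is that the actor outputs a categorical distribution, so expectations over actions can be computed exactly instead of via the reparameterization trick. A rough, hypothetical sketch of the resulting soft state-value computation (not SB3 code):

import torch as th

def soft_state_value(q_values: th.Tensor, log_probs: th.Tensor, ent_coef: float) -> th.Tensor:
    """Discrete-SAC state value: V(s) = E_{a ~ pi}[Q(s, a) - alpha * log pi(a|s)].
    q_values and log_probs have shape (batch_size, n_actions)."""
    probs = log_probs.exp()
    return (probs * (q_values - ent_coef * log_probs)).sum(dim=1)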

I will start working on the additional observation/action spaces this weekend 👍

Will the stable baselines 3 repo/package replace the existing stable baselines one, or will all this eventually be merged into the normal stable baselines repo?

@justinkterry

There are no plans to merge/combine the two repositories. Stable-baselines will continue to exist, and continue to receive bug-fixes and the like for some time before it is archived.

@Miffyli Thank you. Will it remain as "pip3 install stable-baselines," or become something like "pip3 install stable-baselines3"?

@justinkterry

You can already install sb3 with pip3 install stable-baselines3. The original repo will stay as pip3 install stable-baselines.

Minor point but I wonder if we should rename BaseRLModel to BaseRLAlgorithm and BasePolicy to BaseModel, given that BasePolicy is more than just a policy?

Minor point but I wonder if we should rename BaseRLModel to BaseRLAlgorithm and BasePolicy to BaseModel, given that BasePolicy is more than just a policy?

Good point, BaseModel and BaseRLAlgorithm are definitely better names ;)

jdily commented

For visualization, maybe using something like Weights & Biases (https://www.wandb.com/) is an option?
That way there is no need for a TensorFlow dependency.
I can help add functions to do that.

For visualization, maybe using something like Weights & Biases (https://www.wandb.com/) is an option?

Correct me if I'm wrong, but W&B does not work offline, no? This is really important, as you don't want your results to be published when you are doing private work.

This could also be implemented either as a callback (cf. the docs) or as a new output for the logger, but it sounds more like a "contrib" module to me.
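
A rough sketch of the callback option, assuming the wandb package is installed (the callback name and what exactly gets logged are hypothetical):

import wandb
from stable_baselines3.common.callbacks import BaseCallback

class WandbCallback(BaseCallback):
    """Hypothetical callback forwarding training progress to Weights & Biases."""

    def _on_step(self) -> bool:
        # A real integration would forward the values recorded by the SB3 logger;
        # here we only log the number of timesteps as a placeholder.
        wandb.log({"num_timesteps": self.num_timesteps})
        return True

# Usage (after wandb.init(...)):
# model.learn(total_timesteps=10_000, callback=WandbCallback())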

m-rph commented

Perhaps an official shorthand for stable-baselines and stable-baselines3 e.g. sb and sb3?

import stable_baselines3 as sb3

Is it necessary to continue to provide the interface for vectorized environments inside this codebase?
They were contributed upstream back to Gym in this PR. After that PR was merged, packages such as PyBullet (pybullet_envs) started providing vectorized variants of their own environments using the interface from Gym, which should be the same as the one here (for now).

@ManifoldFR Somehow that has eluded my attention. Looks like a good suggestion! Less repeated code is better, as long as it fits the stable-baselines functions too.

@araffin thoughts (you have the most experience with the eval/wrap functions)? I imagine the hardest part would be updating all the wrappers that work on vectorized environments.

I happened onto it by chance because it's not documented anywhere in Gym's docs; the OpenAI people ported it from their own baselines repo with barely any notification of the change to end users.

I was aware of this (I wrote some comments at the time: https://github.com/openai/gym/pull/1513/files#r293899941) but I would argue against it for several reasons:

  • we rely on some specific features (set_attr, get_attr; see the usage sketch below)
  • the OpenAI version is undocumented and we don't know if they are going to break that feature (which is central to SB3) in a future release (I don't want to write a new monkey patch like hill-a/stable-baselines@678f803)
  • we can directly tweak that feature to fit our needs (and don't have to wait for a review and release by OpenAI)

So, in short, I would be in favor only if OpenAI's way of maintaining Gym were more reliable.

PS: thanks @ManifoldFR for bringing that up ;)
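
For reference, a minimal usage sketch of the VecEnv features mentioned above (get_attr/set_attr, plus env_method, which SB3 also provides):

import gym
from stable_baselines3.common.vec_env import DummyVecEnv

vec_env = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(2)])

max_steps = vec_env.get_attr("_max_episode_steps")  # read an attribute from each sub-env
vec_env.set_attr("reward_range", (-10, 10))         # set an attribute on each sub-env
vec_env.env_method("seed", 0)                       # call a method on each sub-env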

m-rph commented

How about vectorized/stacked action noise for TD3/SAC? I am referring to hill-a/stable-baselines#805 . This will be a useful stepping stone for multiprocessing in TD3/SAC.

The motivation behind this addition is that for OU or other noise processes the state needs to persist until the end of the trajectory, so we can't use a single process for multiple parallel episodes, as they have different lengths. Stacked/vectorized processes allow us to keep each process alive for as long as its particular episode runs.

Also, the code is done and has types; I'd just change the *lst in the reset function from a variadic argument to a single argument.

How about vectorized/stacked action noise for TD3/SAC?

Sounds good. If you have tests included, you can open the PR.
It won't be used for a while though, a bit like what we did for dict support inside VecEnv (it was not used until HER was re-implemented).
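
A minimal sketch of the vectorized noise discussed above, using the names that ended up in stable_baselines3.common.noise (exact API details may differ, so treat this as illustrative):

import numpy as np
from stable_baselines3.common.noise import OrnsteinUhlenbeckActionNoise, VectorizedActionNoise

n_envs, action_dim = 4, 2
base_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(action_dim), sigma=0.1 * np.ones(action_dim))

# One independent noise process per parallel environment
vec_noise = VectorizedActionNoise(base_noise, n_envs=n_envs)
noise = vec_noise()            # shape: (n_envs, action_dim)
vec_noise.reset(indices=[1])   # reset only the process of the env whose episode ended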

m-rph commented

For v1.1+
Network Randomization appears to be useful and simple enough to implement. I will implement it for a project I am working on (Obstacle Tower), so if you believe it would be useful to have, I can open an issue specifically for it and discuss it there.

Network Randomization appears to be useful and simple enough to implement.

I would rather favor what we did for pre-training (issue #27): create an example code/notebook and link to it in the documentation.

m-rph commented

What do you think about Noisy Linear layers for v1.0?

Noisy Linear layers

What is that?

m-rph commented

It's from this paper. The tl;dr is that they are linear layers that have parameter noise. This results in better exploration than e-greedy / an entropy bonus. The layers learn a mu and a log-std; then, on the forward pass, they sample the weights from the distribution.

The tl;dr is that they are linear layers that have parameter noise. This results in better exploration than e-greedy / an entropy bonus. The layers learn a mu and a log-std; then, on the forward pass, they sample the weights from the distribution.

I considered that more as an extension to DQN (so v1.1+), but yes, why not (I added it to the roadmap). v1.0 is mainly meant to cover the basic algorithms, fully tested and without extensions.

In fact, from what I remember, this is very close to exploration in parameter space.
For continuous actions, gSDE is already implemented (see https://arxiv.org/abs/2005.05719); however, such exploration is missing for discrete actions.
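
For reference, a rough, hypothetical PyTorch sketch of the idea (the actual NoisyNet paper uses factorized Gaussian noise and a sigma parameterization; this simplified version only shows the learned mean/log-std being sampled on the forward pass):

import torch as th
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Simplified noisy linear layer: weights are sampled from a learned Gaussian."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight_mu = nn.Parameter(th.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.weight_log_std = nn.Parameter(th.full((out_features, in_features), -3.0))
        self.bias_mu = nn.Parameter(th.zeros(out_features))
        self.bias_log_std = nn.Parameter(th.full((out_features,), -3.0))

    def forward(self, x: th.Tensor) -> th.Tensor:
        # Sample weights and biases from N(mu, std^2) on every forward pass
        weight = self.weight_mu + self.weight_log_std.exp() * th.randn_like(self.weight_mu)
        bias = self.bias_mu + self.bias_log_std.exp() * th.randn_like(self.bias_mu)
        return F.linear(x, weight, bias)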

m-rph commented

I agree on 1.1; it will be much easier to incorporate once the code is stable. I have it in internal code for PPO and I will be happy to make a PR when the time comes.

Do you think that algorithms like IMPALA or APEX are worth implementing?

Those could be planned for v1.2+ and onwards, as discussed here. The thing is that the current structure does not really fit well with the asynchronous nature of those algorithms.

Multi-agent support could be a very interesting step forward.
I don't know if it is worth developing brand-new code if the structure may not fit this paradigm (multi-agent in general, not one specific algorithm or another), but it could be appealing for the community, given that there are also few competitors.

Hey, so I've been managing the development of a library called PettingZoo since January. It's basically a multi-agent version of Gym, with a very Gym-like API. It has involved people from UMD, Technical University Berlin, MILA, UT Austin, Google Research, and DeepMind. We even made changes to the ALE to allow playing multi-player Atari games with RL for the first time.

https://github.com/PettingZoo-Team/PettingZoo

Right now it's believed to be mostly in fully working order and is soft-released, and we're aiming for a full release in the next month or so. This is something worth considering when planning multi-agent support in stable baselines.

Additionally, several of the people at UMD and I have done a lot of work with multi-agent RL, and we have been wanting to add support for it to SB3 once it was more production-ready. If there's interest in doing official multi-agent support with SB3, we should open a separate thread to discuss what that should look like.

Hey, so I've been managing the development of a library called PettingZoo since January. [...]

@justinkterry What you are doing is almost spam... You have already posted your project link in 3 different issues.
We appreciate the effort of making a MA version of SB3, but this will be deferred to a separate issue.
In order to keep this thread clean and focused on the 1.0 roadmap, I will delete your comment and mine; feel free to open a new issue to discuss MA support.

I moved the discussion of MA/distributed agents support to #69.
I deleted/hid some comments to re-focus on the 1.0 roadmap.

I have some good news: the performance of all algorithms matches that of Stable-Baselines (SB2), both in discrete and continuous environments 🎉!
See #48 and #49 (with some nice detective work from @Miffyli in #110).
I think I will add a note in the README to show that SB3 is now trustworthy ;)

I could even reproduce the results from the gSDE paper (in fact, they are even better after the bug fixes).

One interesting feature would be to supply different inputs to the policy and value networks when they are separate. This is the principle behind Asymmetric Actor Critic, which is used at OpenAI to make the critic omniscient and get a somewhat better advantage estimate during policy optimization.

One interesting feature would be to supply different inputs to the policy and value networks when they are separate. This is the principle behind Asymmetric Actor Critic, which is used at OpenAI to make the critic omniscient and get a somewhat better advantage estimate during policy optimization.

The scope of this method is unfortunately very narrow (it only applies when you want to do sim2real and have access to the simulator) compared to the amount of changes needed.

@partiallytyped I think you are already aware, but I am also mentioning it here. I found a source-code example for the original SAC-Discrete implementation paper (the one you found).

The author also publicised his code.
https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/master/agents/actor_critic_agents/SAC_Discrete.py

Hope these help,
Sean

I found the original example for the SAC-Discrete implementation plan.
Can the following paper be considered?
https://arxiv.org/abs/1910.07207

We already have an issue for that #157

@araffin I wasn't asking if we could implement it. Given that it has already been decided, as shown on the roadmap, and @partiallytyped volunteered here, I was giving him or the team a resource. #1 (comment)
Actually, I missed the footnote:
need to be discussed, benefit vs DQN+extensions?
I am posting my suggestion on that page now.

The first release candidate is out: https://github.com/DLR-RM/stable-baselines3/releases/tag/v1.0rc0
100+ trained RL models will be published soon: DLR-RM/rl-baselines3-zoo#69

Excellent work, guys! So is there a full example of A2C (or another algorithm) using a Dict observation space? I tried modifying test_dict_env.py but end up with the following error:

Traceback (most recent call last):
  File "/home/eziegenbalg/Documents/goldengoose_research/stable-baselines3/tests/test_dict_env.py", line 125, in <module>
    test_dict_spaces(A2C, True)
  File "/home/eziegenbalg/Documents/goldengoose_research/stable-baselines3/tests/test_dict_env.py", line 119, in test_dict_spaces
    model = model_class("MlpPolicy", env, gamma=0.5, seed=1, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/a2c/a2c.py", line 80, in __init__
    super(A2C, self).__init__(
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 76, in __init__
    super(OnPolicyAlgorithm, self).__init__(
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/base_class.py", line 156, in __init__
    env = self._wrap_env(env, self.verbose, monitor_wrapper)
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/base_class.py", line 209, in _wrap_env
    env = ObsDictWrapper(env)
  File "/usr/local/lib/python3.9/site-packages/stable_baselines3/common/vec_env/obs_dict_wrapper.py", line 45, in __init__
    dimensions = [venv.observation_space.spaces["observation"].n, venv.observation_space.spaces["desired_goal"].n]
KeyError: 'observation'

If you want a more practical example, see this comment, which has code using ViZDoom in it. Other than that, there is the documentation, which has examples of using and modifying things.

@Miffyli, thank you for your quick reply. Am I understanding this correctly: you are wrapping your Doom Dict env in a DummyVecEnv? Why not just pass the Doom env straight to PPO()?

That is used to gather rollout samples from multiple environments at the same time (which leads to more stable training with PPO). SB3 also wraps all environments, even if you only have a single one, into a VecEnv under the hood.

Ok first off... are you a bot? So there was a specific reason you wrapped it in DummyVecEnv and not DummyDictEnv? I'll take a look at the difference between those DummyEnv now. I will try out the effects of wrapping with and without DummyVecEnv/DummyDictEnv with A2C and report back here for future truth seekers.

Ok first off... are you a bot?

Nah, just happen to be around when you are sending messages :)

So there was a specific reason you wrapped it in DummyVecEnv and not DummyDictEnv?

Yes, I see the issue now. These two things are very different: DummyDictEnv is just a testing environment used to test the support for dict observations, while DummyVecEnv is a VecEnv implementation that does not really parallelize anything. There is no connection between the two. You should not use DummyDictEnv (it is only meant for testing); you should use DummyVecEnv (or SubprocVecEnv) when you give environments to SB3 code.
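
To summarize, a minimal sketch of the intended pattern (with a standard Gym env id standing in for the custom environment):

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Wrap the (single) environment in a DummyVecEnv; SB3 would otherwise do this internally
vec_env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=10_000)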