Performance check (Continuous Actions)
araffin opened this issue · 18 comments
Check that the algorithms reach expected performance.
This was already done prior to v0.5 for the gSDE paper, but since we have made big changes, it is good to check again.
SB2 vs SB3 (TensorFlow Stable-Baselines vs PyTorch Stable-Baselines3)
- A2C (6 seeds)
a2c.pdf
a2c_ant.pdf
a2c_half.pdf
a2c_hopper.pdf
a2c_walker.pdf
- PPO (6 seeds)
ppo.pdf
ant_ppo.pdf
half_ppo.pdf
hopper_ppo.pdf
ppo_walker.pdf
- SAC (3 seeds)
sac.pdf
sac_ant.pdf
sac_half.pdf
sac_hopper.pdf
sac_walker.pdf
- TD3 (3 seeds)
td3.pdf
td3_ant.pdf
td3_half.pdf
td3_hopper.pdf
td3_walker.pdf
See https://paperswithcode.com/paper/generalized-state-dependent-exploration-for for the score that should be reached in 1M (off-policy) or 2M steps (on-policy).
Test envs: PyBullet envs
Tested with version 0.8.0 (feat/perf-check branch in the two zoos)
SB3 commit hash: cceffd5
rl-zoo commit hash: 99f7dd0321c5beea1a0d775ad6bc043d41f3e2db
Environments | A2C (SB2) | A2C (SB3) | PPO (SB2) | PPO (SB3) | SAC (SB2) | SAC (SB3) | TD3 (SB2) | TD3 (SB3)
---|---|---|---|---|---|---|---|---
HalfCheetah | 1859 +/- 161 | 2003 +/- 54 | 2186 +/- 260 | 1976 +/- 479 | 2833 +/- 21 | 2757 +/- 53 | 2530 +/- 141 | 2774 +/- 35
Ant | 2155 +/- 237 | 2286 +/- 72 | 2383 +/- 284 | 2364 +/- 120 | 3349 +/- 60 | 3146 +/- 35 | 3368 +/- 125 | 3305 +/- 43
Hopper | 1457 +/- 75 | 1627 +/- 158 | 1166 +/- 287 | 1567 +/- 339 | 2391 +/- 238 | 2422 +/- 168 | 2542 +/- 79 | 2429 +/- 126
Walker2D | 689 +/- 59 | 577 +/- 65 | 1117 +/- 121 | 1230 +/- 147 | 2202 +/- 45 | 2184 +/- 54 | 1686 +/- 584 | 2063 +/- 185
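Each table entry is the mean +/- standard deviation of the final evaluation return across seeds. As a minimal illustrative sketch of how such an entry is produced (the helper and the sample returns below are made up, not taken from these runs):

```python
from statistics import fmean, pstdev

def score_entry(final_returns):
    """Format per-seed final returns as 'mean +/- std' (population std),
    matching the style of the tables in this thread."""
    return f"{fmean(final_returns):.0f} +/- {pstdev(final_returns):.0f}"

# Three hypothetical seeds of one algorithm on one env:
print(score_entry([2800, 2840, 2760]))  # -> 2800 +/- 33
```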
Generalized State-Dependent Exploration (gSDE)
- gSDE paper: https://arxiv.org/abs/2005.05719
- on policy (2M steps, 6 seeds):
gsde_onpolicy.pdf
gsde_onpolicy_ant.pdf
gsde_onpolicy_half.pdf
gsde_onpolicy_hopper.pdf
gsde_onpolicy_walker.pdf
- off-policy (1M steps, 3 seeds):
gsde_off_policy.pdf
gsde_offpolicy_ant.pdf
gsde_offpolicy_half.pdf
gsde_offpolicy_hopper.pdf
gsde_offpolicy_walker.pdf
SB3 commit hash: b948b7f
rl-zoo commit hash: b56c1470c9a958c196f60e814de893050e2469f0
Environments | A2C (Gaussian) | A2C (gSDE) | PPO (Gaussian) | PPO (gSDE) | SAC (Gaussian) | SAC (gSDE) | TD3 (Gaussian) | TD3 (gSDE)
---|---|---|---|---|---|---|---|---
HalfCheetah | 2003 +/- 54 | 2032 +/- 122 | 1976 +/- 479 | 2826 +/- 45 | 2757 +/- 53 | 2984 +/- 202 | 2774 +/- 35 | 2592 +/- 84
Ant | 2286 +/- 72 | 2443 +/- 89 | 2364 +/- 120 | 2782 +/- 76 | 3146 +/- 35 | 3102 +/- 37 | 3305 +/- 43 | 3345 +/- 39
Hopper | 1627 +/- 158 | 1561 +/- 220 | 1567 +/- 339 | 2512 +/- 21 | 2422 +/- 168 | 2262 +/- 1 | 2429 +/- 126 | 2515 +/- 67
Walker2D | 577 +/- 65 | 839 +/- 56 | 1230 +/- 147 | 2019 +/- 64 | 2184 +/- 54 | 2136 +/- 67 | 2063 +/- 185 | 1814 +/- 395
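For intuition on what gSDE changes: the exploration noise is a linear function of the state, and the Gaussian weight matrix that produces it is only resampled every few steps, so exploration is smooth rather than white noise. A toy stdlib sketch of that idea (heavily simplified; the real implementation works on policy features and learns the noise std):

```python
import random

class GSDESketch:
    """Toy version of generalized state-dependent exploration (gSDE):
    noise = W @ state, with the Gaussian weight matrix W resampled
    only every `sample_freq` steps."""

    def __init__(self, state_dim, action_dim, std=0.5, sample_freq=4, seed=0):
        self.rng = random.Random(seed)
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.std = std
        self.sample_freq = sample_freq
        self.steps = 0

    def noise(self, state):
        if self.steps % self.sample_freq == 0:
            # Resample the exploration weights (kept fixed in between)
            self.W = [
                [self.rng.gauss(0.0, self.std) for _ in range(self.state_dim)]
                for _ in range(self.action_dim)
            ]
        self.steps += 1
        return [sum(w * s for w, s in zip(row, state)) for row in self.W]
```

Within one resample window the noise is a deterministic function of the state, which is what makes the resulting trajectories smoother than with independent Gaussian noise at every step.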
DDPG
Using TD3 hyperparameters as a base, with some minor adjustments (lr, batch_size) for stability.
6 seeds, 1M steps.
Environments | DDPG (Gaussian)
---|---
HalfCheetah | 2272 +/- 69
Ant | 1651 +/- 407
Hopper | 1201 +/- 211
Walker2D | 882 +/- 186
@Miffyli All algorithms match SB2 performance except A2C (I ran it with the PyTorch RMSprop, just wanted to check), which is a good sign after #110 (table created using the zoo):
Environments | A2C (SB2) | A2C (SB3) | PPO (SB2) | PPO (SB3) | SAC (SB2) | SAC (SB3) | TD3 (SB2) | TD3 (SB3)
---|---|---|---|---|---|---|---|---
HalfCheetah | 1859 +/- 161 | 1825 +/- 119 | 2186 +/- 260 | 1976 +/- 479 | 2833 +/- 21 | 2757 +/- 53 | 2530 +/- 141 | 2774 +/- 35
Ant | 2155 +/- 237 | 1760 +/- 190 | 2383 +/- 284 | 2364 +/- 120 | 3349 +/- 60 | 3146 +/- 35 | 3368 +/- 125 | 3305 +/- 43
Hopper | 1457 +/- 75 | 1348 +/- 134 | 1166 +/- 287 | 1567 +/- 339 | 2391 +/- 238 | 2422 +/- 168 | 2542 +/- 79 | 2429 +/- 126
Walker2D | 689 +/- 59 | 505 +/- 63 | 1117 +/- 121 | 1230 +/- 147 | 2202 +/- 45 | 2184 +/- 54 | 1686 +/- 584 | 2063 +/- 185
Current A2C graphs (trained and plotted using the zoo; these are the deterministic evaluations):
a2c.pdf
a2c_ant.pdf
a2c_half.pdf
a2c_hopper.pdf
a2c_walker.pdf
I will run A2C with the TensorFlow RMSprop and also do some runs of all algorithms with gSDE enabled (to check that they match the paper).
As expected, changing the optimizer to the TF RMSprop closes the gap.
Will now try to replicate gSDE results.
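For context on the optimizer swap above: the two RMSprop implementations differ in two details. TensorFlow initializes the squared-gradient accumulator to ones and applies epsilon inside the square root, while PyTorch starts from zeros and applies it outside, which makes the very first updates much larger. SB3 provides a TF-like variant (`RMSpropTFLike` in `stable_baselines3.common.sb2_compat.rmsprop_tf_like`) that can be passed through `policy_kwargs`. A scalar stdlib sketch of the update-rule difference (no momentum or centering, for simplicity):

```python
import math

def rmsprop_step(grad, v, lr=7e-4, alpha=0.99, eps=1e-5, tf_style=True):
    """One RMSprop update for a single scalar parameter.

    v is the moving average of squared gradients. The two conventions:
      TF style:      v starts at 1.0, eps is added inside the sqrt
      PyTorch style: v starts at 0.0, eps is added outside the sqrt
    Returns (step applied to the parameter, updated v).
    """
    v = alpha * v + (1.0 - alpha) * grad * grad
    if tf_style:
        step = lr * grad / math.sqrt(v + eps)
    else:
        step = lr * grad / (math.sqrt(v) + eps)
    return step, v
```

With the same first gradient, the PyTorch-style step is roughly 10x larger here; early-training differences of that size are plausibly enough to change A2C results.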
Results for gSDE are replicated (in fact, after solving the PPO issue in #110, they are even better; I will need to update the paper).
Closing this issue.
@araffin Thanks for paying so much attention to performance checking. It makes us feel more confident about using SB3.
I assume that for DDPG, TD3 and SAC, you are using the default parameters given in the documentation / paper. Just to give an example, in SB3 doc, for DDPG and TD3, the learning rates are 1e-3 and the batch sizes are 100; for SAC, the learning rate is 3e-4 and the batch size is 256. Personally, I find these hyper-parameter differences unjustified from an algorithmic standpoint. Although I'm aware that this is an effort to match the respective original publications, these are pretty similar algorithms.
Is it possible to test them with shared hyper-parameters? Also, just to double check, where can I find the hyper-parameters you used for this exact replication (gSDE paper or the current SB3 doc)?
Thanks again!
I assume that for DDPG, TD3 and SAC, you are using the default parameters given in the documentation / paper.
Actually, slightly different ones, as I'm training on PyBullet envs (different from the MuJoCo ones used in the paper).
You have instructions in the doc ;) I'm using the RL zoo: https://github.com/DLR-RM/rl-baselines3-zoo.
Instructions: https://stable-baselines3.readthedocs.io/en/master/modules/sac.html#how-to-replicate-the-results
Personally, I find these hyper-parameter differences unjustified from an algorithmic standpoint. Although I'm aware that this is an effort to match the respective original publications, these are pretty similar algorithms.
You are completely right. In fact, the original code of TD3 now shares SAC hyperparams (https://github.com/sfujim/TD3).
And you can easily do that in the zoo.
Is it possible to test them with shared hyper-parameters? Also, just to double check, where can I find the hyper-parameters you used for this exact replication (gSDE paper or the current SB3 doc)?
Yes, you can (but you need to deactivate gSDE for SAC, as gSDE for TD3 is no longer supported).
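One way to sketch such a shared setup (the values below are illustrative, close to the SAC defaults mentioned earlier; the model construction is shown in comments since it needs an installed SB3 and an environment):

```python
# Shared off-policy hyperparameters (illustrative values, close to SAC defaults)
shared_kwargs = dict(
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256,
    tau=0.005,
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
)

# With stable_baselines3 installed, both algorithms accept these keywords:
# from stable_baselines3 import SAC, TD3
# sac = SAC("MlpPolicy", env, use_sde=False, **shared_kwargs)
# td3 = TD3("MlpPolicy", env, **shared_kwargs)
```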
Also, just to double check, where can I find the hyper-parameters you used for this exact replication (gSDE paper or the current SB3 doc)?
In the RL Zoo. You can even check the learning curves from the saved logs: https://github.com/DLR-RM/rl-trained-agents
I am amazed at how well documented SB3 is. Thank you so much for the link to the instructions; I should have noticed them myself while reading the doc.
@araffin Just to double-check, I see that you used a linear decay learning rate scheduler for HopperBulletEnv-v0 and Walker2DBulletEnv-v0?
I see that you used a linear decay learning rate scheduler for HopperBulletEnv-v0 and Walker2DBulletEnv-v0?
Yes (also mentioned in the gSDE paper); they are important to stabilize the training.
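A linear-decay schedule in SB3 is just a callable mapping the remaining training progress (1 at the start, 0 at the end) to a learning rate. A minimal version (the initial value in the usage comment is illustrative):

```python
def linear_schedule(initial_value):
    """Return a schedule: progress_remaining in [1, 0] -> learning rate."""
    def schedule(progress_remaining):
        # progress_remaining == 1.0 at the start of training, 0.0 at the end
        return progress_remaining * initial_value
    return schedule

# e.g. PPO("MlpPolicy", env, learning_rate=linear_schedule(2.5e-4))
```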
For DDPG, you wrote that you are
Using the same hyperparameters as TD3.
Does this mean that, for running DDPG for this performance check, you didn't use the hyper-params specified here:
https://github.com/DLR-RM/rl-trained-agents/tree/master/ddpg
but instead the ones specified here:
https://github.com/DLR-RM/rl-trained-agents/tree/master/td3
?
Does this mean that, for running DDPG for this performance check, you didn't use the hyper-params specified here:
They are the same ...
see https://github.com/DLR-RM/rl-trained-agents/blob/master/ddpg/HalfCheetahBulletEnv-v0_1/HalfCheetahBulletEnv-v0/config.yml and https://github.com/DLR-RM/rl-trained-agents/blob/master/td3/HalfCheetahBulletEnv-v0_1/HalfCheetahBulletEnv-v0/config.yml
Also in https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/ddpg.yml
They are not the same for Ant, for example:
Oh, true. I probably did some minor adjustments (reduced lr) to improve stability. I will update my comment.
(I did not spend much time on DDPG as we treat it as a special case of TD3, cf code)
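Conceptually, TD3 adds three things on top of DDPG: twin critics, delayed policy updates, and target-policy smoothing; switching all three off recovers DDPG. A sketch of that reduction as plain settings (the key names below mirror TD3's knobs but are illustrative, not a specific API):

```python
# TD3's three additions over DDPG, expressed as knobs (illustrative names)
TD3_FEATURES = {"n_critics": 2, "policy_delay": 2, "target_policy_noise": 0.2}

def as_ddpg(td3_features):
    """Switch off the TD3 additions to recover vanilla DDPG."""
    ddpg = dict(td3_features)
    ddpg.update(n_critics=1, policy_delay=1, target_policy_noise=0.0)
    return ddpg
```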
Hi there,
Are the SAC actor and critic loss plots also available somewhere? I'd like to compare mine with the "official" SB3 loss plots.
I'm using a custom environment for robotic grasping and for some reason my critic loss increases after a while. My reward starts to decrease as well and I believe this is connected to the critic loss going up.
I tried various hyper-params already (gradient_steps, train_freq, learning_rate, batch_size) but saw the same behavior throughout. I am keeping ent_coef = 0.01; the auto adjustment only led to a larger drop in reward, unfortunately.
Maybe a LR scheduler helps?
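For context on the ent_coef = "auto" setting mentioned above: SAC's automatic entropy tuning does gradient descent on log(alpha) so that the policy entropy tracks a target (by default -dim(action_space)); a fixed ent_coef = 0.01 skips that adaptation entirely. A scalar sketch of one such update (assumption: plain SGD on log(alpha), no optimizer state):

```python
def auto_ent_coef_step(log_alpha, log_prob, target_entropy, lr=3e-4):
    """One gradient step on log(alpha) for SAC-style automatic entropy tuning.

    Loss: -log_alpha * (log_prob + target_entropy). Alpha grows when the
    policy entropy (-log_prob) falls below the target, and shrinks when
    the entropy is above it.
    """
    grad = -(log_prob + target_entropy)  # d(loss)/d(log_alpha)
    return log_alpha - lr * grad
```

This is why "auto" can behave very differently from a small fixed coefficient: if the entropy target does not suit the task, alpha keeps being pushed in one direction and the reward can suffer.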
@axkoenig I believe these are (still) the correct instructions and settings to replicate things: https://stable-baselines3.readthedocs.io/en/master/modules/sac.html#how-to-replicate-the-results (@araffin please correct me if I am wrong).
OK, thanks. I was just wondering whether they were posted somewhere so that I don't need to train the model myself until the end.
@axkoenig I believe these are (still) the correct instructions and settings to replicate things: https://stable-baselines3.readthedocs.io/en/master/modules/sac.html#how-to-replicate-the-results (@araffin please correct me if I am wrong).
OK, thanks. I was just wondering whether they were posted somewhere so that I don't need to train the model myself until the end.
Yes, the RL Zoo is the place to go to replicate results. I saved the training/evaluation reward and the trained agent, but not the rest of the metrics (although you can easily reproduce the run normally).
Your issue is probably related to https://discord.com/channels/765294874832273419/767403892446593055/866702257499668492