Well done! But is it working?

Question

lsqshr opened this issue 9 years ago · 9 comments

Hi,

I was looking for such a repo to understand how to implement ddpg. Thanks for sharing.

I tried Reacher-v1, but it does not seem to converge. So is this repo currently working, or is it still under construction?

Also, have you considered using Keras to make things cleaner?

Cheers!

rmst commented 9 years ago

Thanks

Answer 1 · 2016-06-16T07:25:07.000Z

Hi, thanks!

Yes, there is a bug in ddpg. I'm currently investigating. Another problem with the mujoco envs is that they are not normalized (e.g. in Reacher the dimensions representing the velocities have a 20x higher variance than the other dimensions). Batch normalization would alleviate this but it's not implemented yet either. So the repo is still under construction but I'm super happy to get feedback!
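For illustration, observation normalization of the kind mentioned above could be sketched as follows. This is not code from the repo: the `RunningNormalizer` class and the toy samples are hypothetical, and it uses a running per-dimension mean and variance (Welford's update) rather than batch normalization.

```python
import math


class RunningNormalizer:
    """Tracks a per-dimension mean and variance with Welford's online update."""

    def __init__(self, dim, eps=1e-8):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, obs):
        """Fold one observation into the running statistics."""
        self.n += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Return the observation shifted and scaled to roughly unit variance."""
        if self.n < 2:
            return list(obs)  # not enough data to estimate a variance yet
        return [
            (x - self.mean[i]) / math.sqrt(self.m2[i] / (self.n - 1) + self.eps)
            for i, x in enumerate(obs)
        ]


# Made-up observations: the second dimension has a much larger spread.
norm = RunningNormalizer(dim=2)
samples = [[0.1, 2.0], [-0.1, -2.0], [0.2, 4.0], [-0.2, -4.0]]
for s in samples:
    norm.update(s)
normalized = [norm.normalize(s) for s in samples]
```

After normalization both dimensions end up on the same scale, so no single group of inputs (such as the velocities) dominates the network's first layer.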

I haven't worked with Keras yet but when I looked into the docs it didn't seem obvious to me how to optimize the policy parameters with respect to the Q-network. In TF this is pretty straightforward because of automatic differentiation.
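To make "optimizing the policy parameters with respect to the Q-network" concrete, here is a toy sketch of the chain rule behind the DDPG actor update: the gradient of Q with respect to the action, multiplied by the gradient of the action with respect to the policy parameter. Everything in it is an assumption for illustration: the critic is a fixed, hand-written function rather than a learned network, the policy has a single scalar parameter, and the derivatives are written out by hand where an automatic-differentiation framework would compute them.

```python
def q_value(s, a):
    """Toy critic (assumed known and fixed): highest when a == 2 * s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    """Derivative of the toy critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def policy(theta, s):
    """Deterministic linear policy with one parameter."""
    return theta * s


def dpolicy_dtheta(theta, s):
    """Derivative of the policy output with respect to theta."""
    return s


def actor_gradient(theta, states):
    """Average of dQ/da * da/dtheta over a batch of states."""
    total = 0.0
    for s in states:
        a = policy(theta, s)
        total += dq_da(s, a) * dpolicy_dtheta(theta, s)
    return total / len(states)


theta = 0.0
states = [-1.0, -0.5, 0.5, 1.0]
for _ in range(200):
    # Gradient ascent: move theta in the direction that increases Q.
    theta += 0.1 * actor_gradient(theta, states)
```

On this toy problem `theta` moves toward 2, the value that maximizes the critic for every state. With automatic differentiation, the two hand-written derivative functions are what the framework supplies.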

Answer 2 · 2016-06-16T07:59:03.000Z

It is great to know the potential problem here. If it is the batch normalisation, then you should definitely try Keras, where it only takes one line. I made an example with Keras for a discrete vanilla Q-network that might give you some hints. Looking forward to the working version.

Answer 3 · 2016-06-16T09:32:53.000Z

Yes, the DQN algorithm is probably easy to implement in Keras because you only have one NN. But in DDPG you also have a NN for the policy, which is trained in a nonstandard way (via policy gradients). How would you implement that in Keras?

Answer 4 · 2016-06-16T10:11:51.000Z

You got me dude. I'm looking into it.

Answer 5 · 2016-06-30T22:08:59.000Z

Hey, just wanted you to know that DDPG is now converging (https://gym.openai.com/evaluations/eval_jMAmHzFQQnSeclUQ55mU5Q) on Reacher-v1. The main problem was the reward/return scaling. In order for DDPG to work, the returns have to have a certain magnitude. That is simply a problem of the algorithm. However, DeepMind just released a new paper, PopArt (http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/popart.pdf), that addresses this issue. Any news regarding Keras?
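As a rough illustration of reward/return scaling: the discounted return is linear in the rewards, so multiplying every reward by a constant multiplies the returns (and hence the critic's regression targets) by the same constant. The sketch below assumes a fixed, hand-picked scale; the reward values and `reward_scale` are made up, and the thread does not say which scale the repo ended up using. PopArt instead adapts the normalization during training.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards, accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def scale_rewards(rewards, scale):
    """Multiply each reward by a constant factor."""
    return [scale * r for r in rewards]


rewards = [-1.0, -0.8, -0.5, -0.2]  # made-up episode rewards
reward_scale = 0.1  # hypothetical value, would need tuning per environment

raw = discounted_return(rewards)
scaled = discounted_return(scale_rewards(rewards, reward_scale))
# scaled equals reward_scale * raw up to floating-point error
```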

Answer 6 · 2016-06-30T23:56:56.000Z

It's great you made it work, dude. I think J. Schulman used Keras in his modular_rl (https://github.com/joschu/modular_rl), though it uses the Theano backend. Also, there is a working version of DDPG in rllab (https://github.com/rllab/rllab); they used a similar NN wrapper called Lasagne. It may be worth a look at them for improvement.

Best!

Answer 7 · 2016-08-17T14:55:13.000Z

Update: keras-rl (https://github.com/matthiasplappert/keras-rl) might be interesting for you.

Answer 8 · 2016-08-18T07:40:13.000Z

Wonderful job mate!