Convergence issue
fangthu opened this issue · 10 comments
Hi, recently I used your template to learn some simple maneuvers.
But I find that the output always converges to -1 or +1 once the number of episodes is large enough, when the output bounds are [-1, 1].
Have you ever run into this situation, or do you know how to solve it?
Best
I haven't had a chance to look into this issue, but an initial suggestion is to add a learning rate schedule with tf.train.exponential_decay for both networks. Also, set a target loss or average reward, and stop training the networks once you hit it rather than continuing to update the weights. If the networks have learned well enough before you have trained them for X number of episodes, stopping early is recommended to prevent the weights from sliding into a worse local minimum.
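In case it's useful, here's a rough sketch of what I mean. The initial learning rates, decay steps and decay rate below are just placeholder values for illustration, not something tuned for this repo:

```python
import tensorflow as tf

# Shared step counter, incremented by one of the minimize() calls below
global_step = tf.Variable(0, trainable=False, name="global_step")

# Exponentially decayed learning rates for the actor and critic networks
actor_lr = tf.train.exponential_decay(
    learning_rate=1e-4, global_step=global_step,
    decay_steps=10000, decay_rate=0.96, staircase=True)
critic_lr = tf.train.exponential_decay(
    learning_rate=1e-3, global_step=global_step,
    decay_steps=10000, decay_rate=0.96, staircase=True)

actor_optimizer = tf.train.AdamOptimizer(actor_lr)
critic_optimizer = tf.train.AdamOptimizer(critic_lr)

# Pass global_step to one of the minimize() calls so the decay actually advances, e.g.
# critic_train_op = critic_optimizer.minimize(critic_loss, global_step=global_step)
```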
I think with the Adam optimiser you don't need learning rate decay. I did email David Silver about this, and he said it's usually possible to solve Pendulum with bang-bang control, so if it's stabilising and achieving the desired reward, maybe converging to -1 or +1 is okay.
Ah, right. Yeah, this implementation is pretty simple, so it works for a task like Pendulum. More tricks and tuning would definitely be needed for a more complex problem.
Hi, may I ask you a question: how can I plot a chart like the one you posted using TensorBoard?
Hi @GoingMyWay, it's a bit tricky, but you can create a histogram for TensorBoard from a Numpy array. I used a custom function which does this (a bit hacky, but I haven't found a better way yet): https://github.com/Anjum48/rl-examples/blob/master/dppg/ddpg.py#L204
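In case the link goes stale, the general idea (a sketch of the same trick, not a copy of that function) is to fill a tf.HistogramProto from the Numpy array and hand it to a tf.summary.FileWriter you already have:

```python
import numpy as np
import tensorflow as tf

def log_histogram(writer, tag, values, step, bins=1000):
    """Write a Numpy array to TensorBoard as a histogram summary."""
    values = np.asarray(values)
    counts, bin_edges = np.histogram(values, bins=bins)

    hist = tf.HistogramProto()
    hist.min = float(np.min(values))
    hist.max = float(np.max(values))
    hist.num = int(values.size)
    hist.sum = float(np.sum(values))
    hist.sum_squares = float(np.sum(values ** 2))

    # TensorBoard expects the right-hand bin edges, so drop the first edge
    for edge in bin_edges[1:]:
        hist.bucket_limit.append(float(edge))
    for count in counts:
        hist.bucket.append(int(count))

    summary = tf.Summary(value=[tf.Summary.Value(tag=tag, histo=hist)])
    writer.add_summary(summary, step)
    writer.flush()
```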
Hope this helps!
Thank you. BTW, how many episodes does it take to train Pendulum-v0? I trained it for 10k episodes, but it still doesn't converge.
I found that my implementation of DDPG (which is pretty similar to how @pemami4911 did it) converges after 100-200 episodes (FYI, I can't get it to learn this fast with other algorithms, e.g. A3C or PPO).
In my experience, DDPG is very sensitive to how the OU noise is added to the actions, so I added an exponential decay like this:

epsilon = np.exp(-i / TAU2)  # i is the current episode index
a += epsilon * exploration_noise.noise() / env.action_space.high

with TAU2 = 25 (this should be tuned per environment). An interesting area of research which I still need to try is adding noise to the network parameters rather than the actions (see https://github.com/openai/baselines/tree/master/baselines/ddpg).
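For completeness, here's a rough, self-contained sketch of the kind of exploration_noise object the snippet above assumes: a textbook Ornstein-Uhlenbeck process with the usual DDPG defaults for theta and sigma (not necessarily the exact class from my repo):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: dx = theta * (mu - x) + sigma * N(0, 1)."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta = theta
        self.sigma = sigma
        self.state = np.copy(self.mu)

    def reset(self):
        self.state = np.copy(self.mu)

    def noise(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(len(self.state))
        self.state += dx
        return self.state


TAU2 = 25
exploration_noise = OUNoise(action_dim=1)

for i in range(5):  # a few pretend episodes, just to show the decaying noise scale
    exploration_noise.reset()
    epsilon = np.exp(-i / TAU2)
    print(i, epsilon, epsilon * exploration_noise.noise())
```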
Thank you, I will run your code. Here are the results from pemami4911's code (charts of avg max Q and episode reward); from those curves I couldn't tell whether it converges or not.
@GoingMyWay I suspect that it is converging, but because the noise term is still being added to the actions (i.e. it hasn't decayed to zero after learning), the actions are too noisy to produce a smooth-looking reward curve. For example, the Pendulum might be nicely balanced in the upright position, but the random noise added to the actions will knock it off balance, hence the poor scores.
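One way to check this (just a hedged sketch; actor.predict and the Gym calls here stand in for whatever your implementation uses) is to run a few evaluation episodes with the exploration noise switched off, so the reward curve reflects the deterministic policy rather than the noise:

```python
def evaluate(env, actor, episodes=10):
    """Average return of the deterministic policy, with no OU noise added."""
    scores = []
    for _ in range(episodes):
        state = env.reset()
        done, total = False, 0.0
        while not done:
            action = actor.predict(state)  # no exploration noise here
            state, reward, done, _ = env.step(action)
            total += reward
        scores.append(total)
    return sum(scores) / len(scores)
```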