baturaysaglam/RIS-MISO-Deep-Reinforcement-Learning

Some questions about rewards in the training process

Closed this issue · 2 comments

Greetings. Recently, I ran a simulation with your code. I used the default values for all the parameters without changing anything. I found that, at the end of the first episode, the reward can reach 11 or higher.
[screenshot: rewards of episode 1]
However, at the end of the second episode, the reward was stuck at 2 or less.
[screenshot: rewards of episode 2]
What's more, the performance of subsequent episodes is much lower than that of the first episode. I don't know why this happens. Shouldn't the performance of each episode increase as the number of training episodes increases? Why is the performance best in the first episode?

Reinforcement learning algorithms are prone to sudden divergence (a divergence seemingly out of nowhere) when they combine:

  1. Function approximation (e.g., the use of neural networks)
  2. Off-policy learning (e.g., sampling from an experience replay buffer)
  3. Bootstrapping (which conventional Q-learning is based on)

This issue is also known as the deadly triad problem and has been widely studied in the RL literature. So, what you are encountering is that the DDPG agent suffers from the deadly triad and cannot improve any more once it has diverged. The sketch below shows where the three ingredients meet in the critic update.
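To make this concrete, here is a minimal, hypothetical sketch (not the repository's actual training code) of a DDPG-style critic update in PyTorch, annotated with where the three ingredients appear. The network sizes and the random batch are placeholders, and the terminal-state mask is omitted for brevity.

```python
# Minimal sketch (illustrative, not the repository's code) of a DDPG-style
# critic update, showing where the three deadly-triad ingredients meet.
import torch
import torch.nn as nn

state_dim, action_dim, batch_size = 8, 2, 128

# 1) Function approximation: critic and actor are neural networks.
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_critic.load_state_dict(critic.state_dict())
target_actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

# 2) Off-policy learning: the batch comes from a replay buffer filled by older
#    versions of the policy (random tensors stand in for a sampled batch here).
state = torch.randn(batch_size, state_dim)
action = torch.randn(batch_size, action_dim)
reward = torch.randn(batch_size, 1)
next_state = torch.randn(batch_size, state_dim)

# 3) Bootstrapping: the TD target uses the target critic's own estimate of the
#    next state-action value instead of a full Monte Carlo return.
with torch.no_grad():
    next_action = target_actor(next_state)
    td_target = reward + gamma * target_critic(torch.cat([next_state, next_action], dim=1))

critic_loss = nn.functional.mse_loss(critic(torch.cat([state, action], dim=1)), td_target)
optimizer.zero_grad()
critic_loss.backward()
optimizer.step()
```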

What you can do is simply consider a single episode: determine the step at which the DDPG agent obtains good reward values, and from then on terminate training once the agent reaches that number of training steps, as in the sketch below.
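A minimal sketch of that early-stopping idea, using hypothetical `DummyEnv`/`DummyAgent` stand-ins (the real code would use the repository's environment and DDPG agent) and a made-up step budget read off the first episode's reward curve:

```python
# Illustrative sketch of capping training at a fixed step budget; the agent and
# environment interfaces here are hypothetical stand-ins, not the repository's API.
import numpy as np

class DummyEnv:
    """Placeholder environment with the usual reset/step interface."""
    def reset(self):
        return np.zeros(8)
    def step(self, action):
        return np.zeros(8), float(np.random.rand()), False  # next_state, reward, done

class DummyAgent:
    """Placeholder agent; in practice this would be the DDPG agent."""
    def select_action(self, state):
        return np.zeros(2)
    def train_step(self, *transition):
        pass

env, agent = DummyEnv(), DummyAgent()

# Hypothetical cap: the step index at which the first episode's reward
# has already reached a good value, before divergence sets in.
max_training_steps = 10_000

state = env.reset()
for _ in range(max_training_steps):
    action = agent.select_action(state)
    next_state, reward, done = env.step(action)
    agent.train_step(state, action, reward, next_state, done)
    state = env.reset() if done else next_state
# Stop here: training beyond this point risks the deadly-triad divergence.
```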

Training it further invites the deadly triad issue. However, more advanced off-policy algorithms such as Soft Actor-Critic (SAC) are less prone to the deadly triad, which I verified empirically in my paper. You may also want to take a look at my paper's repository, where you can find an implementation of SAC.

OK. Thanks for your help.