Some questions about rewards in the training process
Greetings. I recently ran a simulation with your code, using the default values for all of the parameters without changing anything. I found that, at the end of the first episode, the reward can reach 11 or higher, but at the end of the second episode the reward was stuck at 2 or less.
What's more, the performance of subsequent episodes is much lower than that of the first episode, and I don't know why. Shouldn't the performance of each episode increase as the number of training episodes increases? Why is the performance best in the first episode?
Reinforcement learning algorithms are prone to sudden divergence (a divergence that seems to come out of nowhere) when they combine:
- Function approximation (e.g., the use of neural networks)
- Off-policy learning (e.g., sampling from the experience replay buffer)
- Bootstrapping (which conventional Q-learning is based on)
This issue is also known as the deadly triad and has been widely studied in the RL literature. So what you are seeing is that the DDPG agent suffers from the deadly triad and cannot improve anymore once it has diverged.
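To make the three ingredients concrete, here is a minimal, self-contained sketch of a single Q-learning-style update that combines all of them: a neural-network critic, a minibatch drawn from a replay buffer, and a bootstrapped target. It is not code from this repo; the network sizes, buffer contents, and hyperparameters are made-up placeholders purely for illustration.

```python
import random
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99  # hypothetical sizes, for illustration only

# (1) Function approximation: a small neural-network Q-function
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Dummy replay buffer of random transitions (state, action, reward, next_state, done)
replay_buffer = [(torch.randn(STATE_DIM), random.randrange(N_ACTIONS),
                  random.random(), torch.randn(STATE_DIM), False)
                 for _ in range(1000)]

# (2) Off-policy learning: sample a minibatch of old transitions from the buffer
batch = random.sample(replay_buffer, 32)
states = torch.stack([t[0] for t in batch])
actions = torch.tensor([t[1] for t in batch])
rewards = torch.tensor([t[2] for t in batch], dtype=torch.float32)
next_states = torch.stack([t[3] for t in batch])
dones = torch.tensor([t[4] for t in batch], dtype=torch.float32)

with torch.no_grad():
    # (3) Bootstrapping: the target is built from the network's own next-state estimate
    target = rewards + GAMMA * (1 - dones) * q_net(next_states).max(dim=1).values

q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_pred, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

When all three pieces are present at once, the errors in the bootstrapped targets can feed back into the function approximator and blow up, which is exactly the divergence you observed after the first episode.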
What you can do is simply consider a single episode: determine the training step at which the DDPG agent obtains good reward values, and from then on terminate training once the agent reaches that number of steps (a rough sketch of this early-stopping idea is shown below).
Training it further is prone to the deadly triad issue. However, more advanced off-policy algorithms such as Soft Actor-Critic (SAC) are less prone to it, which I verified empirically in my paper. You may also want to take a look at my paper's repo, where you can find an implementation of SAC.
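For what it's worth, a minimal sketch of the early-stopping idea might look like the following. It assumes a classic Gym-style `env` (`reset`/`step`) and an `agent` object with `select_action`/`update`/`save` methods; those names, the step budget, and the reward target are all placeholders, not this repo's actual API, so adapt them to your DDPG implementation.

```python
def train_with_early_stop(env, agent, max_train_steps=20_000, reward_target=11.0):
    """Train for at most max_train_steps, keeping the best-performing snapshot.

    max_train_steps and reward_target are placeholders: set them from the step
    count and episode reward you observed in your own good first run.
    """
    total_steps, best_reward = 0, float("-inf")

    while total_steps < max_train_steps:
        state, episode_reward, done = env.reset(), 0.0, False
        while not done and total_steps < max_train_steps:
            action = agent.select_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.update(state, action, reward, next_state, done)  # one DDPG update
            state = next_state
            episode_reward += reward
            total_steps += 1

        if episode_reward > best_reward:
            best_reward = episode_reward
            agent.save("best_agent.pt")  # keep the best snapshot seen so far

        if episode_reward >= reward_target:
            break  # good enough; stop before further training can diverge

    return best_reward
```

The key point is that the best snapshot is saved when the reward peaks, so even if later updates would have diverged, you still keep the good policy.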
OK. Thanks for your help.