An implementation of the paper Effective Diversity in Population Based Reinforcement Learning.
Install pbrl, then clone this repo and start training:

```
git clone https://github.com/jjccero/DvD_TD3
cd DvD_TD3
python train_dvd.py
```
When the DPP kernel matrix uses the dot-product kernel (or cosine similarity, see loss.py) instead of the RBF kernel for its entries, a linear mapping can be applied to bring the values into [0, 1]. The beta term keeps the matrix positive definite.
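As a minimal sketch of this idea (not the exact code in loss.py; the embedding shape, the (x + 1) / 2 mapping, and the default beta are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def dpp_kernel(embeddings: torch.Tensor, beta: float = 0.01) -> torch.Tensor:
    """Build a DPP kernel matrix from per-policy behavioral embeddings.

    embeddings: (num_policies, embed_dim), e.g. each policy's actions on a
    shared batch of states, flattened into one vector per policy.
    """
    # Cosine similarity lies in [-1, 1]; map it linearly into [0, 1].
    normed = F.normalize(embeddings, dim=-1)
    kernel = 0.5 * (normed @ normed.t() + 1.0)
    # Adding beta * I keeps the matrix positive definite even when
    # two policies produce (nearly) identical embeddings.
    return kernel + beta * torch.eye(kernel.shape[0], device=kernel.device)
```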
I was not sure whether to take the logarithm of the determinant; the author believes this does not matter. In addition, I found that the numerical instability of logdet can cause exploding or vanishing gradients in the policy networks, so I optimize det instead of logdet.
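A sketch of that choice, assuming the dpp_kernel helper from the sketch above (again illustrative, not the exact loss.py code):

```python
def diversity_loss(embeddings: torch.Tensor, beta: float = 0.01) -> torch.Tensor:
    """Negative DPP diversity term, minimized alongside the TD3 policy loss.

    Uses det instead of logdet: logdet diverges as the determinant approaches
    zero (near-duplicate policies), which can destabilize policy gradients.
    """
    kernel = dpp_kernel(embeddings, beta)
    return -torch.det(kernel)
    # Alternative, closer to the paper but less stable in practice here:
    # return -torch.logdet(kernel)
```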
To scale observations and rewards (when obs_norm=True), each policy maintains its own local RunningMeanStd. When the central Q-function is used, a global RunningMeanStd is computed by aggregating the local ones.
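For reference, a minimal sketch of merging local statistics into global ones via the parallel-variance formula; the class layout below mirrors the common RunningMeanStd pattern and is an assumption, not necessarily pbrl's exact API:

```python
import numpy as np

class RunningMeanStd:
    """Running mean/var/count of a data stream (field names are illustrative)."""

    def __init__(self, shape=()):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4

def merge_rms(local_stats):
    """Combine per-policy (local) statistics into the global statistics
    needed by the central Q-function."""
    global_rms = RunningMeanStd(shape=local_stats[0].mean.shape)
    for rms in local_stats:
        delta = rms.mean - global_rms.mean
        total = global_rms.count + rms.count
        new_mean = global_rms.mean + delta * rms.count / total
        # Parallel-variance combination of the two sets of moments.
        m2 = (global_rms.var * global_rms.count + rms.var * rms.count
              + delta ** 2 * global_rms.count * rms.count / total)
        global_rms.mean, global_rms.var, global_rms.count = new_mean, m2 / total, total
    return global_rms
```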
Thanks to Jack Parker-Holder (the author of the paper) for his help. Feel free to get in touch with me if you have any questions about this implementation.