vwxyzjn/cleanrl

Are you interested in adding MPO


Hi, the MPO algorithm, first published in 2018, seems to have become DeepMind's preferred algorithm.

They published 2018a introducing it, followed by several variations and improvements: 2018b, 2020, 2022.

Recently, in 2023, they applied it to real robots to play soccer.

There are several official DeepMind implementations: a TensorFlow implementation and a JAX implementation, as well as an example that sets some hyperparameters differently from the implementation's defaults in order to "match the published results".

DeepMind's implementations are modular and rely on several of their own libraries, which makes the code difficult to follow. I therefore think it might be a good idea to add MPO to CleanRL. Other open versions I've come across didn't get it quite right.

I've re-implemented it in torch (the 2018b version) as a minimal one-file implementation in the style of CleanRL, and I can match the published results with it. So I wanted to know: would you be interested in adding it to CleanRL, and in benchmarking it as well?
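For reference, the core of the algorithm is an E-step that reweights sampled actions by a temperature-scaled exponential of their Q-values (with the temperature obtained from a dual function) and an M-step that fits the parametric policy to those weights by weighted maximum likelihood. The sketch below is a minimal, hedged illustration of that step in PyTorch; the `pi`, `pi_targ`, and `q` callables, the learnable temperature `eta`, the `eps_eta` constraint, and all shapes are assumptions for illustration and not the implementation being proposed here.

```python
# Minimal sketch of MPO's policy-improvement step (E-step + M-step), assuming:
#   pi(states), pi_targ(states) -> factorized torch.distributions.Normal over actions
#   q(states, actions)          -> Q-value estimates of shape (N, B, 1)
#   eta                         -> learnable positive scalar tensor (temperature)
import math
import torch


def mpo_policy_update(pi, pi_targ, q, states, eta, eps_eta=0.1, n_action_samples=20):
    n = n_action_samples
    with torch.no_grad():
        dist_targ = pi_targ(states)                      # target policy, kept fixed for this step
        actions = dist_targ.sample((n,))                 # (N, B, act_dim)
        s_rep = states.unsqueeze(0).expand(n, -1, -1)    # (N, B, obs_dim)
        q_vals = q(s_rep, actions).squeeze(-1)           # (N, B)

    # E-step: temperature dual  g(eta) = eta*eps + eta * E_s[ log (1/N) sum_a exp(Q(s,a)/eta) ]
    dual = eta * eps_eta + eta * (
        torch.logsumexp(q_vals / eta, dim=0) - math.log(n)
    ).mean()

    # Non-parametric improved policy: weights proportional to pi_targ(a|s) * exp(Q(s,a)/eta)
    weights = torch.softmax(q_vals / eta.detach(), dim=0)   # (N, B)

    # M-step: weighted maximum likelihood fit of the parametric policy
    log_prob = pi(states).log_prob(actions).sum(-1)          # (N, B), summed over action dims
    policy_loss = -(weights * log_prob).sum(dim=0).mean()

    # The 2018b variant additionally enforces decoupled KL(pi_targ || pi) trust
    # regions on the Gaussian mean and covariance in the M-step via extra
    # Lagrange multipliers; omitted here for brevity.
    return policy_loss + dual
```

In a one-file CleanRL-style script this step would typically sit inside the training loop next to the usual off-policy critic (TD) update and a periodic target-network update.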

@Jogima-cyber, thanks for raising the issue. MPO is an interesting algorithm that I would love to learn more about. It would be great to have a CleanRL MPO implementation and its associated benchmarks.

I have some quick questions.

  1. Is MPO preferred by DeepMind mainly for continuous control? I noticed that in the Muesli paper, MPO even underperforms vanilla policy gradient.
     (figure: algorithm comparison from the Muesli paper)
  2. Regarding "I can match the published results with it": very nice! Are you referring to the published results in DMC environments? We also have results on DMC environments with PPO, so it would be quite interesting to compare :)

To answer your two questions:

  1. Sorry, I wasn't precise enough; I meant robotics continuous control. For other continuous control tasks, they made a closed-source on-policy version of MPO called V-MPO.
  2. I've only tested it on Hopper-v4 for now, and I can match (even outperform, though I've only used 1 seed) the results presented in the introductory paper, 2018a.

Should I make a pull request with my MPO torch implementation?

I see. That's good to know. A PR with the MPO torch implementation sounds good. We'd probably want to run benchmark experiments in 24 envs with three random seeds, as in Figure 4. The contribution guide can be found here: https://docs.cleanrl.dev/contribution/.
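For illustration, a sweep of that shape could be scripted roughly as follows; the `cleanrl/mpo_continuous_action.py` filename and the `--env-id`, `--seed`, and `--track` flags are assumptions here, and the contribution guide and CleanRL's existing benchmark tooling would be the authoritative reference for the real command.

```python
# Hypothetical benchmark driver: run an (assumed) mpo_continuous_action.py
# script over a few environments with three random seeds each.
import itertools
import subprocess

env_ids = ["Hopper-v4", "Walker2d-v4", "HalfCheetah-v4"]  # illustrative subset of the 24 envs
seeds = [1, 2, 3]

for env_id, seed in itertools.product(env_ids, seeds):
    subprocess.run(
        [
            "python", "cleanrl/mpo_continuous_action.py",  # assumed script name
            "--env-id", env_id,
            "--seed", str(seed),
            "--track",  # assumed experiment-tracking flag
        ],
        check=True,
    )
```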

Thank you, I'm on it. Expect to hear from me soon regarding this (on Discord, I think). I'm closing the issue.