chandar-lab/Lifelong-Hanabi

Discussion on the baselines

Closed this issue · 3 comments

Hi, guys. This is really a great project!

I am interested in the choice of baselines. There are two branches of successful Hanabi algorithms: Q-learning (R2D2/SAD) and actor-critic (MAPPO). Both achieve SOTA performance on Hanabi. Have you benchmarked MAPPO, and do you think it fits the lifelong-learning framework?

Thanks!

Hi @peppacat,

That's a good point. You are probably referring to this paper, and I agree that it would be interesting to test the zero-shot coordination performance of MAPPO on Hanabi. It could certainly be used as one of the candidates in our population of agents.

However, according to the original Hanabi paper, a specific implementation of actor-critic methods didn't have great zero-shot coordination performance, even though it achieved high scores in a single Hanabi game.
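For illustration only (a minimal sketch, not code from this repository): the gap between a high score and good zero-shot coordination can be read off a pairwise score matrix. Here `scores[i][j]` is assumed to be the average game score when independently trained agent `i` is paired with agent `j`; the function name and the matrix layout are hypothetical.

```python
from statistics import mean


def self_play_and_cross_play(scores):
    """Return (self-play mean, cross-play mean) from a square score matrix.

    scores[i][j] is assumed to hold the average score obtained when
    agent i is paired with agent j (hypothetical layout).
    """
    n = len(scores)
    if n < 2 or any(len(row) != n for row in scores):
        raise ValueError("scores must be a square matrix with at least two agents")

    # Diagonal entries: each agent paired with a copy of itself.
    self_play = mean(scores[i][i] for i in range(n))

    # Off-diagonal entries: each agent paired with a partner it never trained with.
    cross_play = mean(
        scores[i][j] for i in range(n) for j in range(n) if i != j
    )
    return self_play, cross_play
```

A large difference between the two returned values would indicate agents that play well with themselves but coordinate poorly with unseen partners.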

Hadi

Closing this issue. Feel free to re-open it if you have more questions. Thanks!

Hi, sorry for the late reply. I fully understand your concern.

I have been reading a related paper recently. May I ask what the intuition is behind the idea that "actor-critic methods are not good at zero-shot coordination (or lead to non-diverse policies)"?