google-deepmind/open_spiel

RNaD negative loss and barely any correlation of loss with NashConv

kingsharaman opened this issue · 7 comments

I am training RNaDSolver on Kuhn Poker with this config:

config = rnad.RNaDConfig(
    game_name="kuhn_poker",
    trajectory_max=10,
    state_representation=rnad.StateRepresentation.INFO_SET,
    policy_network_layers=(128,),
    batch_size=256,
    learning_rate=1e-6,
    seed=42
)
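
For context, I drive the solver with essentially the obvious loop (a rough sketch; RNaDSolver and step() are the actual rnad.py API, the step count and printing are just mine):

solver = rnad.RNaDSolver(config)
for i in range(10_000):
    out = solver.step()  # the loss values discussed below come from this call
    if i % 500 == 0:
        print(i, out)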

2 things I cannot explain:

  1. There are negative loss values
  2. While the loss goes down more or less, it does not really correlate with the NashConv metric. In the beginning it does, but then the loss seems to stagnate while NashConv keeps going down nicely and smoothly, reaching 0.008.

The second one is a problem because I cannot calculate NashConv for more complicated games, so I am not sure what I should monitor to identify convergence.
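
For reference, on small games like Kuhn I evaluate NashConv by passing the solver itself to OpenSpiel's exploitability helper (a sketch; it assumes RNaDSolver exposes the standard Policy interface, which it does in the version I'm running):

import pyspiel
from open_spiel.python.algorithms import exploitability

game = pyspiel.load_game("kuhn_poker")
# every N training steps:
print(exploitability.nash_conv(game, solver))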

I am sorry but we can't help.

Unfortunately, the original authors are not responding to questions regarding R-NaD, so it is currently unmaintained.

We will mark it as such in the README and put up a call-for-help issue, but it might get removed in a future version.

Quick follow-up: there is no guarantee that the loss is correlated with NashConv. These are neural networks, and small changes in loss can lead to quite different policies, especially from the perspective of NashConv.

Have you checked your results against the R-NaD paper and/or the original F-FoReL paper? (I don't remember if we ever reported the values of the losses.)

@kingsharaman note that the loss returned by step() is actually a sum of two losses (value and NeRD), as seen here. It looks like the value loss is always positive, while the NeRD loss can take negative values, and so can their sum.
I've been playing with this super promising algorithm for quite a long time with no success so far, on a card game that should have much smaller complexity than Stratego. I even made a limited version of that card game and am currently running a long training session, evaluating against a player that takes random actions. From the start, the win rate of the agent being trained was ~54%; after peaking at ~56% around 5k steps, it gradually went down to 52% on average and stayed there until recently. After running for 11 days and ~486K steps, it still averages around 53%.
In their example they ran it on leduc_poker for 7M steps. In my case each step takes ~2 seconds at batch size 512, so there are still about 160 days to go. Hopefully a trend (if any) will become visible somewhere between 500K and 1M steps.
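
For anyone plugging in their own numbers, the back-of-the-envelope behind that estimate is simply:

total_steps = 7_000_000   # the leduc_poker example's step count
sec_per_step = 2.0        # measured at batch size 512
print(total_steps * sec_per_step / 86_400)  # ~162 days for a full run at this rate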

@lanctot thank you for the suggestion, there are really useful graphs in this paper!

@too-far-away thank you for the clarification! As for results, it worked for me on Kuhn Poker (see the config in the original question). As for performance, I profiled it and 50% of the time is spent selecting the random action from the policy (np.random.choice). As for your game, depending on how much luck is involved, a 56% win rate against a random player could mean anything from superhuman play to a first-time player's strategy.
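
If the sampling overhead bites anyone else: a single-draw inverse-CDF sampler is typically much faster than np.random.choice with p= (a sketch of the idea only, not something from rnad.py; where to plug it in depends on your copy):

import numpy as np

_rng = np.random.default_rng()

def sample_action(probs):
    # Same distribution as np.random.choice(len(probs), p=probs), but one
    # cumsum + searchsorted instead of choice()'s per-call validation overhead.
    idx = int(np.searchsorted(np.cumsum(probs), _rng.random(), side="right"))
    return min(idx, len(probs) - 1)  # guard: cumsum may end slightly below 1.0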

...also playing around with the learning rate might help. It was also written somewhere here that the gradient clip of 10000 might be too high (for NeRD and overall; there are 2 clip settings). I set both to 10.
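
Concretely, that looks something like the following (a sketch; the field names clip_gradient and nerd=rnad.NerdConfig(clip=...) match the rnad.py I looked at, so double-check them against your version):

config = rnad.RNaDConfig(
    game_name="kuhn_poker",
    clip_gradient=10.0,               # overall gradient clip, 10_000 by default
    nerd=rnad.NerdConfig(clip=10.0),  # NeRD-specific clip, also 10_000 by default
)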

@marton-avrios I used a learning rate of 1e-7 and didn't touch the clip. In my case trajectory_max=100, so the games are much longer and it takes more time to generate a batch of 512 games, although I had it parallelized.
Has setting the clip to 10 made things better?

Not that much, but mine was only a toy example.