mokemokechicken/reversi-alpha-zero

AlphaZero Approach

Closed this issue · 2 comments

Hi,

Great work with your repository, impressive stuff. I'm just interested: when you run the software with self-play and optimisation at the same time, how many self-play games do you aim to complete between each new model the optimiser releases? I ask because I would have thought that if not enough games are completed, the model would over-fit.

Thanks, Jack

Hi @JackThomson2

how many self-play games do you aim to complete between the optimiser releasing a new model?

I think the "training/self-play ratio" mentioned in #38 is important.
Ideally it should be close to 1.

In my current environment,

  • self-play
    • about 250 self-play games per hour
    • about 400 positions per game
    • so, about 100k positions generated per hour
  • optimize
    • 200 steps per 215 seconds (no wait)
      • 200 steps per 430 seconds (including wait)
    • 256 positions per step
    • 3600/430 * 200 * 256 ≒ 428k positions per hour

So the training/self-play ratio is about 4.28.
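The arithmetic above can be sketched as a few lines of Python (the variable names are mine; the numbers are the throughput figures from my current environment):

```python
# Self-play throughput
games_per_hour = 250
positions_per_game = 400
self_play_positions_per_hour = games_per_hour * positions_per_game  # 100,000

# Optimizer throughput, using the wall-clock time including wait
steps = 200
seconds_per_steps = 430  # time to run 200 training steps
batch_size = 256         # positions consumed per step
train_positions_per_hour = 3600 / seconds_per_steps * steps * batch_size  # ~428,651

# Training/self-play ratio: how many times each generated position
# is (on average) seen by the optimizer per hour
ratio = train_positions_per_hour / self_play_positions_per_hour
print(f"ratio = {ratio:.2f}")  # ~4.29
```

A ratio well above 1 means the optimizer re-reads the same self-play positions several times before fresh games replace them, which is the over-fitting risk discussed here.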

I wonder as I would have thought if not enough games are completed the model would over-fit?

I think so too.
In practice, when the ratio was high (around 30), the training loss tended to be low.
However, I felt that the model's playing strength stopped improving beyond a certain point.

Thanks for this detailed reply, very insightful. In my project I see much better results with the AlphaGo approach; I'm not sure what I am doing wrong. Keep up the great work with this though, I look forward to seeing your progress.