mokemokechicken/reversi-alpha-zero

It may forget pertinent information about positions that it no longer visits.

apollo-time opened this issue · 21 comments

I see that my model no longer improves.
Moreover, when an unusual action is selected, I see what ThomasWAnthony described: "It may forget pertinent information about positions that it no longer visits."
@mokemokechicken, @gooooloo What do you think?

@apollo-time

I think that is possible,
and if we want to keep improving the model, we need a larger sim_per_move and a larger self-play dataset.

I have a simple hypothesis:

  • the upper bound on the model's performance (= strength) is decided by sim_per_move.
  • the speed of change (≒ improvement) is decided by the speed of generating self-play data and the size of the self-play dataset (a smaller dataset changes faster).
  • the generalization performance of the model is decided by the size of the self-play dataset (a larger dataset generalizes better).

So I feel that gradually increasing sim_per_move and the dataset size is effective.
(I think humans do the same to become professionals.)

I think a larger sim_per_move and self-play dataset can't resolve the "no longer visits" problem, because the unusual positions can't be selected by self-play MCTS.
So I am trying to select a fully random action sometimes in self-play, and to ignore the history that precedes the random action.
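The idea above could be sketched roughly as follows. This is only one reading of it, not code from the repository: the callbacks (`legal_moves`, `apply_move`, `is_terminal`, `choose_mcts_move`) and the `random_move_prob` parameter are hypothetical names, and "ignoring the previous history" is interpreted as discarding the positions recorded before the random move.

```python
import random


def self_play_with_random_moves(initial_state, legal_moves, apply_move,
                                is_terminal, choose_mcts_move,
                                random_move_prob=0.05, rng=None):
    """Play one game and return the (state, move) records kept for training.

    All callbacks are hypothetical placeholders for a real game/MCTS API.
    """
    rng = rng or random.Random()
    state = initial_state
    records = []
    while not is_terminal(state):
        moves = legal_moves(state)
        if rng.random() < random_move_prob:
            # Fully random move to reach positions MCTS would not visit.
            move = rng.choice(moves)
            # Assumption: positions before the random move are dropped,
            # because the final result no longer reflects the searched play.
            records.clear()
        else:
            move = choose_mcts_move(state, moves)
            records.append((state, move))
        state = apply_move(state, move)
    return records


if __name__ == "__main__":
    # Toy "game": count from 0 up to 10 in steps of 1 or 2.
    kept = self_play_with_random_moves(
        initial_state=0,
        legal_moves=lambda s: [1, 2],
        apply_move=lambda s, m: s + m,
        is_terminal=lambda s: s >= 10,
        choose_mcts_move=lambda s, moves: moves[0],
        random_move_prob=0.2,
        rng=random.Random(0),
    )
    print(len(kept), "positions kept for training")
```

Whether dropping the earlier positions is preferable to keeping them with a reduced weight is an open design choice; the sketch only shows the simpler option.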

@mokemokechicken I asked @gooooloo a similar question in another thread, but what is the default ratio of the number of games per gradient update in your algorithm? I guess the ratio is important for performance, since it behaves like sims/move, which is undoubtedly important.

@AranKomat

what is the default ratio of the number of games per gradient update in your algorithm?

I am not sure which concrete number answers that, but the measured speeds are as follows.

setting

  • batch size: 256
  • sim per move: 400
  • (nb_game_in_file, max_file_num): (5, 300)

speed

  • 80 seconds per 1 self-play game
  • 400 positions per 1 self-play game
  • 150 seconds per 200 steps(bs=256) -> 150 seconds per 200*256 positions

so

  • Training: 341 positions / second (=200*256/150)
  • SelfPlay: 5 positions / second (=400 / 80)
  • Training/SelfPlay Ratio: 68 (=341/5)

This may mean that each position is trained on about 68 times, regardless of (nb_game_in_file, max_file_num).
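The arithmetic above can be reproduced with a small helper. This is just a sketch of the calculation in this comment; the function and parameter names are made up for illustration and are not part of the repository.

```python
def training_selfplay_ratio(steps, batch_size, train_seconds,
                            positions_per_game, seconds_per_game):
    """Return (training rate, self-play rate, ratio) in positions per second."""
    train_rate = steps * batch_size / train_seconds        # positions trained / s
    selfplay_rate = positions_per_game / seconds_per_game  # positions generated / s
    return train_rate, selfplay_rate, train_rate / selfplay_rate


# Single-process setting: 200 steps of batch size 256 in 150 s,
# 400 positions per game, 80 s per game.
train_rate, selfplay_rate, ratio = training_selfplay_ratio(200, 256, 150, 400, 80)
print(round(train_rate), selfplay_rate, round(ratio))  # 341 5.0 68
```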

Thanks for your answer. In the case of Go with AlphaZero, 700k minibatches (2048 positions each) were trained on and 21 million self-play games were played. Assuming that each game ended with 150 stones (positions) placed, 700k x 2048/(21m x 150)=0.44 [trained position]/[self-play-generated position], which is much less than 68. So I guess you can improve your performance with more self-play per update. Maybe the performance gain from increasing sims/move from 100 to 800 came about because you had a small self-play/training ratio, that is, too little exploration. Since generating more games yields more diverse data than running more sims/move, spending more time on self-play may be more beneficial than more sims/move. In practice, though, since your algorithm doesn't allow multi-processing (of multiple games) as done by Akababa, my suggestion may not be useful. But it may be useful for @gooooloo.
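For reference, the AlphaZero estimate can be checked in a few lines. The 150 positions per game is the assumption stated in the comment, not a published figure, and the variable names are illustrative. Evaluated exactly, the expression comes out slightly above 0.45, in the same range as the roughly 0.44 quoted.

```python
minibatches = 700_000        # training steps reported for AlphaZero Go
batch_size = 2048            # positions per minibatch
selfplay_games = 21_000_000  # self-play games
positions_per_game = 150     # assumption from the comment above

trained_positions = minibatches * batch_size
generated_positions = selfplay_games * positions_per_game
az_ratio = trained_positions / generated_positions

print(f"trained / generated = {az_ratio:.3f}")
print(f"baseline ratio of 68 is about {68 / az_ratio:.0f}x larger")
```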

@AranKomat

I guess you can improve your performance with more self-plays per update.

I think so too.
In my environment, GPU usage is already 100% (from self-play and training),
but implementing multiprocess self-play should still increase the number of self-play games per training step.

So I am planning to implement multiprocess self-play.
However, I am still considering whether it will really work with the present method.

I am testing on feature/multiprocess_selfplay.

With 16-way parallel self-play:

  • 580 seconds per 1 self-play game (16 parallel) -> 36 seconds per self-play game
  • 400 positions per 1 self-play game
  • 225 seconds per 200 steps(bs=256) -> 225 seconds per 200*256 positions

so

  • Training: 228 positions / second (=200*256/225)
  • SelfPlay: 11 positions / second (=400 / 36)
  • Training/SelfPlay Ratio: 21 (=228/11)

Cool. So, multi-processing successfully decreased the ratio and achieved 36s per game under 400 sims/move. Now, it suffices to elucidate the trade-off between training/selfplay ratio and sims/move. I'm excited for your subsequent announcements!

I also added a wait to the optimizer to change the ratio.

Now,

  • 164 self-play games per hour -> 22 (=3600/164) seconds per self-play game
  • 400 positions per 1 self-play game
  • 225 * 2 seconds per 200 steps(bs=256) -> 450 seconds per 200 * 256 positions

so

  • Training: 113 positions / second (=200*256/450)
  • SelfPlay: 18 positions / second (=400 / 22)
  • Training/SelfPlay Ratio: 6.2 (=113/18)
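The effect of the optimizer wait on the ratio can be estimated ahead of time. The helper below is a hypothetical sketch, not the repository's implementation: it assumes the wait is simply added to the time taken by a fixed number of training steps, and that the self-play rate does not change when the optimizer idles (in practice self-play may speed up once the GPU is less contended).

```python
def wait_for_target_ratio(steps, batch_size, train_seconds,
                          selfplay_positions_per_second, target_ratio):
    """Seconds of idle time to add per `steps` training steps so that
    trained positions / generated positions is close to `target_ratio`."""
    trained_positions = steps * batch_size
    # Wall-clock time the steps must occupy for self-play to keep pace.
    required_seconds = trained_positions / (
        target_ratio * selfplay_positions_per_second)
    # No wait is needed if training is already slow enough.
    return max(0.0, required_seconds - train_seconds)


# 200 steps of batch size 256 take 225 s; self-play yields about 18 positions/s.
for target in (6.0, 3.0, 1.0):
    wait = wait_for_target_ratio(200, 256, 225, 18, target)
    print(f"target ratio {target}: wait about {wait:.0f} s per 200 steps")
```

Under these assumptions, doubling the step time from 225 to 450 seconds corresponds to a target ratio a little above 6, which matches the measurement above.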

@AranKomat

Mine is:

  • 30 processes for self-play, about 150 seconds per game per process, which gives 5 seconds per game on average.
  • about 12 minutes per 100 training steps, batch size = 3072, which gives 426 positions per second (=3072*100/12/60)

I actually don't understand the number below that @mokemokechicken mentioned:

400 positions per 1 self-play game

But if I just use this number, then my self-play speed is 80 positions per second (=400/5).
Then Training/SelfPlay Ratio: 5.3 (=426/80)

Thanks @AranKomat . I didn't see this post until just now...

I guess you can improve your performance with more self-plays per update

Yes, I think so too. DeepMind used 2000+ or 4000+ TPUs for self-play (as Aja Huang said in a post; I just can't remember the link). This shows how important self-play throughput is.

Maybe the performance gain by increasing the sims/move from 100 to 800 was because you had a small self-play/training ratio, that is, you had too little exploration.

Actually, I was getting a smaller self-play/training ratio when I increased sims/move from 100 to 800. Although I also introduced a multi-process implementation at that time, the overall self-play game speed was a little slower than before. Yet I observed an improvement in the AI's strength.

@gooooloo In AlphaZero, a staggering 5000 TPUs were used, so I totally agree. It's weird but nice that increased sims/move resulted in a smaller ratio. Hopefully, @mokemokechicken and others will observe a similar phenomenon.

400 positions per 1 self-play game

Note:
I used (nb_game_in_file, max_file_num)=(5, 300), so the number of total games in training data was 1500 (games).
My training dataset size was about 600k (positions).
So, 600k / 1500 = 400 (position/game).

But a reversi game has at most 60 positions to move in, doesn't it? Even with up to 5 "PASS" moves, it is 65. Then even with flips and rotations of the game state, it is at most 260.

UPDATE:
Oh, my fault: "flip and rotation" gives a x8 multiplication, not x4. Then it makes sense. 400/8=50, so you are playing 50 moves per game, given that you have a resignation mechanism.
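The x8 factor comes from the four rotations of the board, each with and without a mirror flip. A minimal pure-Python sketch (the function names are illustrative, and a real pipeline would also have to transform the policy target in the same way):

```python
def rotate90(board):
    """Rotate a square board (tuple of row tuples) 90 degrees clockwise."""
    return tuple(zip(*board[::-1]))


def mirror(board):
    """Flip the board left to right."""
    return tuple(tuple(row[::-1]) for row in board)


def symmetries(board):
    """Return the 8 rotated/flipped variants of a square board."""
    variants = []
    current = tuple(tuple(row) for row in board)
    for _ in range(4):
        variants.append(current)
        variants.append(mirror(current))
        current = rotate90(current)
    return variants


# A tiny asymmetric board makes all 8 variants distinct.
board = ((1, 2),
         (3, 4))
print(len(set(symmetries(board))))  # 8

# 400 stored positions per game / 8 symmetries = 50 moves actually played.
print(400 // 8)  # 50
```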

... had a small self-play/training ratio

It's weird ... that increased sims/move resulted in a smaller ratio

The ratio is # of self-play moves / # of trained moves. I increased the # of sims per move, so self-play got slower and the # of self-play moves got smaller. But the training module did not change. So the overall ratio got smaller, didn't it?

@gooooloo Sorry, I thought you were talking about the training/self-play ratio, but it was the opposite. My mistake. I also agree with you about the number of positions per game.

@AranKomat I made a calculation mistake. Please see that post again; I have corrected it.

@gooooloo Well, that makes sense. But when I said 150 stones for an average Go game, I didn't take the symmetries into account, so for a fair comparison I didn't consider the symmetries of reversi either, which has the same set of symmetries as Go. Sorry for not being explicit. What we're concerned with is our training/self-play ratio (5.3 after symmetries) versus AZ's training/self-play ratio (about 0.44, or 0.44/8=0.055 after symmetries), so there's still about a 100-fold difference, which is reasonable given the number of GPUs we're using.

It is strange for the training/self-play ratio to fall below 1. That would mean some positions are never used in training.
So I think the ratio was almost 1.

The ratio of 0.44 was obtained from AlphaZero, where symmetry wasn't exploited. Also, Shogi and Chess cannot exploit symmetries, so they set the self-play vs training ratio of AlphaZero based on the assumption that self-play data isn't necessarily as plentiful as in symmetric games. Without symmetry, the ratio is 0.44, which is closer to 1. The ratio for Shogi and Chess may be even closer to 1. Also, in symmetric games without symmetric data augmentation, the NN quickly learns symmetry, which was demonstrated by AZ being superior to AGZ in Go. Considering the eventual meaninglessness of symmetric data augmentation, the net ratio of @gooooloo becomes 5.3*8=42.4. So, he needs at least 42 times more GPUs for self-play to get to 1.

@AranKomat @mokemokechicken I double-checked my pipeline's performance: it should be 25 processes at 180 seconds per game per process, which gives about 7 seconds per game on average. So my ratio should be about 7.*(=426/(400/7)), not 5.3.
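The per-process averaging and the formula quoted above can be evaluated directly. This is only a sketch of the arithmetic in these comments; the helper name is invented, and it assumes all processes finish games at the same steady rate.

```python
def average_seconds_per_game(seconds_per_game_per_process, processes):
    """Effective wall-clock seconds per finished game across all processes."""
    return seconds_per_game_per_process / processes


train_rate = 3072 * 100 / (12 * 60)  # positions trained per second
positions_per_game = 400             # figure borrowed from the earlier comment

for processes, seconds in ((30, 150), (25, 180)):
    per_game = average_seconds_per_game(seconds, processes)
    ratio = train_rate / (positions_per_game / per_game)
    print(f"{processes} processes: {per_game:.1f} s/game, ratio about {ratio:.1f}")

# The formula as written in the comment, with rounded inputs.
print(f"426/(400/7) = {426 / (400 / 7):.2f}")
```

The result shifts a little depending on whether 180/25 is rounded down to 7 seconds before the division.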