mokemokechicken/reversi-alpha-zero

Is it multiple searching at the same time?


coroutine_list = []
for it in range(self.play_config.simulation_num_per_move):
    # one coroutine per MCTS simulation for this move
    cor = self.start_search_my_move(own, enemy)
    coroutine_list.append(cor)
# a single prediction worker batches the NN requests from all simulations
coroutine_list.append(self.prediction_worker())
loop.run_until_complete(asyncio.gather(*coroutine_list))

At first I thought this code was searching in simulation_num_per_move threads at the same time.
But I see the async functions are not called in multiple threads.
Is that right, and how can I run the tree search in multiple threads?

Agreed. I've got a system here with an Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz and two 1080 Ti's. Running run.py self is not saturating GPU:0, nor is it saturating the CPU cores. It should be possible to run the MCTS simulations fully in parallel, right, and thus saturate the GPU? I'm fairly new to asyncio, but this seems to indicate that true parallelism is possible (it does not).

@apollo-time

At first I thought this code was searching in simulation_num_per_move threads at the same time. But I see the async functions are not called in multiple threads.
Is that right, and how can I run the tree search in multiple threads?

Yes, the async functions are not called in multiple threads.
If you want to run MCTS in multiple threads (or processes), the implementation needs to be changed drastically.

In CPython,
if you want to use multiple CPU cores,
it is necessary to use multiprocessing rather than multithreading.
(see: https://stackoverflow.com/questions/4496680/python-threads-all-executing-on-a-single-core)
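To make the difference concrete, here is a toy, CPU-bound comparison (my own illustration, not code from this repo): the threaded version is serialized by the GIL, while the process pool actually uses multiple cores.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    # pure-Python, CPU-bound work that never releases the GIL
    s = 0
    for i in range(n):
        s += i * i
    return s

def timed(executor_cls):
    start = time.time()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(burn, [2_000_000] * 4))
    return time.time() - start

if __name__ == "__main__":
    print("threads:   %.2fs" % timed(ThreadPoolExecutor))   # roughly serial because of the GIL
    print("processes: %.2fs" % timed(ProcessPoolExecutor))  # uses multiple cores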

Hi @Error323,

If you have rich computation resources, there are some ways to take advantage of them.

  • You can run multiple self processes at the same time (a launcher sketch follows this list).
  • Although it is difficult to accelerate opt manually, the Keras/TF library can already use rich computation resources well.
  • eval can be accelerated by running games in parallel (this requires changing the implementation), or you can try the AlphaZero method that does not use the eval process (I am trying https://github.com/mokemokechicken/reversi-alpha-zero/tree/feature/alphazero).
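For the first point, a minimal launcher sketch (the path and worker count are illustrative; adjust them to your checkout and hardware):

import subprocess
import sys

N_WORKERS = 2  # e.g. one or two self-play workers per GPU, if memory allows

procs = [subprocess.Popen([sys.executable, "run.py", "self"]) for _ in range(N_WORKERS)]
for p in procs:
    p.wait()

Since the (s, a) statistics are not shared between processes and self-play is stochastic, each worker generates different games.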

Hi @mokemokechicken,

Thanks for your response. I don't see how multiple self instances follow different trajectories. Can you elaborate? The (s, a) bookkeeping isn't shared across instances, right?

@Error323

The (s, a) bookkeeping isn't shared across instances, right?

Right.

how multiple self instances follow different trajectories. Can you elaborate?

Self-play has some randomness, so each instance generates different games.
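Concretely, the randomness comes mainly from sampling each move from the MCTS visit counts (with a temperature) and from Dirichlet noise mixed into the root prior. A schematic sketch, not the repo's exact code, with illustrative parameter values:

import numpy as np

def choose_move(visit_counts, temperature=1.0):
    # Sample a move in proportion to N(s, a)^(1/T); as T -> 0 this becomes greedy.
    pi = visit_counts ** (1.0 / temperature)
    pi = pi / pi.sum()
    return np.random.choice(len(pi), p=pi), pi

def noisy_root_prior(prior, epsilon=0.25, alpha=0.5):
    # Mix Dirichlet noise into the root prior so every game explores differently.
    return (1 - epsilon) * prior + epsilon * np.random.dirichlet([alpha] * len(prior))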

Right, I see. I'm also still confused as to what exactly defines a training example (s_t, pi_t, z_t). Is z_t in {-1,0,+1}? Is s_t then any node in the game tree?

Is z_t in {-1,0,+1}?

Yes.
All z_t are the same as z_FINAL.

Is s_t then any node in the game tree?

The set of s_t includes states that actually appeared in the game.
It does not include states that appeared only in MCTS simulation nodes.
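So, schematically, one finished game turns into training examples like this (names are illustrative, not the repo's exact API):

def make_training_examples(states, search_policies, final_outcome):
    # states[t]          : board planes from the player-to-move's perspective at move t
    # search_policies[t] : pi_t, the MCTS visit-count distribution used to pick move t
    # final_outcome      : +1 / 0 / -1 from the first player's perspective
    examples = []
    for t, (s, pi) in enumerate(zip(states, search_policies)):
        # same game result for every t, sign flipped per side (assuming states are
        # stored from the player-to-move's perspective)
        z = final_outcome if t % 2 == 0 else -final_outcome
        examples.append((s, pi, z))
    return examples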

OK, so a single game produces a dataset with a single outcome z from a player's perspective. Each game move is obtained by MCTS trajectories. Each node in each trajectory is guided by the neural network? And each trajectory simulates to the end of a game or a resign threshold, right?

This implies that the value head of the NN only trains on z samples in {-1, 0, +1} and from that produces a real-valued number indicating the expected reward. Why is this better than learning from the mean z(s)?

Each node in each trajectory is guided by the neural network?

Yes.
Basically, each move is guided by policy P and value V of NN output.

And each trajectory simulates to the end of a game or a resign threshold, right?

Do you mean like rollouts? AlphaGo Zero (and AlphaZero) does not use rollouts.
In AlphaGo Zero, for example,
each move is decided by 1600 simulations.
Basically, each simulation finishes by expanding a leaf node (or reaching the end of the game).

This implies that the value head of the NN only trains on z samples in {-1, 0, +1} and from that produces a real-valued number indicating the expected reward. Why is this better than learning from the mean z(s)?

In the end, the NN learns the mean of z(s) as the value V.
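A one-line way to see why (a standard least-squares argument, not specific to this repo): with the mean-squared-error loss used on the value head,

v^{*}(s) = \arg\min_{v} \mathbb{E}\big[(v - z)^{2} \mid s\big] = \mathbb{E}[z \mid s],

so even though every target z_t is in {-1, 0, +1}, the trained value head outputs the expected outcome, i.e. the mean of z over games passing through s.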

Ahhh this makes things click for me. Thank you.

To get back on topic: I think the Python GIL is presenting a bottleneck here. Ideally you'd want to start multiple agents from a single executable for a single physical machine such that the NN queues never stall and (s,a) bookkeeping is optimal.

Ideally you'd want to start multiple agents from a single executable for a single physical machine such that the NN queues never stall and (s,a) bookkeeping is optimal.

That is certainly so.
Ideally the prediction function is shared by many MCTS processes (it never stalls and saves GPU memory).
Maybe it is even better for the prediction worker to serve predictions over TCP/IP for multi-process MCTS.
(I guess DeepMind did so.)
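A minimal sketch of that idea (my own illustration, not the repo's code) using multiprocessing pipes instead of TCP; the NN call is stubbed with random outputs:

import multiprocessing as mp
import numpy as np
from multiprocessing.connection import wait

def prediction_worker(conns):
    # Answer prediction requests from all searchers, batching whatever is pending.
    while conns:
        ready = wait(conns, timeout=0.01)
        batch, senders = [], []
        for c in ready:
            try:
                batch.append(c.recv())
                senders.append(c)
            except EOFError:              # a searcher finished and closed its pipe
                conns.remove(c)
        if not batch:
            continue
        states = np.stack(batch)                                        # one batched forward pass
        policies = np.random.dirichlet(np.ones(64), size=len(states))   # stub for the policy head
        values = np.random.uniform(-1, 1, size=len(states))             # stub for the value head
        for c, p, v in zip(senders, policies, values):
            c.send((p, v))

def mcts_process(conn, n_requests=5):
    for _ in range(n_requests):
        conn.send(np.zeros((2, 8, 8), dtype=np.float32))  # dummy board planes
        policy, value = conn.recv()
        # ... expand the MCTS node with (policy, value) ...

if __name__ == "__main__":
    server_ends, client_ends = zip(*[mp.Pipe() for _ in range(4)])
    searchers = [mp.Process(target=mcts_process, args=(c,)) for c in client_ends]
    for s in searchers:
        s.start()
    server = mp.Process(target=prediction_worker, args=(list(server_ends),), daemon=True)
    server.start()
    for s in searchers:
        s.join()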

Did you test with a multithreaded search instead of coroutines?

Sorry guys, I made a very stupid mistake in my local test. My previous conclusion about single-thread performance was totally wrong. Please ignore it. Sorry about that.

https://github.com/Akababa/Chess-Zero/blob/nohistory/src/chess_zero/agent/player_chess.py
I see @Akababa implemented a multithreaded search tree, but I know Python multithreading is not truly parallel because of the GIL.
OMG.

Hi, nice project! It's been very useful to us working on the chess version. Please feel free to use any part of my code if it's helpful.

I did implement multithreading and got a speedup over asyncio, but as @apollo-time pointed out, it was still limited by the GIL. So I made a multiprocessing worker using pipes (similar to @mokemokechicken's idea of using TCP sockets), which basically maxed out my GPU usage. However, the drawback of my implementation is that Windows only supports the SELECT system call for 63 objects, so it may not be enough for those with better GPUs.

Happy holidays!

@Akababa I believe multithreading has the same speed as asyncio in Python.

I did get a significant speedup somehow, although it might be because I rewrote almost the entire player.py and made a bunch of subconscious optimizations along the way. It's hard to say for sure but at least I think it's cleaner and easier to understand with locks than while loops and time.sleep.

@mokemokechicken @Akababa and others: thanks for developing an open-source AlphaZero. If I understand correctly, there's still a huge difference in per-move time between Leela Zero and ReversiAZ, as you may be aware. With a single GTX 1080 Ti, it takes about 100s per game with ReversiAZ, which translates to 100/64 = 1.6s per move. On the other hand, a move takes 0.5s for Leela Zero (by calling multiple leelaz instances), without cudnn, with the larger state space of Go (which quadruples the FLOPs of each inference) and with double the number of simulations per move, i.e., 13 times faster, not taking the speedup from cudnn into account; this may be alleviated if one calls multiple self's of self-play in ReversiAZ. This is based on the assumption that there is not much difference between these algorithms in terms of the depth of the architecture and the number of filters, so please correct me if I'm wrong. Akababa claimed a 2x speedup over Reversi Alpha Zero, though.

I've heard that C++ makes classical MCTS 1000x faster than Python, but I don't see why it would make AlphaZero significantly faster, since the speed bottleneck resides in NN inference during self-play, which apparently has nothing to do with C++ or Python. (Leelaz uses OpenCL without cudnn, whereas ReversiAZ uses CUDA+cudnn, and in Akababa's case GPU usage is already 100%.) For faster inference, one has to have a large mini-batch of various states s. I don't know what the average minibatch size of ReversiAZ is (could you tell me your estimate?), but I hope it can somehow be scaled up arbitrarily (like 2048), since I'm going to use AlphaZero for text generation, which enables a large minibatch with a single GPU. In any case, though I'm not familiar with the GIL or multithreading, if ReversiAZ uses a moderate size of minibatch like 32 or 64 for inference, can we conclude that GIL is the real culprit? Then, can't we just use C++ only at this point by taking a relevant snippet from Leelaz?

By the way, if you look at Figure 3 of the AGZ paper and Figure 2 of the AZ paper, you may notice that decreasing the number of simulations from 800 (corresponding to the origin of the graph of Fig. 2) may have only a marginal effect on the resulting performance in a practical sense, even though the Elo scales used in these figures cannot be compared directly. In practice, we will never achieve convergence with a large architecture such as the one used in the original paper. Fig. 3 shows that the largest increase in Elo occurred in the first 100k iterations. This means that, to maximize Elo with our limited budget, it may be better to reduce the number of simulations by 10-fold and achieve 10x the iterations, as we probably can't reach 100k iterations otherwise, and this will give us the maximum performance with a limited budget. If the number of simulations is rather immaterial in our case, then the size of the architecture becomes rather important and therefore cannot be compromised much. My claim is somewhat supported by the facts that DeepMind decided to reduce the number of sims from 1600 to 800, that the performance of AG greatly increased after adopting a larger architecture (i.e., the performance increase relies on the NN rather than a large number of simulations), and that no rollout was needed (again demonstrating it's better to spend more time on the NN over simulations). Is there any obvious counterexample to my claim?

Well for one I believe LZ uses a 5x64 model while we're using 7x256, and most of the FLOPs are in the conv layers, not the input layer. LZ may be more I/O constrained though because of the large input.

I believe "classical MCTS" is just playout to a terminal node which is assumed to have a free evaluation, so that's what the 1000x speedup is referencing. Or in AZ case it allows them to use a single CPU to run MCTS on several tensor processors. In general Python isn't the first choice for applications like this but it's convenient to collaborate and do small-scale stuff in.

There's probably someone who knows more about the simulations-noise tradeoff of MCTS than me, but I think your 10x estimation is reasonable if you have roughly equal training and inference power.

@Akababa Thank you for your response. It's good to know that LZ uses only 64 filters and to confirm that the setting tried in ReversiAZ is the 'normal' one rather than 'mini.' But I didn't get what the 5 or 7 stands for. I understand that most of the FLOPs reside in the conv layers. Since there's no stride larger than 1, if I understand correctly, every conv layer takes bs x 256 x 19 x 19 (for Go) as input and gives bs x 256 x 19 x 19 as output, which makes the FLOPs gigantic. So the FLOPs are proportional to the state size times the square of the filter count. Since I now learned that LeelaZ uses 64 filters, ReversiAZ is slightly faster (16/13 = 1.2 times) per move, which relieved me, but it should be much faster, since it uses cudnn. Is the 5 or 7 I mentioned above something that accounts for this?

My assumption that C++ makes AZ significantly faster than Python was based on my now possibly false assumption that LeelaZ is much faster than ReversiAZ, so never mind about that. I greatly appreciate your clarification about the difference in the number of filters used. I guess the use of Python is unavoidable due to its vast applicability to other deep learning work.

What does the "noise" in "simulation-noise" refer to? Does it refer to the number of iterations? I'm thinking of a single 1080 Ti, and since inference costs about 5000 times as much as training does, mostly due to the simulations, I believe this will work. If we encourage people to report their new neural-MCTS algorithms based on their Elo rating on Reversi with 10~50 simulations per move and a much smaller architecture, this may become a convenient benchmark like ImageNet. For this, the architecture's spatial scale should probably shrink and expand like an encoder-decoder to reduce the FLOPs. Also, rather than fixing a low number of sims/move, we can decrease the number of simulations from 800 as time goes on.

Yeah, you're absolutely right about the 19x19, I missed that part. RAZ has 7 residual blocks while LZ has 5, and FLOPs are roughly proportional to the number of blocks. (Coincidentally, chess and reversi are both 8x8, lol.)

Using Python isn't so much necessary (as LZ demonstrated) as convenient for development, at the cost of convenience for the end user (which is more important in the long run). I'm still tinkering with the models (I want a good balance between expressive power and practicality), but @benediamond has started a promising port of LZ.

I guess what I mean by "simulation-noise tradeoff" is that as the number of simulations approaches infinity, the MCTS value converges to the minimax value. So fewer simulations will give a "noisier" minimax value.

Oh, does RAZ have 7 res blocks? According to "normal.py," it has "res_layer_num = 10," though? If 5 vs 10 is the case, then RAZ is 2.4 times faster than LZ, which is decent.

I agree with the long-run importance, and I now see the meaning of noise.

Hi @AranKomat

I don't know what the average minibatch size of ReversiAZ is (could you tell me your estimate?)

Although I just enabled this log and watched it,
in the current configuration (parallel_search_num=8), the average batch size is about 1 ~ 2.

It was much smaller than I thought (I imagined 6 ~ 8).
It means that several MCTS coroutines were waiting to expand the same node.

For testing, I changed virtual_loss from 3 to 15, and the batch size became about 4 ~ 5.
With virtual_loss=30, the batch size became about 5 ~ 7, and the game time was about halved (and the play was not much weaker).

In Reversi there are far fewer legal moves than in Go, so now I think it is better to set virtual_loss to 30.
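For anyone unfamiliar with virtual_loss: while a simulation is "in flight" (waiting for the NN), the edges it walked through are temporarily penalized, so other concurrent simulations choose different paths; that is why a larger virtual_loss yields larger prediction batches. A schematic sketch (illustrative, not the repo's exact code):

from dataclasses import dataclass

VIRTUAL_LOSS = 3  # the value discussed above; larger values spread parallel searches wider

@dataclass
class Edge:
    n: float = 0.0  # visit count N(s, a)
    w: float = 0.0  # total value W(s, a)
    q: float = 0.0  # mean value Q(s, a)

def apply_virtual_loss(edge):
    # Pretend the in-flight simulation has already lost through this edge,
    # discouraging other concurrent simulations from selecting it.
    edge.n += VIRTUAL_LOSS
    edge.w -= VIRTUAL_LOSS

def backup(edge, value):
    # On backup, undo the virtual loss and record the real simulation result.
    edge.n += 1 - VIRTUAL_LOSS
    edge.w += value + VIRTUAL_LOSS
    edge.q = edge.w / edge.n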

if ReversiAZ uses a moderate size of minibatch like 32 or 64 for inference, can we conclude that GIL is the real culprit? Then, can't we just use C++ only at this point by taking a relevant snippet from Leelaz?

I also think using C++ in the MCTS will improve the speed.
However, it is hard to predict how much faster it will be.
Since Reversi involves relatively simple processing in the MCTS, it may not be so slow in Python, but for games where computing legal moves is difficult, like chess or shogi, the speed may be quite different.

In my current coroutine implementation, I think there is a simple tuning point.
prediction_worker() is running in the same thread as the MCTS,
so even while it is predicting on the GPU, the other MCTS coroutines are also waiting.
If prediction_worker() were running in another thread, using something like run_coroutine_threadsafe(),
the MCTS speed would improve.

However, there are issues, like the virtual_loss behavior above.
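A rough sketch of that tuning idea (my own illustration with a stubbed NN, not a patch to the repo): the prediction worker runs on its own event loop in a background thread, scheduled with run_coroutine_threadsafe(), and the MCTS coroutines hand it requests through a thread-safe queue:

import asyncio
import concurrent.futures
import queue
import threading
import numpy as np

request_q = queue.Queue()  # thread-safe handoff between the two event loops

async def prediction_worker():
    # Runs on its own loop in a background thread; batches pending requests
    # and answers each with a (policy, value) pair. The NN call is stubbed.
    while True:
        await asyncio.sleep(0.001)
        batch = []
        while not request_q.empty():
            batch.append(request_q.get_nowait())
        if not batch:
            continue
        states = np.stack([s for s, _ in batch])                        # one batched call
        policies = np.random.dirichlet(np.ones(64), size=len(states))   # stub policy head
        values = np.random.uniform(-1, 1, size=len(states))             # stub value head
        for (_, fut), p, v in zip(batch, policies, values):
            fut.set_result((p, v))  # concurrent.futures.Future is thread-safe

async def one_simulation():
    # Stand-in for one MCTS simulation: request a prediction and await the answer.
    fut = concurrent.futures.Future()
    request_q.put((np.zeros((2, 8, 8), dtype=np.float32), fut))
    policy, value = await asyncio.wrap_future(fut)
    return value

async def run_simulations(n=100):
    return await asyncio.gather(*[one_simulation() for _ in range(n)])

if __name__ == "__main__":
    bg_loop = asyncio.new_event_loop()
    threading.Thread(target=bg_loop.run_forever, daemon=True).start()
    asyncio.run_coroutine_threadsafe(prediction_worker(), bg_loop)
    main_loop = asyncio.new_event_loop()
    asyncio.set_event_loop(main_loop)
    values = main_loop.run_until_complete(run_simulations())
    print(len(values), "simulations answered")

Note that a real TensorFlow/Keras predict() spends most of its time in native code that should release the GIL, which is what would let the search coroutines keep running while the worker thread is busy.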

it may be better to reduce the number of simulations by 10-fold and achieve 10x the iterations

I think that is a good idea worth verifying.

As in the discussion of issue-26, it may also be better to reduce the training data size.
I guess fresh data generated by the current model are important.

However, one thing to notice is that
the reason the model can learn from self-play data is that MCTS generates value and policy targets that are better than the model's own.
Too small a simulation number would make the performance improve slowly.

res_layer_num = 10

In my 'normal' configuration, that's right.

I guess being 2.4 times faster corresponds exactly to the speed advantage of cudnn, so my question is solved, assuming that when @mokemokechicken said it takes 100s for a single game, it was tested under the 'normal' setting.

@mokemokechicken Thanks for your detailed analysis. As you're aware, the forward time of a CNN scales oddly with batch size. When the minibatch size is 8 vs 16, the difference in required time is negligible, so the per-sample processing time is roughly twice as long at minibatch size 8. Of course, since your architecture's FLOPs are huge, you can't really have a very large minibatch. I'm not really familiar with virtual loss yet, but I will definitely try vl=30, as that's an impressive speedup! I'm aware of the importance of the hyperparameters in issue-26.
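For reference, this scaling is easy to measure directly; a rough timing sketch (my own toy benchmark with a small stand-in CNN, assuming a TensorFlow/Keras install):

import time
import numpy as np
from tensorflow import keras

# Small stand-in model; substitute the real network to measure actual scaling.
model = keras.Sequential([
    keras.layers.Conv2D(256, 3, padding="same", activation="relu", input_shape=(8, 8, 2)),
    keras.layers.Conv2D(256, 3, padding="same", activation="relu"),
    keras.layers.Flatten(),
    keras.layers.Dense(1, activation="tanh"),
])

for bs in (1, 8, 16, 64):
    x = np.random.rand(bs, 8, 8, 2).astype(np.float32)
    model.predict(x, verbose=0)                      # warm-up
    t0 = time.time()
    for _ in range(20):
        model.predict(x, verbose=0)
    dt = (time.time() - t0) / 20
    print("batch=%3d  %7.2f ms/call  %6.2f ms/sample" % (bs, dt * 1000, dt / bs * 1000))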

It seems that self-play games are completely sequential, in the sense that one game ends and then another game starts. In my opinion, it should be possible to parallelize this so that you can easily increase the minibatch size rather than naively calling multiple self's. The N, W and Q information collected in one game is not needed for other games, since states at later stages are almost never revisited due to the exponentially increasing number of possible configurations. Am I correct? Does increasing parallel_search_num, or parallelizing the self-play games for a larger minibatch in this sense, not really contribute to faster self-play? Maybe any further attempt to parallelize things in AZ is futile, since you've probably already achieved the efficiency attained by other large-scale groups like DeepMind and LeelaZ. So my question is out of curiosity.

In Chess, Shogi and Go, the thinking time per move allowed for a human is roughly several minutes, so the number of simulations per move a human performs may not be much larger than 100. A NN+MCTS with 25 sims/move performed on par with a classical MCTS with 500 sims/move in this paper, so this is another (somewhat tenuous) support for my claim, though the setting is quite different from AZ. If you look at Fig. 5 of the link, for example at iter=0.75e7 the sims/move=1 agent achieved what sims/move=500 achieved at iter=0.25e7, which makes the sims=1 agent 500/(0.75/0.25) ≈ 167 times faster up to this point. Of course, its performance soon plateaus, which can be avoided by increasing sims/move, which is the opposite of what I suggested before (decreasing sims/move progressively). I understand these are nothing but heuristic justifications. I hope slower but more numerous policy improvements by MCTS, with fewer sims/move and more iterations, will be a practical solution.