AlignmentResearch/KataGoVisualizer

More precise compute estimates


From lightvector:

A. Recall from https://arxiv.org/abs/1902.10565 that 75% of the positions in a game are never saved as training data. On those, the cheap visit limit is used. So you're going to be substantially underestimating the cost by not counting compute on positions that aren't saved, even though the visit limit on those is low.
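To make A concrete, here is a minimal sketch (in Python, in the spirit of the notebook) of folding the unsaved cheap-visit positions into a per-game visit count. The 25%/75% split is from the paper; the move count and visit limits in the example are illustrative placeholders, and caching (E) and the extreme-winrate cheap searches (D) are ignored here.

```python
# Rough per-game search cost, splitting moves into "full" rows (saved as
# training data) and "cheap" rows (never saved). All numbers are illustrative.
FRAC_FULL = 0.25   # fraction of positions searched at the full visit limit
FRAC_CHEAP = 0.75  # fraction searched at the cheap visit limit

def game_visits(num_moves, full_visits, cheap_visits):
    """Expected total visits in one self-play game, ignoring caching (E)
    and the extreme-winrate cheap searches (D)."""
    per_move = FRAC_FULL * full_visits + FRAC_CHEAP * cheap_visits
    return num_moves * per_move

# e.g. a 250-move game under the 1500/250 regime of distributed training:
print(game_visits(250, full_visits=1500, cheap_visits=250))  # ~140,625 visits
```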

B. There's a thing where, stochastically, some positions are saved out more than once and some not at all (policy and value surprise weighting), but the computation for doing that should preserve the data-row count d in expectation, so I think this is neutral for your purposes, causing you neither to over- nor under-estimate.

C. KataGo used 600 full / 100 cheap visits for roughly the first 1-2 days of training (roughly up through b10c128 and maybe between 1/4 and 1/2 of b15c192), 1000 full / 200 cheap for the rest of g170 (i.e. all the kata1 models imported from the former run g170, which was done on private hardware alone before that run became the prefix for the current distributed run kata1), and then 1500 full / 250 cheap for all of distributed training so far. So you'll need to use the appropriate visit cutoffs for each model range.
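A small hedged sketch of how those regimes might be encoded. The cutoffs `early_cutoff_d` and `g170_end_d` are hypothetical parameters that would have to be read off the actual run history; only the three (full, cheap) regimes themselves come from the comment above.

```python
def visit_limits(model_d, early_cutoff_d, g170_end_d):
    """(full, cheap) visit limits in effect at cumulative data-row count model_d.

    early_cutoff_d / g170_end_d are the run-specific boundaries (to be read
    from the g170 history), passed in rather than hard-coded here.
    """
    if model_d < early_cutoff_d:
        return 600, 100    # roughly the first 1-2 days of g170
    if model_d < g170_end_d:
        return 1000, 200   # remainder of g170
    return 1500, 250       # distributed training (kata1 run)
```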

D. The cheap search limit is also used even for rows that are saved, once the winrate is sufficiently extreme, to save a bit on compute when playing out long endgames. However, the probability of writing the row decreases too, so I think this somewhat cancels out, but not entirely.

E. There's a neural net cache that reuses old queries, which is used if on a future turn you visit the same node that you already searched on the previous turn, or if multiple move sequences in a search lead to the same position. I think this typically saves somewhere between 20% and 50% of the cost of a search relative to a naive estimate based on the number of visits. So that means you're overestimating here.
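One way to fold E into an estimate is to scale the naive visit-based cost by an assumed cache saving, bracketing it with the 20-50% range above. This is only a sketch: `cached_cost` is a hypothetical helper, and the input number reuses the illustrative per-game visit count from the sketch under A.

```python
# Scale a naive visit-based estimate down by an assumed NN-cache saving.
# The 0.2-0.5 range is the saving quoted above; any specific value would
# have to be measured empirically (see the bullet on measuring A-E below).
def cached_cost(naive_visit_cost, cache_saving):
    assert 0.0 <= cache_saving < 1.0
    return naive_visit_cost * (1.0 - cache_saving)

low, high = cached_cost(140_625, 0.5), cached_cost(140_625, 0.2)
print(f"effective net evals per game: {low:,.0f} to {high:,.0f}")  # ~70,312 to 112,500
```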

F. A lot of KataGo's games are on board sizes smaller than 19x19. One could save some cost by using smaller tensors in those cases, but in practice that optimization wasn't implemented because of batching. So this is more just a note about where theoretical flops diverge significantly from practical flops due to practical engineering considerations. In the future I might implement this optimization in a way that still works dynamically with batching.

G. I think your notebook is greatly underestimating cost due to missing a ton of the networks from g170. They were only sparsely copied over to katagotraining.org, since there was no value in having all of them. You'll need to get the full list from https://katagoarchive.org/

H. You're both overcounting and undercounting a little in your notebook at different points, because you're not accounting for the fact that sometimes more than one model is being trained jointly on the same data. I.e. one model generates the data, but both models train on it. So naively sorting the models by d and multiplying size by the change in d won't work; you need to filter out the models that are not generating the data at a given time.
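A sketch of what the corrected Δd accounting could look like, assuming each model record carries its cumulative data-row count `d`, a per-row generation cost `flops_per_row` (proportional to net size and visit limits), and a `generating` flag produced by the filtering described here. All of these field names are hypothetical.

```python
# Sketch of the delta-d accounting with the non-generating models filtered out.
# `models` is assumed to be a list of dicts like
#   {"name": ..., "d": cumulative_data_rows, "flops_per_row": ..., "generating": bool}
def selfplay_cost(models):
    gen = sorted((m for m in models if m["generating"]), key=lambda m: m["d"])
    total, prev_d = 0.0, 0
    for m in gen:
        delta_d = m["d"] - prev_d          # rows generated while this net was active (approx.)
        total += delta_d * m["flops_per_row"]
        prev_d = m["d"]
    return total
```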


  • For empirically measuring A-E, I think one representative net of each size midway through its usage is probably okay. There's a possibility that nets of different sizes, or at different overall points in the run, differ in how fast the game is decided (thereby affecting the % of the game played at reduced visits due to extreme winrate) and in how much caching there is (in general, the sharper the policy, the more caching). Both effects are going to be nonlinear in the number of visits, which is why you'd want more than one measurement from different points in the run, using 600/100 or 1000/200 or 1500/250 as appropriate.

  • G and H you can handle simply by running your notebook over the full g170 model list and filtering models properly. The active net producing the data is always the one that's densest on that interval (see the sketch after this list). Also, there's a long interval in g170 where b40c256 and b30c320 were both active; whichever net was the latest was the one that was running.
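One possible reading of the "densest on that interval" rule, sketched under the assumption that it means "the architecture with the most checkpoints in a given window of data-row counts". The overlapping b40c256/b30c320 stretch would still need the "whichever was latest" tie-break mentioned above; `checkpoints` and `window` are hypothetical inputs.

```python
from collections import Counter

def generating_arch_per_window(checkpoints, window):
    """Map each d-window start to the architecture with the most checkpoints in it.

    `checkpoints` is assumed to be a list of (arch_name, d) pairs covering the
    full g170 model list; the densest architecture in a window is taken to be
    the one that was generating the data there.
    """
    buckets = {}
    for arch, d in checkpoints:
        buckets.setdefault(d // window * window, []).append(arch)
    return {start: Counter(archs).most_common(1)[0][0]
            for start, archs in sorted(buckets.items())}
```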