Data parallelization
ChristianFredrikJohnsen opened this issue · 1 comment
There is a lot of sequential logic in our code, so we are not utilizing the GPU to its full potential. Even when we do use the GPU, we are bottlenecked by the single-tensor memory transfers to the GPU that happen constantly during select. When this step runs thousands of times per move, the time spent on memory transfers adds up.
Ideally, we want the state tensors cached on the GPU so that we don't need to transfer memory as often, but this is surprisingly difficult to achieve.
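To make the cost concrete, here is a minimal sketch (assuming a PyTorch-style setup; the function and variable names are illustrative, not from this repo) contrasting the current per-call transfer pattern with keeping the tensors resident on the GPU:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Current pattern: every select step moves one tiny tensor host -> device,
# so the per-transfer overhead dominates the actual compute.
def evaluate_per_call(states_cpu, net):
    results = []
    for s in states_cpu:
        s_dev = s.to(device)                    # one small transfer per call
        with torch.no_grad():
            results.append(net(s_dev.unsqueeze(0)))
    return results

# Desired pattern: the state tensors were allocated on the device up front,
# so the loop does no host -> device copies at all.
def evaluate_cached(states_gpu, net):
    results = []
    for s in states_gpu:
        with torch.no_grad():
            results.append(net(s.unsqueeze(0)))
    return results
```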
TODO:
- Find out how to reduce the number of memory transfers per run.
- Batch processing is an idea borrowed from supervised learning, but it is not as easy to apply here; see the sketch below.
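For reference, batched evaluation would look roughly like the sketch below (again a PyTorch-style illustration; `evaluate_leaves_batched`, the shapes, and the two-headed network output are assumptions, not this repo's API). The hard part is gathering enough leaf states at once, since select is inherently sequential within a single tree descent.

```python
import torch

def evaluate_leaves_batched(leaf_states, net, device):
    """Evaluate many leaf states with a single forward pass instead of one call each."""
    batch = torch.stack(leaf_states).to(device)   # one host -> device transfer for the whole batch
    with torch.no_grad():
        policies, values = net(batch)             # one batched inference
    return policies.cpu(), values.cpu()
```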
Recently updated the node class and vectorized select. The code is much less sequential now, and select has gone down from roughly 250µs per hit to 50µs per hit when training on connect4.
Check commit 7477ba2
The parallelization happens on the CPU, though; the GPU is twice as slow, probably because the tensors are so tiny. We are talking about vectors with around 10 elements that you do a few multiplications and divisions on.
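For context, the vectorized select is roughly of this shape (a sketch assuming a PUCT-style formula; the actual statistics, constants, and node layout in commit 7477ba2 may differ):

```python
import torch

def select_child(visits, value_sums, priors, c_puct=1.5):
    """Score all children with one set of vector ops and return the best index."""
    visits = visits.float()
    q = value_sums / visits.clamp(min=1.0)                       # mean value per child
    u = c_puct * priors * visits.sum().sqrt() / (1.0 + visits)   # exploration term
    return int(torch.argmax(q + u))

# Example with ~10 children, matching the tiny tensor sizes mentioned above.
visits = torch.tensor([3.0, 1.0, 0.0, 5.0, 2.0, 0.0, 1.0, 4.0, 0.0, 2.0])
value_sums = torch.randn(10)
priors = torch.full((10,), 0.1)
best = select_child(visits, value_sums, priors)
```

At this size, kernel-launch and transfer overhead on the GPU easily outweighs the parallel arithmetic, which fits the observation that the CPU version is faster.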