Implement mini-batch stochastic gradient descent parallel streaming EM-tree
cmdevries opened this issue · 1 comments
cmdevries commented
With a large enough mini-batch, say 1 million signatures, the work within each mini-batch can be parallelised.
Hopefully this will converge in one iteration (a single pass over the data) or less for very large datasets like ClueWeb.
This also works in a distributed setting by simply batching and broadcasting updates to the tree as the parallel mini-batches proceed.
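A minimal sketch of the mini-batch update described above, under some loud assumptions: it uses a flat set of real-valued centers as a stand-in for one level of the tree (not the actual EM-tree code or its bit-signature representation), and the names `nearest`, `partial_stats` and `minibatch_step` are hypothetical. Each worker computes per-center sums and counts for its chunk of the mini-batch, and the merged statistics are applied as one update; in a distributed setting those same partial statistics are what would be batched and broadcast. Python threads are used only to show the structure, since a real implementation would need native threads or processes to get actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def partial_stats(chunk, centers):
    """Per-center sums and counts for one chunk of the mini-batch."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in chunk:
        k = nearest(point, centers)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    return sums, counts


def minibatch_step(batch, centers, totals, workers=4):
    """Apply one mini-batch update in place.

    `totals[k]` is the number of points assigned to center k so far
    (hypothetical bookkeeping, used to derive a per-center learning rate).
    """
    # Split the mini-batch into one chunk per worker.
    chunks = [batch[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: partial_stats(c, centers), chunks))

    # Merge the partial statistics and move each center toward its batch mean.
    for k in range(len(centers)):
        count = sum(r[1][k] for r in results)
        if count == 0:
            continue
        totals[k] += count
        rate = count / totals[k]  # decays as the center sees more points
        for d in range(len(centers[k])):
            batch_mean = sum(r[0][k][d] for r in results) / count
            centers[k][d] += rate * (batch_mean - centers[k][d])
    return centers
```

With this choice of learning rate each center is the running mean of every point assigned to it so far, so a single pass over the data already uses all of it.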
Fast randomization of the input signature file is also useful, since the mini-batches should be drawn in random order.
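One possible way to randomize the input, sketched under the assumption that the signature file is a flat sequence of fixed-size binary records (the function name `shuffled_records` and the `record_bytes` parameter are hypothetical). It shuffles record indices in memory and seeks to each record, so nothing is rewritten on disk; on very large files or spinning disks the random seeks may be slow, and a chunked shuffle could be a better fit.

```python
import os
import random


def shuffled_records(path, record_bytes, seed=0):
    """Yield fixed-size records from `path` in a pseudo-random order."""
    with open(path, "rb") as f:
        # Number of whole records in the file.
        count = os.fstat(f.fileno()).st_size // record_bytes
        order = list(range(count))
        random.Random(seed).shuffle(order)
        for index in order:
            f.seek(index * record_bytes)
            yield f.read(record_bytes)
```

Changing `seed` gives a different order, so each pass over the data can see the records in a different sequence.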