cmdevries/LMW-tree

Implement mini-batch stochastic gradient descent parallel streaming EM-tree

cmdevries opened this issue · 1 comments

With a large enough mini-batch size, say 1 million signatures, parallelism can be exploited within each mini-batch.
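As a rough illustration of the idea, here is a minimal sketch of one mini-batch step on a flat set of cluster centers. It is not the EM-tree code from this repository: it swaps in the standard mini-batch k-means update (per-center learning rate of 1/count) on real-valued vectors, and the names `nearest`, `minibatch_step` and `workers` are made up for the example. The assignment step is the part that parallelizes within the batch, since every point is compared against the same frozen copy of the centers.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean)."""
    best, best_dist = 0, float("inf")
    for i, c in enumerate(centers):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_dist:
            best, best_dist = i, d
    return best


def minibatch_step(batch, centers, counts, workers=4):
    """Apply one mini-batch update to centers and counts in place.

    Assignments are computed in parallel against a frozen copy of the
    centers; the per-center gradient step is then applied serially.
    """
    frozen = [list(c) for c in centers]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        assignments = list(pool.map(lambda p: nearest(p, frozen), batch))
    for point, k in zip(batch, assignments):
        counts[k] += 1
        rate = 1.0 / counts[k]  # per-center learning rate decays with use
        centers[k] = [(1.0 - rate) * c + rate * p
                      for c, p in zip(centers[k], point)]
    return assignments
```

Threads are used here only to keep the sketch self-contained. For CPU-bound work in Python a process pool or native code would be needed to see a real speedup.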

Hopefully this will converge in 1 iteration over the data or less for very large datasets like ClueWeb.

This also works in a distributed setting by simply batching and broadcasting updates to the tree as the parallel mini-batches proceed.
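One way the batching and broadcasting could look, sketched under heavy assumptions: each worker reduces its own mini-batch to per-center sums and counts, and a coordinator folds those into running totals and produces the centers to broadcast back. The functions `local_stats` and `merge_and_update`, the flat list of centers, and the `totals` layout are all hypothetical and stand in for the tree. No actual networking is shown.

```python
def local_stats(batch, centers):
    """Per-center vector sums and counts for one worker's mini-batch."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in batch:
        # Index of the closest center by squared Euclidean distance.
        k = min(
            range(len(centers)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
        )
        counts[k] += 1
        sums[k] = [s + p for s, p in zip(sums[k], point)]
    return sums, counts


def merge_and_update(centers, totals, worker_stats):
    """Fold worker statistics into totals and return centers to broadcast.

    totals is a list of [sum_vector, count] pairs, one per center, and is
    updated in place. Centers that have seen no points are left unchanged.
    """
    for sums, counts in worker_stats:
        for k, (s, n) in enumerate(zip(sums, counts)):
            totals[k][0] = [a + b for a, b in zip(totals[k][0], s)]
            totals[k][1] += n
    new_centers = []
    for k, (s, n) in enumerate(totals):
        new_centers.append([x / n for x in s] if n else list(centers[k]))
    return new_centers
```

Because sums and counts are additive, the order in which worker results arrive does not change the merged totals.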

Fast randomization of the input signature file is also useful.
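A minimal sketch of such a shuffle, assuming the signature file consists of fixed-size binary records and fits in memory. The function name `shuffle_records` and the `record_size` parameter are illustrative only, and a file at ClueWeb scale would need an external or block-wise shuffle instead.

```python
import random


def shuffle_records(data, record_size, seed=None):
    """Return data with its fixed-size records in a random order."""
    if record_size <= 0 or len(data) % record_size != 0:
        raise ValueError("data length must be a multiple of record_size")
    records = [data[i:i + record_size]
               for i in range(0, len(data), record_size)]
    random.Random(seed).shuffle(records)  # Fisher-Yates shuffle, O(n)
    return b"".join(records)
```

Passing a `seed` makes the order reproducible between runs, which helps when comparing mini-batch results.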

Done. With just 1 iteration over the data, this converges to a solution as good as the one batch streaming EM-tree reaches after 4 iterations. That is very close to the error achieved by batch streaming EM-tree at full convergence.

832ff9f