NatLibFi/Annif

Better support for suggestion batches in NN ensemble

osma opened this issue · 0 comments

osma commented

As noted in PR #681 ("Potential future work"), the way NN ensemble handles batches could be improved:

"I'm not quite happy with how the NN ensemble handles suggestion results from other projects, both during training and suggest operations. For example, the training samples are stored in LMDB one document at a time, but now it would be easier to store them as whole batches instead, which could be more efficient. But I decided that this PR is already much too big and it would make sense to try to improve batching in the NN ensemble in a separate follow-up PR. There is already an attempt to do part of this in PR #676; that could be a possible starting point."

In particular:

  • training documents could be processed by using batch operations on source projects; there was an attempt to do this in PR #676
  • training data is currently stored in LMDB one document at a time; it would make sense to store it in batches instead (and perhaps use another data storage mechanism, e.g. TF Data / Dataset)
  • _merge_source_batches could perform calculations using sparse arrays and only convert to NumPy arrays at the end (and transpose if necessary)
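To make the last point concrete, here is a minimal sketch of what "stay sparse, densify at the end" could look like. It is not the current Annif code: the function name `merge_source_batches_sparse`, the input layout (one SciPy CSR matrix per source, shaped documents × subjects), the square-root and weight scaling, and the output axis order are all assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


def merge_source_batches_sparse(batches, weights):
    """Hypothetical stand-in for _merge_source_batches.

    batches: list of csr_matrix, one per source, each (n_docs, n_subjects)
    weights: list of float, one weight per source
    Returns a dense float32 array of shape (n_docs, n_subjects, n_sources).
    """
    total = sum(weights)
    n_sources = len(batches)
    scaled = []
    for matrix, weight in zip(batches, weights):
        # Element-wise sqrt and scalar multiplication map zero to zero,
        # so both operations keep the matrix sparse.
        scaled.append(matrix.sqrt() * (weight * n_sources / total))
    # Densify only at the very end: stack to (n_sources, n_docs, n_subjects),
    # then move the source axis last with a transpose.
    dense = np.stack([m.toarray() for m in scaled])
    return dense.transpose(1, 2, 0).astype(np.float32)


# Usage: two sources, two documents, two subjects (made-up scores).
source_a = csr_matrix(np.array([[0.0, 4.0], [1.0, 0.0]]))
source_b = csr_matrix(np.array([[9.0, 0.0], [0.0, 0.0]]))
merged = merge_source_batches_sparse([source_a, source_b], [1.0, 1.0])
print(merged.shape)
```

Whether this is actually faster than working with dense arrays throughout depends on how sparse the suggestion batches are in practice, so it would need measuring.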

Of course, the changes need to be properly benchmarked.
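As a rough starting point for such a benchmark, the sketch below compares writing one record per document against one record per batch. Everything here is an assumption for illustration: a plain `dict` stands in for the LMDB database, `pickle` stands in for whatever serialization is used, and the helper names are hypothetical. Real measurements would have to be taken against LMDB (or the alternative storage) with realistic training data.

```python
import pickle
import timeit


def store_per_document(store, docs):
    """Write one key-value pair per document."""
    for i, doc in enumerate(docs):
        store[b"%08d" % i] = pickle.dumps(doc)


def store_per_batch(store, docs, batch_size):
    """Write one key-value pair per batch of documents."""
    for batch_no, start in enumerate(range(0, len(docs), batch_size)):
        batch = docs[start:start + batch_size]
        store[b"%08d" % batch_no] = pickle.dumps(batch)


def load_all_documents(store):
    """Read back a per-document store in key order."""
    return [pickle.loads(store[key]) for key in sorted(store)]


def load_all_batches(store):
    """Read back a per-batch store in key order, flattening the batches."""
    docs = []
    for key in sorted(store):
        docs.extend(pickle.loads(store[key]))
    return docs


def time_layouts(docs, batch_size, repeats=5):
    """Return total write time in seconds for each layout."""
    return {
        "per_document": timeit.timeit(
            lambda: store_per_document({}, docs), number=repeats),
        "per_batch": timeit.timeit(
            lambda: store_per_batch({}, docs, batch_size), number=repeats),
    }


# Usage: 2000 synthetic (inputs, targets) pairs, batches of 32.
sample_docs = [([0.1 * j for j in range(50)], [j % 2 for j in range(50)])
               for _ in range(2000)]
for layout, seconds in time_layouts(sample_docs, batch_size=32).items():
    print(f"{layout}: {seconds:.4f} s")
```

The round-trip helpers are there so that a benchmark can also check that both layouts return identical training data, not just compare timings.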