NatLibFi/Annif

Refactor: use sparse matrices instead of ListSuggestionResult and VectorSuggestionResult

osma opened this issue · 0 comments

osma commented

Currently the annif.suggestion module defines two main classes for representing suggestion results, ListSuggestionResult and VectorSuggestionResult. Backends generally use either of these (usually not both) when returning results.

But now that we have started implementing batched suggestions in backends, these representations have become a bit awkward, as they are specific to one document and perhaps also needlessly complicated (often suggestions have to be converted back and forth between the two representations).

I think it would make sense to try to replace both of these classes with a single class, maybe called SuggestionBatch, that can represent the suggestion results for a whole batch of documents (up to 32). It could be implemented using a SciPy sparse matrix or perhaps a sparse array, as arrays are nowadays recommended by SciPy.

This is a rather intrusive refactoring:

  • the new SuggestionBatch class must be implemented (replacing the old classes)
  • all backends have to be modified to return suggestions in the new format
  • annif.eval (in particular EvaluationBatch) must be modified to process suggestions in the new format
  • AnnifProject.suggest (and related methods) are affected
  • annif.cli.run_suggest and run_index functions are affected
  • annif.rest.suggest is affected too (and the batched variant if implemented - see #664)
  • many unit tests must be modified to match the new data structures

Regarding how to tackle this: I think it would make sense to implement this first on the "outside", that is, changing the return type of AnnifProject.suggest to the new representation, since it already returns a batch, though currently it's a list of {List,Vector}SuggestionResult objects. The eval, CLI and REST code that relies on AnnifProject.suggest must of course be changed too. After this is working, the last step is to change the backends to return the new representation.