NatLibFi/Annif

Batch suggest operation

osma opened this issue · 3 comments

osma commented

Currently all the suggest methods (CLI command, REST API method, project and backend methods) always take just one document at a time. This is inefficient for backends that could process many documents in parallel.

We should introduce a batch version of suggest (called e.g. suggest_batch or suggest_many unless someone has a better idea?) for each of these contexts. Individual backends can then choose to implement it when it gives a performance boost; otherwise, the batch is simply passed to the regular suggest method one document at a time. I believe that at least NN ensemble, SVC, fastText and MLLM backends could benefit from parallel suggest operations. Also, this would be very useful for the proposed XTransformer backend.
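
To make the fallback idea concrete, here is a rough sketch. The class names and the return type are made up for illustration and are not Annif's actual backend API; the only point is that `suggest_batch` has a default that loops over `suggest`, and a backend overrides it only when it can do better.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple

Suggestions = List[Tuple[str, float]]  # hypothetical (subject, score) pairs


class Backend(ABC):
    """Hypothetical base class, for illustration only."""

    @abstractmethod
    def suggest(self, text: str) -> Suggestions:
        """Return suggestions for a single document."""

    def suggest_batch(self, texts: List[str]) -> List[Suggestions]:
        # Default fallback: pass the batch to suggest() one document at a time.
        return [self.suggest(text) for text in texts]


class WordCountBackend(Backend):
    """Toy backend that only implements suggest() and inherits the fallback."""

    def suggest(self, text: str) -> Suggestions:
        return [("words", float(len(text.split())))]


class VectorizedBackend(Backend):
    """Toy backend that overrides suggest_batch()."""

    def suggest(self, text: str) -> Suggestions:
        # The single-document case is just a batch of one.
        return self.suggest_batch([text])[0]

    def suggest_batch(self, texts: List[str]) -> List[Suggestions]:
        # A real backend would make one vectorized model call here.
        counts = [len(text.split()) for text in texts]
        return [[("words", float(n))] for n in counts]
```

Callers such as eval would then always call `suggest_batch` and not need to know which kind of backend they are dealing with.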

A note on scope: This issue is about implementing the scaffolding necessary for batching suggest operations, as well as using them in at least some (not necessarily all) operations that would benefit from it: e.g. eval, hyperopt, optimize, index. Changes to individual backends are out of scope but separate issues for them should be opened after this basic scaffolding is in place.

I played around with this a bit, and some questions arose.

CLI usage

  • If a new suggest-batch CLI command is added, where should it load the documents from? From a directory of text documents, or from arbitrary paths, perhaps including already indexed TSV files?
  • Alternatively, maybe the existing suggest CLI command could be switched to batched processing when it is given paths to documents instead of input on stdin?
  • Where should the suggestions be written when suggest-batch (or suggest) is used for multiple documents?
    • Opt 1: To <doc-filename>.annif files, as the index command does; but this just duplicates the function of the index command (and seems reasonable only for directory input)
    • Opt 2: To stdout like the current suggest does, separating documents by first showing the document name and then on the following lines the subject suggestions for the document, e.g.:
    tests/corpora/archaeology/fulltext/440866.txt
    <http://www.yso.fi/onto/yso/p6218>	riimukirjoitus	0.3213897943496704
    <http://www.yso.fi/onto/yso/p6479>	viikingit	0.18659920990467072
    <http://www.yso.fi/onto/yso/p12738>	viikinkiaika	0.18625082075595856
    <http://www.yso.fi/onto/yso/p22768>	Kiinan muuri	0.15950888395309448
    <http://www.yso.fi/onto/yso/p3973>	antiikki	0.13840530812740326
    <http://www.yso.fi/onto/yso/p14588>	riimukivet	0.1362432837486267
    <http://www.yso.fi/onto/yso/p14173>	kaivaukset	0.1201547235250473
    <http://www.yso.fi/onto/yso/p5713>	hautalöydöt	0.11249098181724548
    <http://www.yso.fi/onto/yso/p15031>	viikinkiretket	0.11039584875106812
    <http://www.yso.fi/onto/yso/p5714>	muinaishaudat	0.10336380451917648
    tests/corpora/archaeology/fulltext/441563.txt
    <http://www.yso.fi/onto/yso/p4625>	pronssikausi	0.33119136095046997
    <http://www.yso.fi/onto/yso/p4622>	esihistoria	0.2926081418991089
    <http://www.yso.fi/onto/yso/p1265>	arkeologia	0.24922890961170197
    <http://www.yso.fi/onto/yso/p20096>	kansainvaellusaika	0.23529952764511108
    <http://www.yso.fi/onto/yso/p9285>	neoliittinen kausi	0.23072052001953125
    <http://www.yso.fi/onto/yso/p2558>	rautakausi	0.2238517701625824
    <http://www.yso.fi/onto/yso/p4626>	varhaismetallikausi	0.2232591211795807
    <http://www.yso.fi/onto/yso/p10849>	arkeologit	0.2182117998600006
    <http://www.yso.fi/onto/yso/p7751>	kampakeraaminen kulttuuri	0.21752358973026276
    <http://www.yso.fi/onto/yso/p14173>	kaivaukset	0.21643799543380737
    
  • The other CLI commands (eval, hyperopt, optimize, index) are intended to operate on multiple documents so the output question does not concern them

REST usage:

  • The best way to pass in the documents seems to be application/json encoding, as in the learn method, but then how should the parameters (language, limit, threshold) be passed? I don't see a way to pass them the same way as for the suggest method (which uses application/x-www-form-urlencoded). Maybe put the parameters as an object in the JSON together with the documents:
      [
        {
          "parameters": [
            {
              "language": "string",
              "limit": 10,
              "threshold": 0
            }
          ]
        },
        {
          "documents": [
            {
              "text": "A quick brown fox jumped over the lazy dog."
            }
          ]
        }
      ]
    
osma commented

CLI

If a new suggest-batch CLI command is added, from where should the documents be loadable when using it? From a directory (of text documents), or any paths, maybe to already indexed TSV files?
Alternatively, maybe the existing suggest CLI command could be turned to use batched processing if it is given a path to documents instead of stdin feed?

My hunch would be to try to extend the current suggest CLI command instead of defining a new suggest-batch command. The current suggest command expects input from stdin; maybe we could change it so it works more like the cat command and other similar *nix tools, i.e. it could take one or more filenames as a parameter, but fall back to stdin if no file names are given. So you could do e.g.

annif suggest yso-tfidf-en <document.txt              # just like before
annif suggest yso-tfidf-en document.txt               # the same, but from a named file
annif suggest yso-tfidf-en doc1.txt doc2.txt doc3.txt # many files
annif suggest yso-tfidf-en doc*.txt                   # similar to above, but using shell expansion
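
The cat-like input handling could look roughly like this. `read_documents` is a hypothetical helper, not an existing Annif function, and using "-" as the name for stdin input is just an assumption borrowed from common *nix conventions.

```python
import sys
from typing import Iterator, Optional, Sequence, TextIO, Tuple


def read_documents(
    paths: Sequence[str], stdin: Optional[TextIO] = None
) -> Iterator[Tuple[str, str]]:
    """Yield (name, text) pairs, falling back to stdin when no paths are given."""
    if not paths:
        stream = sys.stdin if stdin is None else stdin
        yield "-", stream.read()
        return
    for path in paths:
        with open(path, encoding="utf-8") as docfile:
            yield path, docfile.read()
```

Shell expansion (doc*.txt) needs no special handling, since the shell passes the expanded file names as separate arguments.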

Opt 2: To stdout like the current suggest does, separating documents by first showing the document name and then on the following lines the subject suggestions for the document, e.g.:

I think this is the way to go. For easier grepping etc., I would perhaps add some kind of extra tag in addition to the filename, something like:

Suggestions for tests/corpora/archaeology/fulltext/440866.txt
<http://www.yso.fi/onto/yso/p6218>	riimukirjoitus	0.3213897943496704
<http://www.yso.fi/onto/yso/p6479>	viikingit	0.18659920990467072
<http://www.yso.fi/onto/yso/p12738>	viikinkiaika	0.18625082075595856
<http://www.yso.fi/onto/yso/p22768>	Kiinan muuri	0.15950888395309448
<http://www.yso.fi/onto/yso/p3973>	antiikki	0.13840530812740326
<http://www.yso.fi/onto/yso/p14588>	riimukivet	0.1362432837486267
<http://www.yso.fi/onto/yso/p14173>	kaivaukset	0.1201547235250473
<http://www.yso.fi/onto/yso/p5713>	hautalöydöt	0.11249098181724548
<http://www.yso.fi/onto/yso/p15031>	viikinkiretket	0.11039584875106812
<http://www.yso.fi/onto/yso/p5714>	muinaishaudat	0.10336380451917648

I think it would be logical to use this output format whenever named files are used (instead of stdin), even if there is just one file.
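
As a sketch of that rule, assuming suggestions arrive as (URI, label, score) tuples (an assumption made here for illustration, not necessarily how Annif represents them internally), the formatting could be as simple as:

```python
from typing import Iterable, Tuple


def format_suggestions(
    name: str,
    suggestions: Iterable[Tuple[str, str, float]],
    header: bool = True,
) -> str:
    """Render one document's suggestions as tab-separated lines.

    The "Suggestions for ..." header line is written only when header is True,
    i.e. when the input came from a named file rather than stdin.
    """
    lines = []
    if header:
        lines.append(f"Suggestions for {name}")
    for uri, label, score in suggestions:
        lines.append(f"<{uri}>\t{label}\t{score}")
    return "\n".join(lines)
```

With this, `grep -v '^Suggestions for '` would strip the headers and leave output in the same shape as the current single-document suggest.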

The other CLI commands (eval, hyperopt, optimize, index) are intended to operate on multiple documents so the output question does not concern them

Yes, and I think this is where we could expect the most benefits. For example, eval could potentially be much faster with some backends if it can use batched processing internally, even though nothing changes from the user's perspective: the command itself and its output remain the same.

REST

The best way to pass in the documents seems to use application/json encoding

I agree that JSON encoding seems like a good choice here, but there are other options that perhaps shouldn't be dismissed outright:

  1. It's possible to use old-fashioned application/x-www-form-urlencoded encoding, like the current suggest methods. There could be a field text or texts that is defined as an array in the OpenAPI spec (see Describing Request Body, section "Form Data"). In practice, this would mean that the field is repeated in the encoded body, like this: limit=10&threshold=0.2&text=doc1&text=doc2&text=doc3 (here doc1, doc2 and doc3 are placeholders for document text). For me, the main attraction of this would be that it may allow extending the current suggest method without defining a new suggest-batch method; though I suspect that the return data format would have to be different anyway, so maybe it would just create confusion.
  2. It's also possible to use multipart requests where each document is a separate part, although I don't think we want to go there.
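
For reference, the repeated-field encoding from option 1 is what the Python standard library produces and parses out of the box. This only illustrates the wire format; the field name text is the placeholder from above, not a decided API.

```python
from urllib.parse import parse_qs, urlencode

# doc1, doc2 and doc3 stand in for document texts.
fields = {"limit": 10, "threshold": 0.2, "text": ["doc1", "doc2", "doc3"]}

# doseq=True emits one key=value pair per list element.
body = urlencode(fields, doseq=True)
print(body)  # limit=10&threshold=0.2&text=doc1&text=doc2&text=doc3

# On the receiving side, repeated fields come back as a list (all values as strings).
decoded = parse_qs(body)
print(decoded["text"])  # ['doc1', 'doc2', 'doc3']
```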

If we go for the JSON encoding, the parameters could also be given as URL parameters: POST /projects/yso-tfidf-en/suggest?limit=10&threshold=0.2, though I'm unsure if this is any better than just passing them in the JSON.

A little nitpick: why did you use an array here? I think a single object would be enough.

      "parameters": [
        {
          "language": "string",
          "limit": 10,
          "threshold": 0
        }
      ]
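
In other words, something like the body built below, with a single top-level object holding one parameters object and one documents array. The helper name and the default values are hypothetical, and this is only one possible layout under discussion, not a description of what ended up in the API.

```python
import json
from typing import List, Optional


def build_batch_body(
    texts: List[str],
    language: Optional[str] = None,
    limit: int = 10,
    threshold: float = 0.0,
) -> str:
    """Serialize a batch request with a single parameters object (hypothetical layout)."""
    parameters = {"limit": limit, "threshold": threshold}
    if language is not None:
        parameters["language"] = language
    body = {
        "parameters": parameters,
        "documents": [{"text": text} for text in texts],
    }
    return json.dumps(body)
```

A single object also means clients can look up body["parameters"]["limit"] directly instead of indexing into one-element arrays.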

FWIW, I also checked the Maui Server API, but it doesn't have a batched version of suggest that we could copy.

The functionality this issue addressed was implemented by PRs #663 and #664.

Issues for implementing the batch functionality in individual backends have been opened and some of them have already been closed.