Batch suggest operation
osma opened this issue · 3 comments
Currently all the `suggest` methods (CLI command, REST API method, project and backend methods) always take just one document at a time. This is inefficient for backends that could process many documents in parallel.
We should introduce a batch version of `suggest` (called e.g. `suggest_batch` or `suggest_many` unless someone has a better idea?) for each of these contexts. Individual backends can then choose to implement it when it gives a performance boost; otherwise, the batch is simply passed to the regular `suggest` method one document at a time. I believe that at least NN ensemble, SVC, fastText and MLLM backends could benefit from parallel suggest operations. Also, this would be very useful for the proposed XTransformer backend.
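To make the fallback idea concrete, here is a minimal sketch. The class names, the `Suggestion` tuple and the method signatures are assumptions made up for illustration, not Annif's actual backend API; only the name `suggest_batch` comes from the proposal above.

```python
from typing import List, Sequence, Tuple

# Simplified stand-in for a suggestion: (subject URI, score).
Suggestion = Tuple[str, float]


class BackendSketch:
    """Hypothetical base class, not Annif's real backend class."""

    def suggest(self, text: str) -> List[Suggestion]:
        raise NotImplementedError

    def suggest_batch(self, texts: Sequence[str]) -> List[List[Suggestion]]:
        # Default fallback: hand the batch to suggest() one document at a time.
        return [self.suggest(text) for text in texts]


class SingleOnlyBackend(BackendSketch):
    """Toy backend that implements only suggest() and inherits the fallback."""

    def suggest(self, text: str) -> List[Suggestion]:
        return [("http://example.org/length", float(len(text)))]


class BatchAwareBackend(BackendSketch):
    """Toy backend that overrides suggest_batch() to process all texts in one call."""

    def suggest_batch(self, texts: Sequence[str]) -> List[List[Suggestion]]:
        # A real backend would vectorize the whole batch here in one go.
        lengths = [float(len(text)) for text in texts]
        return [[("http://example.org/length", value)] for value in lengths]

    def suggest(self, text: str) -> List[Suggestion]:
        # The single-document call becomes a batch of one.
        return self.suggest_batch([text])[0]
```

With this shape, callers can always use `suggest_batch` and get identical results from both kinds of backend; only the speed would differ.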
A note on scope: This issue is about implementing the scaffolding necessary for batching suggest operations, as well as using them in at least some (not necessarily all) operations that would benefit from it: e.g. eval, hyperopt, optimize, index. Changes to individual backends are out of scope but separate issues for them should be opened after this basic scaffolding is in place.
I played around with this a bit, and some questions arose.
CLI usage
- If a new `suggest-batch` CLI command is added, from where should the documents be loadable when using it? From a directory (of text documents), or any paths, maybe to already indexed TSV files?
- Alternatively, maybe the existing `suggest` CLI command could be turned to use batched processing if it is given a path to documents instead of stdin feed?
- Where should the suggestions be output when using `suggest-batch` (or `suggest`) for multiple documents?
  - Opt 1: To `<doc-filename>.annif` files similarly as the `index` command does, but then this just duplicates the `index` command function (and seems reasonable only for the directory input)
  - Opt 2: To stdout like the current `suggest` does, separating documents by first showing the document name and then on the following lines the subject suggestions for the document, e.g.:

        tests/corpora/archaeology/fulltext/440866.txt
        <http://www.yso.fi/onto/yso/p6218> riimukirjoitus 0.3213897943496704
        <http://www.yso.fi/onto/yso/p6479> viikingit 0.18659920990467072
        <http://www.yso.fi/onto/yso/p12738> viikinkiaika 0.18625082075595856
        <http://www.yso.fi/onto/yso/p22768> Kiinan muuri 0.15950888395309448
        <http://www.yso.fi/onto/yso/p3973> antiikki 0.13840530812740326
        <http://www.yso.fi/onto/yso/p14588> riimukivet 0.1362432837486267
        <http://www.yso.fi/onto/yso/p14173> kaivaukset 0.1201547235250473
        <http://www.yso.fi/onto/yso/p5713> hautalöydöt 0.11249098181724548
        <http://www.yso.fi/onto/yso/p15031> viikinkiretket 0.11039584875106812
        <http://www.yso.fi/onto/yso/p5714> muinaishaudat 0.10336380451917648
        tests/corpora/archaeology/fulltext/441563.txt
        <http://www.yso.fi/onto/yso/p4625> pronssikausi 0.33119136095046997
        <http://www.yso.fi/onto/yso/p4622> esihistoria 0.2926081418991089
        <http://www.yso.fi/onto/yso/p1265> arkeologia 0.24922890961170197
        <http://www.yso.fi/onto/yso/p20096> kansainvaellusaika 0.23529952764511108
        <http://www.yso.fi/onto/yso/p9285> neoliittinen kausi 0.23072052001953125
        <http://www.yso.fi/onto/yso/p2558> rautakausi 0.2238517701625824
        <http://www.yso.fi/onto/yso/p4626> varhaismetallikausi 0.2232591211795807
        <http://www.yso.fi/onto/yso/p10849> arkeologit 0.2182117998600006
        <http://www.yso.fi/onto/yso/p7751> kampakeraaminen kulttuuri 0.21752358973026276
        <http://www.yso.fi/onto/yso/p14173> kaivaukset 0.21643799543380737
- The other CLI commands (eval, hyperopt, optimize, index) are intended to operate on multiple documents so the output question does not concern them
REST usage:
- The best way to pass in the documents seems to be `application/json` encoding as in the `learn` method, but then how to pass parameters (language, limit, threshold)? I don't see a way to pass them the same way as for the `suggest` method (which uses `application/x-www-form-urlencoded`). Maybe put the parameters as an object in the JSON together with the documents:

      [
        {
          "parameters": [
            {
              "language": "string",
              "limit": 10,
              "threshold": 0
            }
          ]
        },
        {
          "documents": [
            {
              "text": "A quick brown fox jumped over the lazy dog."
            }
          ]
        }
      ]
CLI
> If a new `suggest-batch` CLI command is added, from where should the documents be loadable when using it? From a directory (of text documents), or any paths, maybe to already indexed TSV files?
> Alternatively, maybe the existing `suggest` CLI command could be turned to use batched processing if it is given a path to documents instead of stdin feed?
My hunch would be to try to extend the current `suggest` CLI command instead of defining a new `suggest-batch` command. The current `suggest` command expects input from stdin; maybe we could change it so it works more like the `cat` command and other similar *nix tools, i.e. it could take one or more filenames as a parameter, but fall back to stdin if no file names are given. So you could do e.g.
annif suggest yso-tfidf-en <document.txt # just like before
annif suggest yso-tfidf-en document.txt # the same, but from a named file
annif suggest yso-tfidf-en doc1.txt doc2.txt doc3.txt # many files
annif suggest yso-tfidf-en doc*.txt # similar to above, but using shell expansion
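The `cat`-like input handling could be sketched roughly as below. The helper name `iter_documents` and its signature are hypothetical; this is just plain Python to show the fallback logic, not how the Annif CLI is actually wired.

```python
import sys
from typing import Iterator, Optional, Sequence, TextIO, Tuple


def iter_documents(
    paths: Sequence[str], stdin: Optional[TextIO] = None
) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (name, text) pairs; name is None when the text came from stdin."""
    if not paths:
        # No file names given: fall back to stdin, just like before.
        stream = stdin if stdin is not None else sys.stdin
        yield None, stream.read()
        return
    for path in paths:
        # Each named file becomes one document.
        with open(path, encoding="utf-8") as handle:
            yield path, handle.read()
```

Shell expansion such as `doc*.txt` needs no special handling here, since the shell expands the pattern into separate file names before the program sees its arguments.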
> Opt 2: To stdout like the current `suggest` does, separating documents by first showing the document name and then on the following lines the subject suggestions for the document, e.g.:
I think this is the way to go. For easier grepping etc., I would perhaps add some kind of extra tag in addition to the filename, something like:
Suggestions for tests/corpora/archaeology/fulltext/440866.txt
<http://www.yso.fi/onto/yso/p6218> riimukirjoitus 0.3213897943496704
<http://www.yso.fi/onto/yso/p6479> viikingit 0.18659920990467072
<http://www.yso.fi/onto/yso/p12738> viikinkiaika 0.18625082075595856
<http://www.yso.fi/onto/yso/p22768> Kiinan muuri 0.15950888395309448
<http://www.yso.fi/onto/yso/p3973> antiikki 0.13840530812740326
<http://www.yso.fi/onto/yso/p14588> riimukivet 0.1362432837486267
<http://www.yso.fi/onto/yso/p14173> kaivaukset 0.1201547235250473
<http://www.yso.fi/onto/yso/p5713> hautalöydöt 0.11249098181724548
<http://www.yso.fi/onto/yso/p15031> viikinkiretket 0.11039584875106812
<http://www.yso.fi/onto/yso/p5714> muinaishaudat 0.10336380451917648
I think it would be logical to use this output format whenever named files are used (instead of stdin), even if there is just one file.
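A rough sketch of that output rule, assuming a hypothetical `format_suggestions` helper and tab-separated columns (the separator is an assumption; the example above only shows whitespace):

```python
from typing import Iterable, List, Optional, Tuple

# Simplified stand-in for a suggestion: (URI, label, score).
Suggestion = Tuple[str, str, float]


def format_suggestions(
    name: Optional[str], suggestions: Iterable[Suggestion]
) -> List[str]:
    """Return output lines for one document.

    A "Suggestions for <name>" header is added only for named files;
    name=None means the text came from stdin, so the output stays as before.
    """
    lines = []
    if name is not None:
        lines.append(f"Suggestions for {name}")
    for uri, label, score in suggestions:
        lines.append(f"<{uri}>\t{label}\t{score}")
    return lines


if __name__ == "__main__":
    demo = [("http://www.yso.fi/onto/yso/p6218", "riimukirjoitus", 0.5)]
    print("\n".join(format_suggestions("doc1.txt", demo)))
```

The fixed `Suggestions for` prefix is what makes the per-document boundaries easy to grep for.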
> The other CLI commands (eval, hyperopt, optimize, index) are intended to operate on multiple documents so the output question does not concern them
Yes, and I think this is where we could expect the most benefits. For example, `eval` could potentially be much faster with some backends if it can use batched processing internally, even though nothing would change from the user's perspective: the command itself and its output would remain the same.
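Internally, a command like `eval` would need to group its document stream into batches before calling the batch method. A minimal sketch of such a chunking helper follows; the name `batched` and the idea of a fixed batch size are assumptions for illustration, not something decided in this issue.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Split any iterable into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # Pull the next batch_size items without loading the whole corpus.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because it consumes the input lazily, this would also work for corpora that are too large to hold in memory at once.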
REST
> The best way to pass in the documents seems to use application/json encoding
I agree that JSON encoding seems like a good choice here, but there are other options that perhaps shouldn't be dismissed outright:
- It's possible to use old-fashioned `application/x-www-form-urlencoded` encoding, like the current `suggest` methods. There could be a field `text` or `texts` that is defined as an array in the OpenAPI spec (see Describing Request Body, section "Form Data"). In practice, this would mean that the values are repeated in the encoded body, like this: `limit=10&threshold=0.2&text=doc1&text=doc2&text=doc3` (here `doc1`, `doc2` and `doc3` are placeholders for document text). For me, the main attraction of this would be that it may allow extending the current `suggest` method without defining a new `suggest-batch` method; though I suspect that the return data format would have to be different anyway, so maybe it would just create confusion.
- It's also possible to use multipart requests where each document is a separate part, although I don't think we want to go there.
If we go for the JSON encoding, the parameters could also be given as URL parameters: `POST /projects/yso-tfidf-en/suggest?limit=10&threshold=0.2`, though I'm unsure if this is any better than just passing them in the JSON.
A little nitpick: why did you use an array here? I think a single object would be enough.
"parameters": [
{
"language": "string",
"limit": 10,
"threshold": 0
}
]
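With `parameters` as a single object, reading such a request body could look roughly like this. The overall payload shape (one top-level object with `parameters` and `documents` keys), the default values and the helper name are all assumptions for the sake of the sketch; nothing here is a decided API.

```python
import json
from typing import Any, Dict, List, Tuple

# Assumed defaults, used when a parameter is left out of the request.
DEFAULTS: Dict[str, Any] = {"limit": 10, "threshold": 0.0}


def parse_batch_request(raw: str) -> Tuple[Dict[str, Any], List[str]]:
    """Split a JSON request body into (parameters, document texts)."""
    payload = json.loads(raw)
    # "parameters" is a plain object here, so it merges directly over the defaults.
    parameters = {**DEFAULTS, **payload.get("parameters", {})}
    texts = [doc["text"] for doc in payload["documents"]]
    return parameters, texts


if __name__ == "__main__":
    example = json.dumps(
        {
            "parameters": {"limit": 5},
            "documents": [{"text": "A quick brown fox jumped over the lazy dog."}],
        }
    )
    print(parse_batch_request(example))
```

Keeping `parameters` as an object rather than a one-element array avoids an extra indexing step on the server side.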
FWIW, I also checked the Maui Server API, but it doesn't have a batched version of `suggest` that we could copy.