InfluxCommunity/influxdb-ruby

Count of results returned by query() is not equal to the number of queries passed

Closed this issue · 11 comments

We can pass multiple queries to InfluxDB::Client#query by concatenating them, separated by a ;.
If we do that, we expect the number of results returned to be the same as the number of queries passed. This is currently not the case.

This also affects the result of a single query that uses a GROUP BY clause on a tag. In that case the number of results returned is equal to the number of distinct tags returned by the GROUP BY operation.

This breaks all the expectations and prevents useful batching of multiple queries especially when intermediate queries may return no data.

@hogaur @satyanash

Add InfluxDB::Client#batched_query to fix #217 and close #218. It takes an array of query strings to be run in a single HTTP request to Influx.

dmke commented

Thanks for reporting and looking into this. I will have a look as well when I'm back home (in about 3-4 hours).

dmke commented

I'm currently integrating your PR into the pr/218 branch. I have moved the batch related methods and specs into seperate files and fixed the rubocop issues.

Before this lands in master, I'll also add a smoke test to actually perform multiple queries against a running InfluxDB server (this might take a moment; please refrain from forking off the pr/218 branch, as I will force-push while fighting with Travis ;-))

dmke commented

After using it a bit, I find the distinction between #query and #batched_query not quite usable... I'm in favour of giving #query the ability to do the right thing™, when we pass an array of queries.

As you've also discovered, this causes an ambiguity with GROUP BY clauses. But that could be mitigated, if we add the statement_id response parameter into the return value, like so:

client.query(["select * from foo", "select * from foo group by bar"])
#=> [ {"statement_id" => 0, "name" => "foo", "tags" => nil,              "values" => [...] },
#     {"statement_id" => 1, "name" => "foo", "tags" => { "bar" => "a" }, "values" => [...] },
#     {"statement_id" => 1, "name" => "foo", "tags" => { "bar" => "b" }, "values" => [...] } ]

Would that be a workable solution for you?

I agree that #batched_query may not be the best way. We went with creating #batched_query so that we didn't break the existing behaviour for #query.

How about changing #query to always return an array of arrays(series) when an array of queries is supplied. Then it's trivial to check if the inner array is .empty? for when no data is returned. GROUP BYs are also nicely contained in the inner array since it isn't flattened. It also removes the need for #query to return the statement_id which until now wasn't exposed to the caller.

How do you propose to support query parameter substitution when multiple queries are passed? #batched_query does not take params: in the above implementation.

dmke commented

I like the "Array-in Array-out" approach, it extends the existing API nicely.

For parameter substitution I'd say, you'll need to supply an array of parameters (where "parameters" designates an array of positional arguments or a hashes with named arguments). Let the behaviour of passing a simple array/hash be undefined for now. The naïve implementation then could be:

- query = builder.build(query, params)
+ query = [*query].map_with_index {|q, i|
+   builder.build q.chomp(";"), params[i]
+ }

where builder.build raises an execption when params[i] does not support #[] for indexing.


That being said (and having slept over this issue), I wonder if extending #query is the right approach, as it encapsulates too many responsiblities in one method (this isn't Rails after all ;-)). Batching should be seperate from (simple) querying.

Therefore, I suggest the following API:

batch = client.batch do |b|
  b.add(query1, params: params1) #=> 0 (statement id)
  b.add(query2) #=> 1
end

# adding queries outside block
batch.add(query3) #=> 2

# fetch series and return results
batch.execute #=> [results1, results2, results3]

# or with block
batch.execute do |statement_id, name, tags, values|
end

where batch.add inherits the semantics from client.query, with one notable exception: it doesn't perform the HTTP request, and returns a statement ID. Calling client.fetch_series and processing the result then happens later in batch.execute (if batch.execute is stateless, it should be possible to call it multiple times).

This still has some kinks (:denormalize and :chunk_size should not be acceptible by batch.add, but either by client.batch or batch.execute), but I believe it's doable.

it encapsulates too many responsiblities in one method

Exactly, even rubocop was complaining that the file had too many lines 😀.

The batch object which separates construction and execution looks like a good way to have #query like semantics while still accepting multiple queries and returning multiple results.

Do we have access to the statement_id before we send it to Influx? Or are we just assuming that the order in which a query was added will be the order in which Influx will return statement_ids?

dmke commented

We can safely predict the statement ID:

StatementID is just the statement's position in the query

(see query.Result type, the ID corresponds to the control variable i in a simple loop).

Great, looks like we have a decent contract now.

dmke commented

I've merged a preliminary version into the master branch. Feel free to play around with this for a few days. Documentation is still missing though (but will follow, before I push 0.6.0 to rubygems).

dmke commented

One slight problem: Batching now requires InfluxDB >= 1.2.0, as the statement ID was not available before that version.

(Note to self: add this fact to the readme)