influxdata/influxdb

Subqueries run much slower for simple SELECT statement

sparrc opened this issue · 21 comments

Bug report

System info: [Include InfluxDB version, operating system name, and other relevant details]

Amazon Linux
InfluxDB 1.3.6

Steps to reproduce:

  1. Run a query counting and averaging response times:
> SELECT mean(response_time),count(response_time) FROM fastly_logs WHERE time > now() - 1m GROUP BY account_id
name: fastly_logs
tags: account_id=3855502814001
time                         mean   count
----                         ----   -----
2017-11-16T14:03:13.0021277Z 0.0194 5

name: fastly_logs
tags: account_id=4603804221001
time                         mean  count
----                         ----  -----
2017-11-16T14:03:13.0021277Z 0.642 1

name: fastly_logs
tags: account_id=5102072605001
time                         mean  count
----                         ----  -----
2017-11-16T14:03:13.0021277Z 0.016 3

name: fastly_logs
tags: account_id=5130362111001
time                         mean                count
----                         ----                -----
2017-11-16T14:03:13.0021277Z 0.04241873600155608 10269

name: fastly_logs
tags: account_id=5219833342001
time                         mean                count
----                         ----                -----
2017-11-16T14:03:13.0021277Z 0.05869871794871796 156

name: fastly_logs
tags: account_id=5330942432001
time                         mean                count
----                         ----                -----
2017-11-16T14:03:13.0021277Z 0.35833333333333334 6

...
...
  1. Run the same query but within a subquery, selecting some or all elements of the query:
> SELECT mean,count FROM (SELECT mean(response_time),count(response_time) FROM fastly_logs WHERE time > now() - 1m GROUP BY account_id)
name: fastly_logs
time                           mean                 count
----                           ----                 -----
2017-11-16T14:03:35.169381598Z 0.015                8
2017-11-16T14:03:35.169381598Z 0.11424999999999999  5
2017-11-16T14:03:35.169381598Z 0.0625               9
2017-11-16T14:03:35.169381598Z 0.0315               8
2017-11-16T14:03:35.169381598Z 0.23199999999999998  6
2017-11-16T14:03:35.169381598Z                      2
2017-11-16T14:03:35.169381598Z 0.012                1
2017-11-16T14:03:35.169381598Z 0.042096663600633646 18626
2017-11-16T14:03:35.169381598Z 0.05981645569620251  228
2017-11-16T14:03:35.169381598Z 0.4050000000000001   7
2017-11-16T14:03:35.169381598Z 0.004                8
2017-11-16T14:03:35.169381598Z 0.016                4
2017-11-16T14:03:35.169381598Z 0.3765               3
2017-11-16T14:03:35.169381598Z 0.057                5

Expected behavior: [What you expected to happen]

I would expect these queries to take about the same amount of time

Actual behavior: [What actually happened]

the 2nd query takes around 30-60s whereas the first returns almost instantly

Additional info: [Include gist of relevant config, logs, etc.]

I will attach vars.txt and iostat.txt in a comment

👋 again BTW 😄

I can provide sample data to reproduce this if you want, just get in touch with me on Gopher Slack

I notice the same thing. Influxdb-1.3.6 on CentOS 7.2. Even something as simple as:

select mean(value) from table

returns instantly (for 5M values), where

select mean(value) from (select value from table)

takes over a minute.

We have seen this problem too, but I found that by moving the time constraint from the inner to the outer select, the query would complete much faster, and as expected.
In this example I'm wanting to count the number of groups that contain the tested values.

Time in inner select ran in 37 seconds

> time influx -database testdb -execute "select count(sum) from (select sum(util_pc) from host_id where group = 'devel' and name =~ /testing/ and type = 'static' and util_pc > 50 and time > now() - 15m group by ident)"
name: host_id
time count
---- -----
0    15
0.008u 0.004s 0:37.24 0.0%	0+0k 0+0io 0pf+0w

Time in outer select ran in 2 seconds

% time influx -database testdb -execute "select count(sum) from (select sum(util_pc) from host_id where group = 'devel' and name =~ /testing/ and type = 'static' and util_pc > 50 group by ident) where time > now() - 15m"
name: host_id
time                count
----                -----
1516125703050401501 15
0.004u 0.008s 0:02.19 0.0%	0+0k 0+0io 0pf+0w

Original single query ran in 2 seconds

> time influx -database testdb -execute "select sum(util_pc) from host_id where group = 'devel' and name =~ /testing/ and type = 'static' and util_pc > 50 and time > now() - 15m group by ident" 
name: host_id
tags: ident=82459165
time                sum
----                ---
1516125638290016167 199.79999999999995

name: host_id
tags: ident=82459821
time                sum
----                ---
1516125638290016167 399.69999999999993

name: host_id
tags: ident=82459927
time                sum
----                ---
1516125638290016167 400.0999999999999

name: host_id
tags: ident=82460499
time                sum
----                ---
1516125638290016167 399.69999999999993

name: host_id
tags: ident=82460500
time                sum
----                ---
1516125638290016167 372.3999999999999

name: host_id
tags: ident=82460501
time                sum
----                ---
1516125638290016167 341.69999999999993

name: host_id
tags: ident=82460502
time                sum
----                ---
1516125638290016167 268.19999999999993

name: host_id
tags: ident=82460505
time                sum
----                ---
1516125638290016167 399.69999999999993

name: host_id
tags: ident=82460506
time                sum
----                ---
1516125638290016167 400.0999999999999

name: host_id
tags: ident=82460507
time                sum
----                ---
1516125638290016167 400.19999999999993

name: host_id
tags: ident=82460508
time                sum
----                ---
1516125638290016167 399.5999999999999

name: host_id
tags: ident=82527534
time                sum
----                ---
1516125638290016167 393.20000000000005

name: host_id
tags: ident=82527535
time                sum
----                ---
1516125638290016167 393.4
0.004u 0.008s 0:02.36 0.0%	0+0k 0+0io 0pf+0w

I'm running into this in 1.5.3. I've found some detail using EXPLAIN ANALYZE that seems to explain why moving the WHERE clause to the outside speeds things up as noted by @daveselan.

The WHERE time ... clause in the subquery does not seem to be honored. In my case I'm limiting to a 3 minute time span in the subquery, but explain analyze shows that the planner is iterating through all 24 1 hour shards instead of the 1 shard containing the 3 minute time range.

By moving the WHERE clause to the outer query, the query runs much faster and EXPLAIN ANALYZE shows the only the expected shard is iterated.

To make this a bit more concrete, here's the query I want to run. This takes around 45 seconds and creates iterators on 25 shards:
EXPLAIN ANALYZE SELECT TOP("totals", "ip", 7) FROM (SELECT SUM(iops) as "totals" FROM "stats" WHERE "time" > 1529612445s - 3m AND "time" < 1529612445s AND "hostname" = 'myhost.local' GROUP BY "ip")

Moving the WHERE clause to the outside speeds things up to about 3 seconds and create an iterator on one shard only. The same data is returned as above:
EXPLAIN ANALYZE SELECT TOP("totals", "ip", 7) FROM (SELECT SUM(iops) as "totals" FROM "stats" WHERE "hostname" = 'myhost.local' GROUP BY "ip") WHERE "time" > 1529612445s - 3m AND "time" < 1529612445s

This seems like a bug to me, I'd expect the planner to use the WHERE clause to limit the scope of the subquery that has the WHERE clause.

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale commented

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

This still is an issue :(

from @trollvottel , this is still an issue in 1.7.9.

I can confirm this issue (running 1.6.3). We're doing lots of queries like this

SELECT sum("mean")
FROM (
    SELECT mean("value") 
    FROM sometable
    WHERE ... $timeFilter
    GROUP BY time($__interval),"host", "interface" fill(null)
)
GROUP BY time($__interval), "host" fill(null)

The inner query itselfs runs pretty fast, but combined it takes ages. It doesn't make a difference if add a second timefilter on the outer query.

Same problem on 1.7.4.
It's a real showstopper

This is scheduled for the 1.8.1 release and should be backported to the 1.7 line as well. It will not be backported to 1.6 so you will need to upgrade.

I don't know about "1.8.1" or "1.7" or "1.6", but I am seeing this issue here:

$ grafana-server -v
Version 6.2.5 (commit: 6082d19, branch: HEAD)

Derp! Thanks!

So, I am still seeing this in Influxdb-1.7.7

Do you know what version in the 1.7 line will receive the backport?

Sorry, I guess that was addressed to @russorat

@dgnorton since 1.8.3 has already been released. Did somebody had time to look into this problem yet?

Waiting for this, too

Hello all.
I already tested this case in 1.8.3 and read the EXPLAIN ANALYZE. And it has the same issue.

Same issue happening in influx 1.8 with this query:
select sum(value) as _value from (select (non_negative_difference(bytes_sent)/(1024.0 * 1024.0)) as value from net where time >= 1623315960000000000 and time <= 1623920760000000000 and orgid='xxxxx' and interface='ens4') where time >= 1623315960000000000 and time <= 1623920760000000000 group by deviceid, interface, time(604800s,32760s) fill(0.0)

Tried removing time constraint in the inner query but still taking more than 20s to run.
Anybody has any workaround subqueries or any other solution?