Data points aggregates are slow
wjoel opened this issue · 4 comments
Our data point aggregate reads should be as fast as anything else, yet this query takes more than an hour:
dp_df = dp.select("value", "timestamp", "name").where(dp.name.isin(["VAL_11-PT-92117:X.Value","VAL_11-PT-92117:X.Value"])).where(dp.aggregation.isin(['avg'])).filter(dp.granularity == '5s')
print datetime.datetime.utcnow(); dp_df.count(); print datetime.datetime.utcnow()
2019-03-13 08:02:00.015129
38739328
2019-03-13 09:11:27.553756
Meanwhile, a simple Python script using the Cognite SDK's client.datapoints.get_datapoints_frame with aggregates=['avg'], granularity='5s', start=datetime.datetime(2012, 10, 10), end=datetime.datetime(2019, 3, 12) finishes in 1-3 minutes with 38706662 results.
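As a sanity check on those row counts (using only the dates from this thread): 5-second granularity over that window gives roughly 40.5 million buckets, so the ~38.7 million rows both clients return look near-complete, with the shortfall presumably down to gaps in the data.

```python
from datetime import datetime

# Number of 5-second aggregate buckets between the start and end
# used in the SDK script above.
window = datetime(2019, 3, 12) - datetime(2012, 10, 10)
buckets = int(window.total_seconds() // 5)
print(buckets)  # 40504320
```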
Figure out why we're so much slower, and fix it. Are we looping incorrectly? Are we downloading way too few aggregates per call?
The Python SDK splits the time window among several workers using https://github.com/cognitedata/cognite-sdk-python/blob/5b32fe42a1f2555ed66a5018e0b66b20a5f2a705/cognite/client/_utils.py#L153
Perhaps it's as simple as that, and we should do this as well?
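For reference, the splitting strategy is roughly this (a hypothetical re-implementation, not the SDK's actual helper; the real _utils.py code also aligns chunk boundaries to the granularity, which is omitted here):

```python
from datetime import datetime

def split_time_window(start, end, num_workers=10):
    """Split [start, end) into num_workers contiguous sub-windows,
    one per worker, so datapoint requests can run in parallel."""
    step = (end - start) / num_workers
    windows = [(start + i * step, start + (i + 1) * step)
               for i in range(num_workers)]
    # Pin the final boundary to `end` to avoid rounding drift.
    windows[-1] = (windows[-1][0], end)
    return windows

# The window from the SDK script above, with the SDK's default of 10 workers.
windows = split_time_window(datetime(2012, 10, 10), datetime(2019, 3, 12))
```

Each sub-window then becomes an independent request, which both parallelizes the download and spreads load across the backend.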
New run with Spark was a bit faster:
2019-03-13 09:26:45.627573
38710109
2019-03-13 10:00:59.967290
Perhaps splitting also reduces hotspotting; otherwise it doesn't seem like splitting the time window among a handful of workers alone would improve things so dramatically.
The SDK's default is 10 workers, so perhaps it's that simple after all.