Data points aggregates are slow
wjoel opened this issue · 4 comments
Our data point aggregate reads should be as fast as anything else, yet this query takes more than an hour:
dp_df = dp.select("value", "timestamp", "name").where(dp.name.isin(["VAL_11-PT-92117:X.Value","VAL_11-PT-92117:X.Value"])).where(dp.aggregation.isin(['avg'])).filter(dp.granularity == '5s')
print datetime.datetime.utcnow(); dp_df.count(); print datetime.datetime.utcnow()
2019-03-13 08:02:00.015129
38739328
2019-03-13 09:11:27.553756
Meanwhile, a simple Python script using the Cognite SDK's client.datapoints.get_datapoints_frame with aggregates=['avg'], granularity='5s', start=datetime.datetime(2012, 10, 10), end=datetime.datetime(2019, 3, 12) finishes in 1-3 minutes with 38706662 results.
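As a sanity check on those row counts (using only the dates from this thread): 5-second granularity over that window gives roughly 40.5 million buckets, so the ~38.7 million rows both clients return look near-complete, with the shortfall presumably down to gaps in the data.

```python
from datetime import datetime

# Number of 5-second aggregate buckets between the start and end
# used in the SDK script above.
window = datetime(2019, 3, 12) - datetime(2012, 10, 10)
buckets = int(window.total_seconds() // 5)
print(buckets)  # 40504320
```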
Figure out why we're so much slower, and fix it. Are we looping incorrectly? Are we downloading way too few aggregates per call?
The Python SDK splits the time window among several workers using https://github.com/cognitedata/cognite-sdk-python/blob/5b32fe42a1f2555ed66a5018e0b66b20a5f2a705/cognite/client/_utils.py#L153
Perhaps it's as simple as that, and we should do this as well?
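For reference, the splitting strategy is roughly this (a hypothetical re-implementation, not the SDK's actual helper; the real _utils.py code also aligns chunk boundaries to the granularity, which is omitted here):

```python
from datetime import datetime

def split_time_window(start, end, num_workers=10):
    """Split [start, end) into num_workers contiguous sub-windows,
    one per worker, so datapoint requests can run in parallel."""
    step = (end - start) / num_workers
    windows = [(start + i * step, start + (i + 1) * step)
               for i in range(num_workers)]
    # Pin the final boundary to `end` to avoid rounding drift.
    windows[-1] = (windows[-1][0], end)
    return windows

# The window from the SDK script above, with the SDK's default of 10 workers.
windows = split_time_window(datetime(2012, 10, 10), datetime(2019, 3, 12))
```

Each sub-window then becomes an independent request, which both parallelizes the download and spreads load across the backend.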
New run with Spark was a bit faster:
2019-03-13 09:26:45.627573
38710109
2019-03-13 10:00:59.967290
Perhaps splitting also reduces hotspotting; otherwise it doesn't seem like splitting the time window among a handful of workers alone would improve things so dramatically.
The SDK's default is 10 workers, so perhaps it's that simple after all.