cognitedata/cdp-spark-datasource

Data points aggregates are slow

wjoel opened this issue · 4 comments

wjoel commented

Reading data points aggregates through the Spark datasource should be as fast as anything else, yet this query takes more than an hour:

import datetime

# dp is the data points DataFrame read through the Spark datasource
dp_df = dp.select("value", "timestamp", "name") \
    .where(dp.name.isin(["VAL_11-PT-92117:X.Value", "VAL_11-PT-92117:X.Value"])) \
    .where(dp.aggregation.isin(["avg"])) \
    .filter(dp.granularity == "5s")
print(datetime.datetime.utcnow()); dp_df.count(); print(datetime.datetime.utcnow())

2019-03-13 08:02:00.015129
38739328
2019-03-13 09:11:27.553756

Meanwhile, a simple Python script using the Cognite SDK's client.datapoints.get_datapoints_frame with aggregates=['avg'], granularity='5s', start=datetime.datetime(2012,10,10), end=datetime.datetime(2019,3,12) finishes in 1-3 minutes with 38706662 results.
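Roughly, the comparison script looks like the sketch below. The call and the aggregates/granularity/start/end keyword arguments are the ones quoted above; the client construction and the time_series parameter name are assumptions.

import datetime
from cognite.client import CogniteClient

# Assumes API key and project are picked up from environment variables
client = CogniteClient()

df = client.datapoints.get_datapoints_frame(
    time_series=["VAL_11-PT-92117:X.Value"],  # assumed parameter name
    aggregates=["avg"],
    granularity="5s",
    start=datetime.datetime(2012, 10, 10),
    end=datetime.datetime(2019, 3, 12),
)
print(len(df))  # ~38.7M rows, finishes in 1-3 minutes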

Figure out why we're so much slower, and fix it. Are we looping incorrectly? Are we downloading way too few aggregates per call?
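For a rough sense of scale (the per-request limit and latency below are assumed numbers, not measurements), serial small-batch fetching alone is enough to explain an hour-plus runtime:

total_aggregates = 38739328     # count reported above
per_request = 10000             # assumed per-request aggregate limit
seconds_per_request = 0.6       # assumed round-trip time per request
requests = total_aggregates / per_request             # ~3,874 requests
serial_minutes = requests * seconds_per_request / 60   # ~39 minutes if fetched serially
parallel_minutes = serial_minutes / 10                  # ~4 minutes with 10 workers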

wjoel commented

The Python SDK splits the time window among several workers using https://github.com/cognitedata/cognite-sdk-python/blob/5b32fe42a1f2555ed66a5018e0b66b20a5f2a705/cognite/client/_utils.py#L153

Perhaps it's as simple as that, and we should do this as well?
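A minimal sketch of the idea (not the SDK's actual code): split [start, end) into roughly equal windows and fetch them concurrently. The get_aggregates helper is hypothetical, standing in for the per-window aggregate request.

import datetime
from concurrent.futures import ThreadPoolExecutor

def split_time_window(start, end, num_workers):
    # Divide [start, end) into num_workers roughly equal sub-windows
    step = (end - start) / num_workers
    return [(start + i * step, start + (i + 1) * step) for i in range(num_workers)]

def fetch_window(window):
    window_start, window_end = window
    return get_aggregates(start=window_start, end=window_end)  # hypothetical fetch helper

def fetch_parallel(start, end, num_workers=10):
    windows = split_time_window(start, end, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(fetch_window, windows))
    # Flatten the per-window results into a single list of data points
    return [dp for chunk in results for dp in chunk]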

wjoel commented

A new run with Spark was a bit faster:

2019-03-13 09:26:45.627573
38710109
2019-03-13 10:00:59.967290

Perhaps splitting the time window also reduces hotspotting; otherwise, spreading it across just a handful of workers doesn't seem like it would improve things so dramatically.

wjoel commented

The Python SDK's default is 10 workers, so perhaps it is that simple after all.

wjoel commented

Fixed by #193 with further improvements to come when #195 is fixed.