cognitedata/cdp-spark-datasource

Optimize pushdown logic


The current implementation of pushdown filters extracts all pushable columns from a WHERE clause and creates a separate request for each filter. The union of the results is read from CDF and duplicates are removed; Spark then applies the full filter (including the pushable filters) to the data. A sketch of this behavior follows below.
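For illustration, a minimal Scala sketch of that one-request-per-filter behavior, using Spark's standard Filter types. The CdfRequest case class and the restriction to equality filters are assumptions for the sketch, not the actual connector code:

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

object CurrentPushdown {
  // Hypothetical request type: one column restriction sent to CDF.
  case class CdfRequest(column: String, value: String)

  // Each pushable filter becomes its own CDF request. The caller
  // unions the results, deduplicates, and lets Spark re-apply the
  // full predicate.
  def toRequests(filters: Seq[Filter]): Seq[CdfRequest] =
    filters.collect {
      // Only simple equality filters are treated as pushable here.
      case EqualTo(attribute, value) => CdfRequest(attribute, value.toString)
    }
}
```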

This is not quite optimal, and may in some cases be slower than having no pushdown at all:
Consider the filter WHERE col1 = 'a' AND col2 = 'b', which is translated into two requests: one reading all rows where col1 = 'a', and one reading all rows where col2 = 'b'. If both conditions hold for every row, we end up reading the entire dataset twice.

As long as the API does not support OR we will never reach the optimal plan, but it does support AND, which opens the door to further optimization: a conjunction of pushable filters could be sent as a single request constrained on all of its columns, instead of several overlapping ones.
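A minimal sketch of what pushing conjunctions down could look like, again with a hypothetical CdfRequest type. A tree of And filters is collapsed into a single request, so col1 = 'a' AND col2 = 'b' yields one request restricted on both columns rather than two requests that each read a superset:

```scala
import org.apache.spark.sql.sources.{And, EqualTo, Filter}

object ConjunctivePushdown {
  // Hypothetical request type: a conjunction of column restrictions
  // sent to CDF as a single request.
  case class CdfRequest(restrictions: Map[String, String])

  // Collapse a tree of AND filters into one combined request.
  // Returns None if any leaf is not pushable, in which case the
  // whole predicate falls back to Spark-side filtering.
  def toRequest(filter: Filter): Option[CdfRequest] = filter match {
    case EqualTo(attribute, value) =>
      Some(CdfRequest(Map(attribute -> value.toString)))
    case And(left, right) =>
      for {
        l <- toRequest(left)
        r <- toRequest(right)
      } yield CdfRequest(l.restrictions ++ r.restrictions)
    case _ => None
  }
}
```

With this shape, the example above produces a single request with both restrictions, and the dataset is read at most once for that predicate.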