G-Research/spark-dgraph-connector

Support filter pushdown

EnricoMi opened this issue · 0 comments

Projections (select columns) and filters (where clauses) in Spark performed on the DataFrame provided by the connector can be pushed into the connector so that only relevant data are read from the Dgraph database. Otherwise (in current implementation), the entire Dgraph data is loaded into Spark first and then filtered and projected.

Filter pushdown allows to read only the relevant sub-graph, which improves read performances for tiny sub-graphs. There are various types of pushdowns available:

  • select uids
  • select predicates
  • select values
  • value type (including objectUid)
  • select columns

Select Uids

Uids (subject column) can be selected by where operation on the DataFrame using = and isin. Dgraph does not support < or > on uids.

Select Predicates

Predicates can be selected through select on the wide nodes DataFrame or where on the predicate column in all other DataFrames returned by the connector. The where operation should support = and isin.

Select Values

Objects can be filtered with where using equality =.

Value Type (including objectUid)

Object type can be filtered by where and is not null on typed DataFrames like typed triples and typed nodes. This would then filter for Dgraph predicates that have the respective type only.

Select Columns

The connector can only provide those columns of a returned DataFrame that are actually used later by Spark. This reduces the amount of data transfered to Spark.