G-Research/spark-dgraph-connector

Predicate-size uid partitioning


Given a predicate partitioning, we can further partition each partition orthogonally by uids. Some partitions may contain more rows than others. By splitting large partitions into more parts than smaller ones, we can achieve a more even final partitioning.
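A minimal sketch of the orthogonal uid split, assuming a simplified `UidRange` stand-in rather than the connector's actual partition classes: each predicate partition's uid space is cut into a given number of equally sized ranges.

```scala
// Hypothetical helper: split the uid space [0, maxUid) of one predicate
// partition into `parts` roughly equal uid ranges.
case class UidRange(firstUid: Long, untilUid: Long)

def splitUids(maxUid: Long, parts: Int): Seq[UidRange] = {
  val step = math.ceil(maxUid.toDouble / parts).toLong
  (0L until maxUid by step)
    .map(first => UidRange(first, math.min(first + step, maxUid)))
}
```

A large predicate partition would be split with a higher `parts` value than a small one, yielding a more even final partitioning.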

The zero service provides predicate size statistics. With these, we can compute the size of each partition. This size does not reflect the number of rows but the total size of the predicate values; types like string, geo, password and default have variable-size values, which makes estimating the number of rows difficult. Compression might complicate the estimation further. However, we can treat this size as a transfer-cost estimate and make the final partitioning even-sized with respect to that metric.
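A sketch of how the split counts could be derived from those statistics, assuming the predicate sizes have already been retrieved from the zero service and a target size per partition has been chosen (names and the target value are illustrative, not part of the connector):

```scala
// Hypothetical helper: number of uid parts per predicate partition,
// proportional to the predicate's reported size in bytes, so that each
// final partition carries roughly the same estimated transfer cost.
def partsPerPredicate(sizes: Map[String, Long], targetBytesPerPart: Long): Map[String, Int] =
  sizes.map { case (predicate, bytes) =>
    predicate -> math.max(1, math.ceil(bytes.toDouble / targetBytesPerPart).toInt)
  }

// Example: an 8 GiB predicate and a 256 MiB predicate with a 1 GiB target
// yield 8 and 1 uid parts respectively.
val parts = partsPerPredicate(
  Map("name" -> (8L << 30), "dgraph.type" -> (256L << 20)),
  targetBytesPerPart = 1L << 30
)
```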