euphoria-core: Sort per partition
t-novak opened this issue · 2 comments
t-novak commented
Currently, euphoria allows per-partition sorting only with an explicit partitioner, like this:
Sort.of(Dataset<IN>)
.by(UnaryFunction<IN, S>)
.setPartitioner(Partitioner<S>)
- Why does the partitioner operate on the sort key `S` instead of the input element `IN`? That is fine for `RangePartitioner`, but complicated in other use cases.
- Why can't the implicit partitioner (defined by a previous operation) be used? I believe Spark preserves partitioning on "actions".
- Is the dataset shuffled again because a new distribution of partitions is created here, even though the dataset was already partitioned? Partitioners in Spark are compared by `equals`, so it's easy to check whether a dataset is already partitioned correctly.
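To illustrate the last point, here is a minimal sketch of an `equals`-based check in plain Java. The classes `HashPartitionerSketch` and `ShuffleCheckSketch` are hypothetical names made up for this example. They are not part of euphoria or Spark; they only mimic the idea that two hash partitioners with the same number of partitions are considered equal.

```java
import java.util.Objects;

// Hypothetical partitioner for illustration only; not the euphoria or Spark API.
final class HashPartitionerSketch {
  final int numPartitions;

  HashPartitionerSketch(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  // Maps a key to a partition index in [0, numPartitions).
  int partitionOf(Object key) {
    return Math.floorMod(Objects.hashCode(key), numPartitions);
  }

  // Two partitioners are equal when they distribute keys identically,
  // which for this sketch means the same number of partitions.
  @Override
  public boolean equals(Object o) {
    return o instanceof HashPartitionerSketch
        && ((HashPartitionerSketch) o).numPartitions == numPartitions;
  }

  @Override
  public int hashCode() {
    return Integer.hashCode(numPartitions);
  }
}

// Hypothetical helper: decides whether a dataset has to be shuffled again.
final class ShuffleCheckSketch {
  // current is the partitioner left by the previous operation (null if unknown),
  // required is the partitioner the next operation asks for.
  static boolean needsShuffle(Object current, Object required) {
    return current == null || !current.equals(required);
  }
}
```

With this check, a dataset already partitioned by `new HashPartitionerSketch(4)` would not need another shuffle when the sort asks for an equal partitioner, but would when it asks for a different one.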
je-ik commented
Unfortunately, the `Sort` operator as it is defined now will very probably be dropped entirely. This is because the whole partitioning abstraction does not work well with Apache Beam. If you want to do an operation like "repartitionAndSortWithinPartitions", there will be only one way to do it:
- assign each element a partitionId
- do a reduce-by-key with this partitionId as a key
- and use external sort on the supplied Iterable over all elements in the same partition
This implementation can be generalized (maybe it could replace the current implementation of the `Sort` operator), but it has not been done yet, unfortunately. Would you be interested in implementing it?
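For illustration, the three steps listed above could look roughly like this in plain Java. This is only a sketch under simplifying assumptions: `SortPerPartitionSketch` and `sortPerPartition` are hypothetical names, the reduce-by-key is replaced by an in-memory grouping into a map, and the external sort is replaced by an in-memory `List.sort`. It uses neither the euphoria nor the Beam API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.function.ToIntFunction;

// Hypothetical helper for illustration only; not part of euphoria or Beam.
final class SortPerPartitionSketch {

  static <IN, S extends Comparable<S>> Map<Integer, List<IN>> sortPerPartition(
      List<IN> input,
      ToIntFunction<IN> partitionId,
      Function<IN, S> sortKey) {

    Map<Integer, List<IN>> partitions = new TreeMap<>();

    // Steps 1 and 2: assign each element a partition id and group by that id.
    // The map stands in for a distributed reduce-by-key.
    for (IN element : input) {
      int id = partitionId.applyAsInt(element);
      partitions.computeIfAbsent(id, k -> new ArrayList<>()).add(element);
    }

    // Step 3: sort the elements of each partition by the sort key.
    // A real implementation would use an external sort here, because a
    // partition may not fit into memory.
    for (List<IN> partition : partitions.values()) {
      partition.sort(Comparator.comparing(sortKey));
    }
    return partitions;
  }
}
```

Note that in this sketch the partition id is computed from the input element `IN` and not from the sort key `S`. For example, partitioning the words "pear", "fig", "apple", "kiwi", "plum" by `w.length() % 2` and sorting by the word itself yields partition 0 with [kiwi, pear, plum] and partition 1 with [apple, fig].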