seznam/euphoria

euphoria-core: Sort per partition

t-novak opened this issue · 2 comments

Currently Euphoria allows per-partition sorting only with an explicit partitioner, like this:

Sort.of(Dataset<IN>)
        .by(UnaryFunction<IN, S>)
        .setPartitioner(Partitioner<S>)
  1. Why does the partitioner work on the sort key S instead of the input element IN? That is fine for RangePartitioner but complicated for other uses.
  2. Why can't an implicit partitioner (one defined by a previous operation) be used? I believe Spark preserves partitioning on "actions".
  3. Is the dataset shuffled again because a new distribution of partitions is created here, even though the dataset was already partitioned? Partitioners in Spark are compared by equals, so it is easy to check whether a dataset is already partitioned correctly.
je-ik commented

Unfortunately, the Sort operator as currently defined will very probably be dropped entirely, because the partitioning abstraction as a whole does not work well with Apache Beam. If you want an operation like "repartitionAndSortWithinPartitions", there will be only one way to do it:

  • assign each element a partitionId
  • do a reduce-by-key with this partitionId as a key
  • and use an external sort on the supplied Iterable over all elements in the same partition
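
The three steps above can be sketched in plain Java, without any Euphoria or Beam API. This is only an illustration under stated assumptions: the class name `PartitionSortSketch`, the method `repartitionAndSort` and the length-based partitioning function are hypothetical, the grouping map stands in for a reduce-by-key on the partitionId, and an in-memory `List.sort` stands in for the external sort that a real implementation would need for large partitions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch, not part of Euphoria or Beam.
class PartitionSortSketch {

    static <IN, S extends Comparable<S>> Map<Integer, List<IN>> repartitionAndSort(
            List<IN> input,
            Function<IN, Integer> partitionId,
            Function<IN, S> sortBy) {
        // Steps 1 and 2: compute a partitionId per element and group by it,
        // which is what a reduce-by-key on partitionId hands to the reducer.
        Map<Integer, List<IN>> partitions = new TreeMap<>();
        for (IN element : input) {
            partitions.computeIfAbsent(partitionId.apply(element), id -> new ArrayList<>())
                      .add(element);
        }
        // Step 3: sort each group by the extracted sort key.
        // Assumption: in-memory sort replaces the external sort here.
        for (List<IN> group : partitions.values()) {
            group.sort(Comparator.comparing(sortBy));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> words = List.of("pear", "fig", "banana", "kiwi", "apple", "plum");
        // Hypothetical partitioning: word length modulo 2.
        Map<Integer, List<String>> result =
                repartitionAndSort(words, w -> w.length() % 2, w -> w);
        System.out.println(result);
    }
}
```

Note that the partitioning function here receives the input element IN rather than the sort key S, so any partitioning scheme can be combined with any sort key.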

This implementation can be generalized (maybe it could replace the current implementation of the Sort operator), but unfortunately that has not been done yet. Would you be interested in implementing it?

je-ik commented

Obsoleted via #160 and #158