thunder-project/thunder

tordd() bypasses the lazy sorting after repartition

boazmohar opened this issue · 3 comments

An edge case after #319 and PR #320.
If I call tordd() after repartition() but before any other command that forces a task, I will get an unsorted rdd.
@jwittenbach, @freeman-lab What do you guys think about this? We could:

  1. Add a warning in tordd() if the sorted property in bolt is False
  2. Force a sort in tordd()

Any other suggestions?

@freeman-lab @jwittenbach I suggest we add a few methods to base related to this issue:

  1. is_sorted() would return self.values._ordered in spark mode and True in local mode
  2. sort() would call self._rdd.sortByKey() and set self.values._ordered to True; it would not be available in local mode
  3. tordd() would issue a warning if self.values._ordered is False.

What do you think?
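
To make the proposal concrete, here is a rough sketch of how the three methods could fit together. It is only an illustration under stated assumptions: `Data` and `FakeRDD` are hypothetical names, `FakeRDD` is a plain-Python stand-in so the snippet runs without Spark, and the flag is stored as `self._ordered` rather than `self.values._ordered` to keep the example short. None of this is thunder's or bolt's actual code.

```python
import warnings


class FakeRDD:
    """Stand-in for a Spark RDD of (key, value) records (hypothetical, for illustration)."""

    def __init__(self, records):
        self.records = list(records)

    def sortByKey(self):
        # Mimics the behaviour of RDD.sortByKey(): returns a new, key-ordered RDD.
        return FakeRDD(sorted(self.records, key=lambda kv: kv[0]))

    def collect(self):
        return list(self.records)


class Data:
    """Hypothetical base class; names are illustrative, not thunder's API."""

    def __init__(self, rdd=None, ordered=True):
        self._rdd = rdd          # None stands for local mode
        self._ordered = ordered  # stands in for self.values._ordered

    def is_sorted(self):
        # Local data is never repartitioned, so it is always in order.
        return True if self._rdd is None else self._ordered

    def sort(self):
        if self._rdd is None:
            raise NotImplementedError("sort() is only available in spark mode")
        self._rdd = self._rdd.sortByKey()
        self._ordered = True
        return self

    def tordd(self):
        if self._rdd is None:
            raise NotImplementedError("tordd() is only available in spark mode")
        if not self._ordered:
            # Option (3): warn instead of silently returning unsorted records.
            warnings.warn("records are not sorted by key; call sort() first")
        return self._rdd
```

With this shape, calling `tordd()` right after a repartition would warn, and calling `sort()` first would both reorder the records and silence the warning.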

Nice @boazmohar, great ideas!

I think I'd rather put the sort() method in bolt: as you say, it's really only relevant to the distributed case, which bolt is there to handle.

We might also want to call it order(), or at least keep the naming consistent, e.g. order() and _ordered, or sort() and _sorted. And instead of is_sorted() you can just add an @property that returns the value; check out how we do shape in bolt (see the example here).
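
Roughly, the property version could look like the sketch below. The class name `ArraySketch` is made up for illustration and the snippet is not bolt's actual implementation; it only shows a read-only `@property` exposing the private `_ordered` flag.

```python
class ArraySketch:
    """Hypothetical array wrapper; only the ordering flag is modelled here."""

    def __init__(self, ordered=True):
        self._ordered = ordered

    @property
    def ordered(self):
        # Read-only view of the private flag; no setter is defined on purpose.
        return self._ordered
```

Callers would then read `arr.ordered` as an attribute instead of calling a method, and could not overwrite it by accident.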

But we can definitely add the warning mentioned in (3) to thunder now if you want to put in a PR for that.

@freeman-lab Once we know the bolt API for sorting, I'll open a PR with the warning.
Thanks for the comments!