tordd() bypasses the lazy sorting after repartition
boazmohar opened this issue · 3 comments
An edge case after #319 and PR #320.
If I call `tordd()` after `repartition()` but before any other command that forces a task, I will get an unsorted RDD.
@jwittenbach, @freeman-lab What do you guys think about this? We could:
- Add a warning in `tordd` if the `sorted` property in bolt is `False`
- Force a sort in `tordd`

Any other suggestions?
@freeman-lab @jwittenbach I suggest we add a few methods to `base` related to this issue:
1. `is_sorted()` would return `self.values._ordered` in spark mode and `True` in local mode
2. `sort()` would call `self._rdd.sortByKey()` and set `self.values._ordered` to `True`; it will not be available in local mode
3. `tordd()` would issue a warning if `self.values._ordered` is `False`

What do you think?
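To make the three proposed behaviours concrete, here is a minimal pure-Python sketch. It is only an illustration: `ToyDistributed` and its `_records` list are hypothetical stand-ins for the Spark-backed object and its RDD, and nothing here is the actual thunder or bolt API. The reversal in `repartition()` just simulates a shuffle that leaves keys out of order.

```python
import warnings


class ToyDistributed:
    """Hypothetical stand-in for a Spark-backed array (not thunder/bolt code)."""

    def __init__(self, records, ordered=True):
        # A list of (key, value) pairs plays the role of the underlying RDD.
        self._records = list(records)
        self._ordered = ordered

    def repartition(self):
        # Simulate a shuffle that leaves records out of key order;
        # the sort is deferred, so the flag is set to False.
        return ToyDistributed(self._records[::-1], ordered=False)

    def is_sorted(self):
        # Proposal 1: report the ordering flag.
        return self._ordered

    def sort(self):
        # Proposal 2: analogue of sortByKey() followed by setting the flag.
        by_key = sorted(self._records, key=lambda kv: kv[0])
        return ToyDistributed(by_key, ordered=True)

    def tordd(self):
        # Proposal 3: warn instead of silently returning unsorted records.
        if not self._ordered:
            warnings.warn("records are not sorted by key; call sort() first")
        return self._records
```

With this sketch, `ToyDistributed(...).repartition().tordd()` emits a warning and returns the records out of order, while inserting `.sort()` before `.tordd()` returns them in key order without a warning.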
Nice @boazmohar, great ideas!
I think I'd rather put a `sort()` method in `bolt`, as you say, it's really only relevant to the distributed case, which `bolt` is there to handle.
We might also want to call it `order()` or at least make it all consistent, e.g. `order()` and `_ordered` or `sort()` and `_sorted`. And instead of `isordered()` you can just add an `@property` that returns the value, check out how we do `shape` in `bolt` (see the example here)
But we can definitely add the warning mentioned in (3) to `thunder` now if you want to put in a PR for that.
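As a rough sketch of the `@property` idea with consistent naming (`order()` paired with `_ordered`), something like the following could work. `ToyBoltArray` and its `_records` list are hypothetical and only illustrate the pattern; this is not how bolt itself is implemented.

```python
class ToyBoltArray:
    """Hypothetical sketch of the naming suggestion (not bolt's actual API)."""

    def __init__(self, records, ordered=True):
        # A list of (key, value) pairs stands in for the distributed data.
        self._records = list(records)
        self._ordered = ordered

    @property
    def ordered(self):
        # Read-only view of the private flag, accessed without parentheses.
        return self._ordered

    def order(self):
        # Sort by key in place and record that the data is now ordered.
        self._records.sort(key=lambda kv: kv[0])
        self._ordered = True
        return self
```

Because `ordered` has no setter, callers can read `arr.ordered` but cannot assign to it; the flag only changes through `order()`.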
@freeman-lab Once we know the bolt API for sorting, I will do a PR with the warning.
Thanks for the comments!