dask/dask-expr

Best practice for handing off persisted collection partitions

rjzamora opened this issue · 5 comments

While working on rapidsai/dask-cuda#1311, I noticed that a common practice used in downstream libraries no longer works cleanly with the move to dask-expr.

The common practice:

  1. Persist a collection (df = df.persist())
  2. Find the worker-to-partition mapping for the persisted collection using mapping = client.who_has() and df.__dask_keys__() (see the sketch after this list)
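
For readers unfamiliar with the pattern, here is a minimal, hedged sketch of those two steps. The small example DataFrame and the locally created client are illustrative only; the relevant part is combining client.who_has() with df.__dask_keys__():

    import pandas as pd
    import dask.dataframe as dd
    from distributed import Client, wait

    client = Client()  # illustrative; or connect to an existing scheduler
    df = dd.from_pandas(pd.DataFrame({"x": range(8)}), npartitions=4)

    df = df.persist()           # (1) pin the partitions on the workers
    wait(df)                    # block until every partition is materialized

    who_has = client.who_has()  # (2) key -> worker addresses for data on the cluster
    keys = df.__dask_keys__()   # partition keys of the persisted collection

    # Invert into worker -> list of partition indices.
    # NOTE: depending on the distributed version, who_has may report keys in
    # stringified form, hence the str(key) lookup below.
    worker_to_parts = {}
    for i, key in enumerate(keys):
        for worker in who_has.get(str(key), ()):
            worker_to_parts.setdefault(worker, []).append(i)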

The problem with dask-expr:

In dask-expr, calling df.persist() will change the "name" (and therefore the keys) of the collection. The name change is a result of both expression optimization and the creation of a new FromGraph expression. Therefore, you cannot call df = df.persist() and then search for the keys of df in the cluster.
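
To make the symptom concrete, here is a hedged sketch. It assumes a connected distributed client named client, as above, and the exact key representation returned by who_has may vary across versions:

    df = df.persist()
    wanted = set(map(str, df.__dask_keys__()))   # keys the persisted collection reports
    on_cluster = set(client.who_has())           # keys the scheduler actually holds
    # With legacy dask.dataframe the reported keys can be found on the cluster;
    # with dask-expr they generally cannot, because persist() renamed the collection.
    print(wanted & on_cluster)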

The question: What is the new "best practice" for patterns like this?

For reference, here is something that seems to work for now:

    df = df.persist()
    try:
        # Only works for a FromGraph-backed collection
        persisted_keys = df.keys
    except AttributeError:
        # Only works for a legacy collection
        persisted_keys = df.__dask_keys__()

Okay, thanks - I suppose this approach is backward compatible:

    df = df.persist()
    persisted_keys = [f.key for f in c.client.futures_of(df)]
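
For completeness, a hedged sketch of how the futures_of route could feed the original worker-to-partition mapping (futures_of here is the module-level dask.distributed.futures_of, and client is assumed to be a connected distributed client):

    from distributed import futures_of, wait

    df = df.persist()
    wait(df)
    futures = futures_of(df)            # futures backing the persisted partitions
    who_has = client.who_has(futures)   # key -> worker addresses for those futures
    worker_to_keys = {}
    for key, workers in who_has.items():
        for worker in workers:
            worker_to_keys.setdefault(worker, []).append(key)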

Could you provide a little more context for what you're doing? This feels to me like an abstraction leak that bites us whenever we touch this API. I am touching this API again for the scheduler integration, and this shortcoming could be fixed, but it would be helpful to know a little about the application.

I recommend just using the dask.distributed.futures_of function. It's been around for a while and is generally how this problem gets solved.

This feels to me like an abstraction leak that bites us whenever we touch this API.

By "this" API, are you referring to futures_of or who_has? I'm happy to use whatever you all recommend moving forward.

it would be helpful to know a little about the application

I've seen this used in a few downstream libraries. The specific application I am looking at right now is just a custom shuffling algorithm that I am very comfortable experimenting with. However, other downstream libraries (e.g. cugraph, nemo) also use who_has to temporarily hand off execution and communication to something other than Dask. For example, cugraph will persist the collection, figure out where all the data is, and then execute a collective operation in C++/NCCL land. This is a very common pattern in RAPIDS.

I recommend just using the dask.distributed.futures_of function. It's been around for a while and is generally how this problem gets solved.

Great. I'm not familiar with this API, but happy to use it and recommend it if it works.