dask-expr planning causes incorrect merge behavior at compute time after dropna

Question

zmbc opened this issue 8 months ago · 0 comments

Describe the issue: With query planning enabled, a dropna followed by a merge that produces overlapping columns reports the correct output columns while the expression is still lazy, but those columns are missing from the result once the merge is actually computed.

Minimal Complete Verifiable Example:

import dask
# Uncommenting the next line (disabling query planning) makes the example work
# dask.config.set({"dataframe.query-planning": False})

import dask.dataframe as dd
import pandas as pd

sample_df = dd.from_pandas(pd.DataFrame({'foo': ["1", "2", "3"], 'bar': ["4", "5", "6"]}), npartitions=1)
sample_df_dropna = sample_df.dropna(subset=["foo"], how="any")

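# Lazy columns, no dropna
print(sample_df.merge(
    sample_df,
    on=["foo"], how="left"
).columns)

# Computed columns, no dropna
print(sample_df.merge(
    sample_df,
    on=["foo"], how="left"
).compute().columns)

# Lazy columns after dropna: bar_x and bar_y are reported as expected
print(sample_df_dropna.merge(
    sample_df_dropna,
    on=["foo"], how="left"
).columns)

# Computed columns after dropna: bar_x and bar_y are missing (the bug)
print(sample_df_dropna.merge(
    sample_df_dropna,
    on=["foo"], how="left"
).compute().columns)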

The last print statement shows that when the expression is actually computed, bar_x and bar_y aren't created by the merge as they should be.
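
For reference, here is a minimal sketch of the same dropna-then-merge in plain pandas, which shows the columns the computed dask result would be expected to have. The names pdf, pdf_dropna and expected are only illustrative and are not part of the original report.

```python
import pandas as pd

# Same data as the dask example, held in a plain pandas DataFrame.
pdf = pd.DataFrame({"foo": ["1", "2", "3"], "bar": ["4", "5", "6"]})
pdf_dropna = pdf.dropna(subset=["foo"], how="any")

# Self-merge on "foo": the overlapping "bar" column receives the
# default suffixes "_x" and "_y".
expected = pdf_dropna.merge(pdf_dropna, on=["foo"], how="left")

print(list(expected.columns))  # ['foo', 'bar_x', 'bar_y']
```

Comparing list(expected.columns) against the columns of the computed dask merge is one way to confirm whether a given dask version is affected.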

Anything else we need to know?:

Environment:

Dask version: 2024.5.0
Python version: 3.10.14
Operating System: Linux
Install method (conda, pip, source): pip