pola-rs/polars

a inner join sql return wrong results

Closed this issue · 7 comments

l1t1 commented

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [5, 4, 3, 2, 1],
    "fruits": ["banana", "banana", "apple", "apple", "banana"],
    "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
})

context = pl.SQLContext(t=df, t1=df)
lf = context.execute("""
    SELECT t.A, t.fruits, t1.B, t1.cars FROM t, t1 WHERE t.A=t1.B
""")
lf.collect()

# shape: (1, 4)
# ┌─────┬────────┬─────┬────────┐
# │ A   ┆ fruits ┆ B   ┆ cars   │
# │ --- ┆ ---    ┆ --- ┆ ---    │
# │ i64 ┆ str    ┆ i64 ┆ str    │
# ╞═════╪════════╪═════╪════════╡
# │ 3   ┆ apple  ┆ 3   ┆ beetle │
# └─────┴────────┴─────┴────────┘

lf = context.execute("""
    SELECT t.A, t.fruits, t1.B, t1.cars FROM t JOIN t1 ON t.A=t1.B
""")
lf.collect()
# shape: (5, 4)
# ┌─────┬────────┬─────┬────────┐
# │ A   ┆ fruits ┆ B   ┆ cars   │
# │ --- ┆ ---    ┆ --- ┆ ---    │
# │ i64 ┆ str    ┆ i64 ┆ str    │
# ╞═════╪════════╪═════╪════════╡
# │ 5   ┆ banana ┆ 1   ┆ beetle │
# │ 4   ┆ apple  ┆ 2   ┆ beetle │
# │ 3   ┆ apple  ┆ 3   ┆ beetle │
# │ 2   ┆ banana ┆ 4   ┆ audi   │
# │ 1   ┆ banana ┆ 5   ┆ beetle │
# └─────┴────────┴─────┴────────┘

Log output

No response

Issue description

1st sql should return 5 rows as following , but it only return 1 rows
2nd sql should return every lines has t.A=t1.B,but only 1 row fits that (t.A=t1.B=3)

Expected behavior

they both return 5 rows like this

+----------------------------
|A   | fruits | B   | cars   
|--- | ---    | --- | ---    
|i64 | str    | i64 | str    
¦-----+--------+-----+-------
|1   | banana | 1   | beetle 
|2   | banana | 2   | beetle 
|3   | apple  | 3   | beetle 
|4   | apple  | 4   | audi   
|5   | banana | 5   | beetle 
+----------------------------

Installed versions

--------Version info---------
Polars:               0.20.3
Index type:           UInt32
Platform:             Windows-7-6.1.7601-SP1
Python:               3.8.8 (tags/v3.8.8:024d805, Feb 19 2021, 13:18:16) [MSC v.1928 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2021.11.1
gevent:               22.10.2
hvplot:               <not installed>
matplotlib:           3.3.4
numpy:                1.23.4
openpyxl:             3.1.1
pandas:               1.3.2
pyarrow:              6.0.1
pydantic:             1.8.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

Similar thing happens to cross join: #11927

l1t1 commented

seems left join also wrong

df = pl.DataFrame({
    "id":['1','2','3']
})

df2 = pl.DataFrame({
    "id":['4','5','6']
})
ctx = pl.SQLContext(df=df, df2=df2)

lp1=ctx.execute(
    """
    SELECT df.id as id1,df2.id as id2
    FROM df
        left JOIN df2
    on df.id=df2.id
    """,
    eager=False,
)
lp1.collect()
shape: (3, 2)
┌─────┬─────┐
│ id1 ┆ id2 │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ 11   │
│ 22   │
│ 33   │
└─────┴─────┘

should returns

shape: (3, 2)
┌─────┬─────┐
│ id1 ┆ id2 │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ 1   ┆ null   │
│ 2   ┆ null   │
│ 3   ┆ null   │
└─────┴─────┘

@universalmind303 can you take a look here? It seems we create wrong plans somehow, as it only happens in SQL.

Triage:

  • For the first query: we don't actually recognise/support implicit join syntax (at all) yet, so it gets parsed (incorrectly) as a simple filter op where A == B. I'll have to see if we can identify the implicit join syntax and either generate the equivalent inner join or raise an error.

  • For the second query: it seems we aren't resolving the post-join column selection properly here. We're doing the join correctly, but then returning only the cols from the left hand table. Will look at this first 🤔

l1t1 commented

SELECT t.A, t.fruits, t1.B, t1.cars FROM t, t1 WHERE t.A=t1.B still returns wrong result in version 10. 20.31

SELECT t.A, t.fruits, t1.B, t1.cars FROM t, t1 WHERE t.A=t1.B still returns wrong result in version 10. 20.31

Indeed - as mentioned earlier we don't support implicit join syntax at the moment; can you confirm that the other error is fixed and open an issue for implicit join syntax as a new feature request? (Easier for us to track open/closed issues if they are distinct) 😎👍

l1t1 commented

confirmed the inner join syntax works now