a inner join sql return wrong results
Closed this issue · 7 comments
Checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
df = pl.DataFrame({
"A": [1, 2, 3, 4, 5],
"B": [5, 4, 3, 2, 1],
"fruits": ["banana", "banana", "apple", "apple", "banana"],
"cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
})
context = pl.SQLContext(t=df, t1=df)
lf = context.execute("""
SELECT t.A, t.fruits, t1.B, t1.cars FROM t, t1 WHERE t.A=t1.B
""")
lf.collect()
# shape: (1, 4)
# ┌─────┬────────┬─────┬────────┐
# │ A ┆ fruits ┆ B ┆ cars │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ i64 ┆ str │
# ╞═════╪════════╪═════╪════════╡
# │ 3 ┆ apple ┆ 3 ┆ beetle │
# └─────┴────────┴─────┴────────┘
lf = context.execute("""
SELECT t.A, t.fruits, t1.B, t1.cars FROM t JOIN t1 ON t.A=t1.B
""")
lf.collect()
# shape: (5, 4)
# ┌─────┬────────┬─────┬────────┐
# │ A ┆ fruits ┆ B ┆ cars │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ i64 ┆ str │
# ╞═════╪════════╪═════╪════════╡
# │ 5 ┆ banana ┆ 1 ┆ beetle │
# │ 4 ┆ apple ┆ 2 ┆ beetle │
# │ 3 ┆ apple ┆ 3 ┆ beetle │
# │ 2 ┆ banana ┆ 4 ┆ audi │
# │ 1 ┆ banana ┆ 5 ┆ beetle │
# └─────┴────────┴─────┴────────┘
Log output
No response
Issue description
1st sql should return 5 rows as following , but it only return 1 rows
2nd sql should return every lines has t.A=t1.B,but only 1 row fits that (t.A=t1.B=3)
Expected behavior
they both return 5 rows like this
+----------------------------
|A | fruits | B | cars
|--- | --- | --- | ---
|i64 | str | i64 | str
¦-----+--------+-----+-------
|1 | banana | 1 | beetle
|2 | banana | 2 | beetle
|3 | apple | 3 | beetle
|4 | apple | 4 | audi
|5 | banana | 5 | beetle
+----------------------------
Installed versions
--------Version info---------
Polars: 0.20.3
Index type: UInt32
Platform: Windows-7-6.1.7601-SP1
Python: 3.8.8 (tags/v3.8.8:024d805, Feb 19 2021, 13:18:16) [MSC v.1928 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 2.0.0
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2021.11.1
gevent: 22.10.2
hvplot: <not installed>
matplotlib: 3.3.4
numpy: 1.23.4
openpyxl: 3.1.1
pandas: 1.3.2
pyarrow: 6.0.1
pydantic: 1.8.2
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
Similar thing happens to cross join: #11927
seems left join also wrong
df = pl.DataFrame({
"id":['1','2','3']
})
df2 = pl.DataFrame({
"id":['4','5','6']
})
ctx = pl.SQLContext(df=df, df2=df2)
lp1=ctx.execute(
"""
SELECT df.id as id1,df2.id as id2
FROM df
left JOIN df2
on df.id=df2.id
""",
eager=False,
)
lp1.collect()
shape: (3, 2)
┌─────┬─────┐
│ id1 ┆ id2 │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ 3 │
└─────┴─────┘
should returns
shape: (3, 2)
┌─────┬─────┐
│ id1 ┆ id2 │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ 1 ┆ null │
│ 2 ┆ null │
│ 3 ┆ null │
└─────┴─────┘
@universalmind303 can you take a look here? It seems we create wrong plans somehow, as it only happens in SQL.
Triage:
-
For the first query: we don't actually recognise/support implicit join syntax (at all) yet, so it gets parsed (incorrectly) as a simple filter op where
A == B
. I'll have to see if we can identify the implicit join syntax and either generate the equivalent inner join or raise an error. -
For the second query: it seems we aren't resolving the post-join column selection properly here. We're doing the join correctly, but then returning only the cols from the left hand table. Will look at this first 🤔
SELECT t.A, t.fruits, t1.B, t1.cars FROM t, t1 WHERE t.A=t1.B
still returns wrong result in version 10. 20.31
SELECT t.A, t.fruits, t1.B, t1.cars FROM t, t1 WHERE t.A=t1.B
still returns wrong result in version 10. 20.31
Indeed - as mentioned earlier we don't support implicit join syntax at the moment; can you confirm that the other error is fixed and open an issue for implicit join syntax as a new feature request? (Easier for us to track open/closed issues if they are distinct) 😎👍
confirmed the inner join syntax works now