Support array expressions for __getitem__
timsaucer opened this issue · 2 comments
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
If you have an expression that is an array, it would be convenient to do something like df.select(col("a")[0]) to get the first item in that array.
Describe the solution you'd like
This could be done with a simple check of the argument to __getitem__. If it is a string or expression, continue to do what we do now. If it is an integer, call functions.array_element instead.
Describe alternatives you've considered
The current workaround is to call functions.array_element directly.
Additional context
This seems to do the trick. Leaving it here as a note to myself to commit it after I finish my current task:
    def __getitem__(self, key: str | int) -> Expr:
        """For struct data types, return the field indicated by ``key``."""
        if isinstance(key, int):
            return Expr(functions_internal.array_element(self.expr, Expr.literal(key).expr))
        return Expr(self.expr.__getitem__(key))

    def test_expr_getitem() -> None:
        ctx = SessionContext()
        data = {
            "array_values": [[1, 2, 3], [4, 5], [6], []],
            "struct_values": [
                {"name": "Alice", "age": 15},
                {"name": "Bob", "age": 14},
                {"name": "Charlie", "age": 13},
                {"name": None, "age": 12},
            ],
        }
        df = ctx.from_pydict(data, name="table1")

        names = df.select(col("struct_values")["name"].alias("name")).collect()
        names = [r.as_py() for rs in names for r in rs["name"]]

        array_values = df.select(col("array_values")[2].alias("value")).collect()
        array_values = [r.as_py() for rs in array_values for r in rs["value"]]

        assert names == ["Alice", "Bob", "Charlie", None]
        assert array_values == [2, 5, None, None]

Something to consider: if we do this, the natural Python expectation is that col("a")[0] gives the first element, but under the SQL convention the first element is col("a")[1]. It's a potential source of confusion.
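
To make the indexing question concrete, here is a small pure-Python sketch. It does not use DataFusion: sql_array_element is a toy model of a 1-based lookup that yields None when the index is out of range, and to_sql_index is a hypothetical helper name, not something from the library. It compares passing the key straight through (as in the snippet above) with shifting it by one first.

```python
from typing import Optional, Sequence


def sql_array_element(values: Sequence[int], index: int) -> Optional[int]:
    """Toy model of a 1-based, SQL-style element lookup (not the DataFusion API)."""
    if 1 <= index <= len(values):
        return values[index - 1]
    return None  # out-of-range lookups yield None (NULL) instead of raising


def to_sql_index(key: int) -> int:
    """Hypothetical helper: map a 0-based Python index to a 1-based SQL index."""
    if key < 0:
        raise ValueError("negative indices are out of scope for this sketch")
    return key + 1


rows = [[1, 2, 3], [4, 5], [6], []]

# Key passed straight through: [2] selects the second element of each row.
sql_style = [sql_array_element(row, 2) for row in rows]

# Key shifted first: [2] selects the third element, matching Python lists.
python_style = [sql_array_element(row, to_sql_index(2)) for row in rows]

print(sql_style)
print(python_style)
```

With the pass-through version, the same key picks a different element than it would on a plain Python list, which is the confusion described above. Shifting the key inside __getitem__ would hide that difference, at the cost of col("a")[n] no longer matching what a direct call to array_element does with the same number.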