apache/datafusion-python

Support array expressions for __getitem__

timsaucer opened this issue · 2 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

If you have an expression that is an array, it would be convenient to do something like df.select(col("a")[0]) to get the first item in that array.

Describe the solution you'd like

This could be done with a simple check of the argument to __getitem__. If it is a string or expression, continue to do what we do now. If it is a number then we call functions.array_element.

Describe alternatives you've considered

The current workaround is to call functions.array_element directly.

Additional context

This seems to do the trick. Leaving it here as a note to myself to commit it once I finish my current task.

    def __getitem__(self, key: str | int) -> Expr:
        """Return the struct field named ``key``, or the array element at index ``key``.

        An integer ``key`` is passed to ``array_element`` unchanged.
        """
        if isinstance(key, int):
            # Integer key: treat the expression as an array.
            return Expr(functions_internal.array_element(self.expr, Expr.literal(key).expr))
        return Expr(self.expr.__getitem__(key))

from datafusion import SessionContext, col


def test_expr_getitem() -> None:
    ctx = SessionContext()
    data = {
        "array_values": [[1, 2, 3], [4, 5], [6], []],
        "struct_values": [
            {"name": "Alice", "age": 15},
            {"name": "Bob", "age": 14},
            {"name": "Charlie", "age": 13},
            {"name": None, "age": 12},
        ],
    }
    df = ctx.from_pydict(data, name="table1")

    # A string key selects a struct field.
    names = df.select(col("struct_values")["name"].alias("name")).collect()
    names = [r.as_py() for rs in names for r in rs["name"]]

    # An integer key selects an array element; missing positions come back as None.
    array_values = df.select(col("array_values")[2].alias("value")).collect()
    array_values = [r.as_py() for rs in array_values for r in rs["value"]]

    assert names == ["Alice", "Bob", "Charlie", None]
    assert array_values == [2, 5, None, None]

Something to consider: if we do this, the natural Python reading is that col("a")[0] returns the first element, but the SQL approach is 1-based, so col("a")[1] would be the first element. That is a potential source of confusion.
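
To make the trade-off concrete, here is a rough pure-Python sketch of the two indexing conventions applied to the array column from the test above. Nothing in it is DataFusion API: sql_array_element is only a stand-in model of a 1-based lookup that returns None when the position is out of range, and to_sql_index is a hypothetical helper showing what a 0-based translation might look like. How negative keys should behave is an assumption left open here.

```python
from typing import Any, Optional, Sequence


def sql_array_element(values: Sequence[Any], index: int) -> Optional[Any]:
    """Pure-Python model of a 1-based element lookup (not the DataFusion API).

    Assumption: an out-of-range index yields None, mirroring the None values
    the test expects for the one-element and empty arrays.
    """
    if 1 <= index <= len(values):
        return values[index - 1]
    return None


def to_sql_index(key: int) -> int:
    """Hypothetical helper: map a 0-based Python index to a 1-based SQL index.

    Negative keys are passed through unchanged; how the backend treats them
    is an assumption that would need to be verified.
    """
    return key + 1 if key >= 0 else key


rows = [[1, 2, 3], [4, 5], [6], []]

# Proposed behavior: the key is forwarded as-is, so [2] means "second element".
sql_style = [sql_array_element(row, 2) for row in rows]

# Alternative: translate first, so [2] means "third element", as in a Python list.
python_style = [sql_array_element(row, to_sql_index(2)) for row in rows]
```

With the key forwarded as-is, sql_style matches the [2, 5, None, None] asserted in the test. With the translation, the same col("array_values")[2] would instead yield [3, None, None, None], which is what a Python user would likely expect from list indexing.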