googleapis/python-bigquery-dataframes

Allow querying the bigframes dataframe with custom BigQuery SQL like pandasql

NiloFreitas opened this issue · 2 comments

Feature Request

Supposed I used bigframes and processed my data like this:
res_df = df.assign(pred=input_remote_function.apply(generate_text))

And got this output dataframe res_df:

uri pred
gs://dataproc-metastore-public-binaries/ads_ba... {"interpretation": " The primary message of th...
gs://dataproc-metastore-public-binaries/ads_ba... {"interpretation": " The primary message of th...
gs://dataproc-metastore-public-binaries/ads_ba... {"interpretation": " The primary message of th...

As you can see, the pred column is a JSON string.

Describe the solution you'd like

I would like to use BigQuery JSON Functions to explode this column into a column for each JSON attribute.

For example, I could write a custom BigQuery SQL code to run for this dataframe, like this:
res_df = df.sqldf("SELECT *, JSON_EXTRACT(pred, '$.interpretation') AS interpretation FROM df")

Getting as a result the res_df:

uri pred interpretation
gs://datap... {"interpretation": " The primary message of th... The primary messag...
gs://datap... {"interpretation": " The primary message of th... The primary messag...
gs://datap... {"interpretation": " The primary message of th... The primary messag...

Describe alternatives you've considered

As an alternative I need to use remote functions, like this one, to explode the column:

@bpd.remote_function(
    [str],
    str,
    bigquery_connection=CONNECTION_ID
)
def extract_json_attribute(input_content: str) -> str:
    
    pred = input_content.split("|||")[0]
    json_attribute = input_content.split("|||")[1]
    
    import json
    return json.loads(pred)[json_attribute]
input_remote_function_int = res_df['pred'] + '|||interpretation'
df = res_df.assign(interpretation=input_remote_function_int.apply(extract_json_attribute))

Which I cannot pass multiple parameter because of #646

Additional context

The pandasql allow us to query pandas DataFrames using SQL syntax.
Bigframes supporting, pushing down the query to BQ would be helpful.

https://pypi.org/project/pandasql/

It is an interesting request. Maybe we can just inject the df table into the SQL statements? But one argument is if user wants to use SQL, why they don't just use SQL with BQ? @tswast @TrevorBergeron WDYT?

I was just talking with @shanecglass and @aliciawilliams about the potential for a similar feature.

I was thinking of providing a SQL engine to our equivalent of https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eval.html, but mimicking the pandasql interface could be even more natural. Looks like it's MIT license (https://github.com/yhat/pandasql/blob/master/LICENSE.txt) so we could pull some of that into third_party if needed.