AbsaOSS/spark-hats

How to use spark-hats functions on a DataFrame in PySpark?

Closed this issue · 2 comments

This looks like a super helpful extension for dealing with deeply nested fields in Spark. I'd love to see if it can help me with my problems, but I'm using PySpark in Python.

I think it installs properly with:

from pyspark.sql.session import SparkSession
spark = (
    SparkSession.builder
    .config('spark.jars.packages', 'za.co.absa:spark-hats_2.12:0.2.2')
    .getOrCreate()
)

Since I see the following in the logs:

:: resolution report :: resolve 157ms :: artifacts dl 4ms
	:: modules in use:
	za.co.absa#spark-hats_2.12;0.2.2 from central in [default]
	za.co.absa#spark-hofs_2.12;0.4.0 from central in [default]

But when I create a DataFrame and try to access the functions, I have no success:

>>> empty_df = spark.createDataFrame([], schema="")
>>> empty_df.nestedWithColumn()
AttributeError: 'DataFrame' object has no attribute 'nestedWithColumn'
>>> empty_df._jdf.nestedWithColumn()
Py4JError: An error occurred while calling o63.nestedWithColumn. Trace:
py4j.Py4JException: Method nestedWithColumn([]) does not exist

So I'm not sure if anyone here has experience with PySpark and has any insights. I'll also update this issue if I find a solution.

You need to wrap the JVM calls; here's an example for nestedWithColumn:

from pyspark.sql import Column, DataFrame
from pyspark.sql.column import _to_java_column


def nestedWithColumn(df: DataFrame, newColumnName: str, expression: Column) -> DataFrame:
    # Apply the Scala DataFrameExtension from spark-hats to the underlying
    # Java DataFrame, then wrap the result back into a Python DataFrame.
    return DataFrame(
        df.sql_ctx._jvm.za.co.absa.spark.hats.Extensions.DataFrameExtension(df._jdf)
        .nestedWithColumn(newColumnName, _to_java_column(expression)),
        df.sql_ctx,
    )
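
For instance, a quick usage sketch (the input file, schema, and column names here are hypothetical, loosely mirroring the README's Scala example):

from pyspark.sql import functions as F

# Hypothetical input with an array of structs: my_array: array<struct<a, b>>
df = spark.read.json("data.json")

# Add a field "c" to every struct element of "my_array", analogous to the
# Scala df.nestedWithColumn("my_array.c", concat(...)) from the README.
result = nestedWithColumn(
    df, "my_array.c",
    F.concat(F.col("my_array.a").cast("string"), F.col("my_array.b"))
)
result.printSchema()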

You could also monkey-patch the PySpark DataFrame class with those wrappers, as sketched below.
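
A minimal sketch of that monkey-patching (assuming the nestedWithColumn wrapper above):

from pyspark.sql import DataFrame

# Attach the wrapper as a method so it can be called in the familiar
# fluent style, e.g. df.nestedWithColumn("my_array.c", expr)
DataFrame.nestedWithColumn = nestedWithColumn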

Thanks a lot for the answer, @ravwojdyla!