databricks/databricks-vscode

[BUG] - ModuleNotFoundError when calling a function that uses a User Defined Function (UDF)

pietrodantuono opened this issue · 2 comments

System information

  • Runtime: Databricks-VSCode (Databricks Runtime 13.3.x Scala 2.12)
  • PySpark version: 3.4.2
  • Python version: 3.10.1
  • Operating system: Windows 10 Build 19045

Code structure

repo/
├── helper/
│   ├── __init__.py
│   ├── helper_module.py
│   └── ...
├── notebooks/
│   ├── notebook.ipynb
│   └── ...
└── pyproject.toml

Code sample

# helper_module.py

# From the Python Standard Library
import struct
# From PySpark
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import DataFrame


def str_hex_to_numeric(
    hex_value: str,
    data_type_name: str
) -> float:
    """Convert a hex string to a numeric value."""
    if data_type_name == "Float":
        return struct.unpack('!f', bytes.fromhex(hex_value))[0]
    raise ValueError(f"Unknown data type: {data_type_name}")


def value_col_hex_to_numeric(
    df: DataFrame,
    value_col: str = "VALUE",
    data_type_name_col: str = "DATA_TYPE_NAME"
) -> DataFrame:
    """Convert a hex string to a numeric value."""
    return df.withColumn(
        value_col,
        F.udf(
            str_hex_to_numeric, T.FloatType()
        )(F.col(value_col), F.col(data_type_name_col))
    )

# notebook.ipynb
# Navigate to the repo root directory and install the helper module
%pip install -e .

# Import the helper module
from helper import helper_module

# Create a Spark DataFrame
df = spark.createDataFrame([("1", "Float", "3f800000"), ("2", "Float", "40000000"),
                            ("3", "Float", "40400000"), ("4", "Float", "40800000")],
                            ["INDEX", "DATA_TYPE_NAME", "VALUE"])

# Convert the hex string to a numeric value
df = helper_module.value_col_hex_to_numeric(df)

# Display the DataFrame
df.show()

# -- Databricks Connect returns the following error --
# ModuleNotFoundError: No module named 'helper'
# 
# -- While Azure Databricks returns the expected output --
# +-----+--------------+----------+
# |INDEX|DATA_TYPE_NAME|     VALUE|
# +-----+--------------+----------+
# |    1|         Float|       1.0|
# |    2|         Float|       2.0|
# |    3|         Float|       3.0|
# |    4|         Float|       4.0|
# +-----+--------------+----------+ 
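
For reference, defining the UDF inline in the notebook (so nothing has to be imported from helper on the executor side) appears to sidestep the module lookup; a minimal sketch using the DataFrame created above, although the imported helper is of course the pattern I need:

# notebook.ipynb -- UDF defined inline instead of imported from the helper package
import struct

import pyspark.sql.functions as F
import pyspark.sql.types as T

def _str_hex_to_numeric(hex_value: str, data_type_name: str) -> float:
    if data_type_name == "Float":
        return struct.unpack("!f", bytes.fromhex(hex_value))[0]
    raise ValueError(f"Unknown data type: {data_type_name}")

hex_to_numeric = F.udf(_str_hex_to_numeric, T.FloatType())
df = df.withColumn("VALUE", hex_to_numeric(F.col("VALUE"), F.col("DATA_TYPE_NAME")))
df.show()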

I have the same issue.

To add to this: it works for me when I apply the @udf decorator directly, or when I wrap str_hex_to_numeric inside another function:

from pyspark.sql.functions import udf

@udf
def str_hex_to_numeric(hex_value: str, data_type_name: str) -> float:
    ...

or

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def udf_wrapper():
    def str_hex_to_numeric(hex_value: str, data_type_name: str) -> float:
        ...
    return udf(str_hex_to_numeric, FloatType())
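
The column helper then builds the UDF through the wrapper at call time. A sketch of how value_col_hex_to_numeric could look with this variant (same signature as in the original report):

# helper_module.py -- column helper using the udf_wrapper() variant
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def value_col_hex_to_numeric(
    df: DataFrame,
    value_col: str = "VALUE",
    data_type_name_col: str = "DATA_TYPE_NAME"
) -> DataFrame:
    """Convert the hex-string value column to numeric values."""
    hex_to_numeric = udf_wrapper()  # UDF is created inside the helper, not at module level
    return df.withColumn(
        value_col,
        hex_to_numeric(F.col(value_col), F.col(data_type_name_col))
    )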

What also doesn't work is referencing anything from outside the function's scope, for example module-level constants.
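
A concrete illustration of what I mean (the names and the constant are made up for the example):

# helper_module.py -- referencing a module-level constant from inside the UDF
from pyspark.sql.functions import udf

SCALE_FACTOR = 2.0  # module-level constant

@udf("double")
def scaled_value(raw: float) -> float:
    # Referencing SCALE_FACTOR from the surrounding module fails for me
    # under Databricks Connect.
    return raw * SCALE_FACTOR

@udf("double")
def scaled_value_inline(raw: float) -> float:
    # Keeping everything inside the function body avoids the external reference.
    return raw * 2.0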

I have the same issue. I ran into it with this use case:

df = df.withColumn('result', my_udf(col('some_data')))

where my_udf is in a helper module.

The only workaround I've found so far is to package the helper up as a wheel, install the wheel on the cluster, and then run my notebook from the Databricks workspace rather than from VS Code.
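
Roughly, that looks like the following (the wheel file name and the workspace path are placeholders from my setup):

# On the local machine, from the repo root (uses the `build` package):
#   pip install build
#   python -m build        # writes dist/helper-<version>-py3-none-any.whl

# In a notebook on the cluster, after uploading the wheel (path is an example):
%pip install /Workspace/Shared/wheels/helper-0.1.0-py3-none-any.whl

# Then import as usual
from helper import helper_module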