[BUG] Can't use modularized code as UDF
Closed this issue · 2 comments
Describe the bug
Code defined in other modules cannot be used inside UDFs.
To Reproduce
- Configure the extension and attach to a cluster.
- Create a file `mylib.py` containing some code, e.g.:

```python
def reverse(text):
    return text[::-1]
```
- Create a file `main.py` with a UDF and call it, e.g.:
```python
import logging

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col

import mylib

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    {'data': 'foo'},
])
reverse_udf = udf(mylib.reverse)
rows = df.select(reverse_udf(col('data'))).collect()
logging.error(rows)
```
The final folder looks like this:

```
.
├── main.py
└── mylib.py
```
- Select `main.py` and try to run it with "Upload and Run File with Databricks".
- It fails with:

```
ModuleNotFoundError: No module named 'mylib'
```
System information:
- Paste the output of the `Help: About` command (Cmd-Shift-P).
```
Version: 1.81.0
Commit: 6445d93c81ebe42c4cbd7a60712e0b17d9463e97
Date: 2023-08-02T12:36:11.334Z
Electron: 22.3.18
ElectronBuildId: 22689846
Chromium: 108.0.5359.215
Node.js: 16.17.1
V8: 10.8.168.25-electron.0
OS: Linux x64 5.19.0-46-generic
```
- Databricks Extension Version
v1.1.1
Databricks Extension Logs
Please attach the databricks extension logs <--- (this link seems broken.)
Additional context
The problem seems to be that `sys.path` does not include `/Workspace/Users/...`.
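One way to confirm this is to print the interpreter's module search path from `main.py` when it runs on the cluster; if the workspace directory containing `mylib.py` is absent, the import cannot succeed. A minimal diagnostic (nothing Databricks-specific):

```python
import sys

# Print the driver's module search path. If the directory holding
# main.py and mylib.py (under /Workspace/Users/...) is not listed,
# `import mylib` will raise ModuleNotFoundError.
for entry in sys.path:
    print(entry)
```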
Hi @igorgatis. This is a limitation of Spark. I have reached out to the relevant internal teams for advice on best practices here. Will keep you posted here about updates.
Hey guys,
Did you try adding it to `sys.path` manually in the file that is the entrypoint of your code?
I managed to get tests working with a modularized library, which seems to be pretty much the same setup, if I'm correct.
Of course this is not the cleanest way, but it can serve as a workaround for now.
EDIT: the code I have on my entrypoint file
You will probably have to modify the path for the "lib_root", since it depends on how your project is configured.
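The entrypoint code referenced above did not survive in this copy of the thread. A minimal sketch of such a workaround, assuming the modules sit next to the entrypoint (the `lib_root` value is a placeholder you must adapt to your project layout), might look like this at the top of `main.py`:

```python
import os
import sys

# Placeholder: point lib_root at the workspace directory that
# contains your modules (e.g. somewhere under /Workspace/Users/...).
# Here we assume they sit next to the entrypoint file itself.
lib_root = os.path.dirname(os.path.abspath(__file__))

# Prepend it so that `import mylib` resolves on the driver.
if lib_root not in sys.path:
    sys.path.insert(0, lib_root)
```

After this runs, `import mylib` should resolve on the driver; depending on the setup, the module may also need to be made available to the workers (e.g. via `spark.sparkContext.addPyFile`).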