[BUG] Can't use modularized code as UDF
Closed this issue · 2 comments
Describe the bug
Code defined in other modules cannot be used inside UDFs.
To Reproduce
- Configure the extension and attach to a cluster.
- Create a file `mylib.py` containing some code, e.g.:

```python
def reverse(text):
    return text[::-1]
```
- Create a file `main.py` with a UDF and call it, e.g.:
```python
import logging

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col

import mylib

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    {'data': 'foo'},
])
reverse_udf = udf(mylib.reverse)
rows = df.select(reverse_udf(col('data'))).collect()
logging.error(rows)
```
The final folder looks like this:

```
.
├── main.py
└── mylib.py
```
- Select `main.py` and try to run it with "Upload and Run File with Databricks".
- It fails with:

```
ModuleNotFoundError: No module named 'mylib'
```
System information:
- Paste the output of the `Help: About` command (Cmd-Shift-P).
```
Version: 1.81.0
Commit: 6445d93c81ebe42c4cbd7a60712e0b17d9463e97
Date: 2023-08-02T12:36:11.334Z
Electron: 22.3.18
ElectronBuildId: 22689846
Chromium: 108.0.5359.215
Node.js: 16.17.1
V8: 10.8.168.25-electron.0
OS: Linux x64 5.19.0-46-generic
```
- Databricks Extension Version
v1.1.1
Databricks Extension Logs
Please attach the databricks extension logs <--- (this link seems broken.)
Additional context
The problem seems to be that `sys.path` does not include `/Workspace/Users/...`.
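One way to confirm this is to print the interpreter's module search path from `main.py` when it runs on the cluster; if the workspace directory containing `mylib.py` is absent, the import cannot succeed. A minimal diagnostic (nothing Databricks-specific):

```python
import sys

# Print the driver's module search path. If the directory holding
# main.py and mylib.py (under /Workspace/Users/...) is not listed,
# `import mylib` will raise ModuleNotFoundError.
for entry in sys.path:
    print(entry)
```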
Hi @igorgatis. This is a limitation of Spark. I have reached out to the relevant internal teams for advice on best practices here. Will keep you posted here about updates.
Hey guys,
Did you try adding it to `sys.path` manually in the file that is the entrypoint of your code?
I managed to get tests working with a modularized library, which seems to be pretty much the same setup, if I'm correct.
Of course this is not the cleanest way, but it can serve as a workaround for now.
EDIT: the code I have on my entrypoint file
You will probably have to modify the path for the "lib_root", since it depends on how your project is configured.
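The entrypoint code referenced above did not survive in this copy of the thread. A minimal sketch of such a workaround, assuming the modules sit next to the entrypoint (the `lib_root` value is a placeholder you must adapt to your project layout), might look like this at the top of `main.py`:

```python
import os
import sys

# Placeholder: point lib_root at the workspace directory that
# contains your modules (e.g. somewhere under /Workspace/Users/...).
# Here we assume they sit next to the entrypoint file itself.
lib_root = os.path.dirname(os.path.abspath(__file__))

# Prepend it so that `import mylib` resolves on the driver.
if lib_root not in sys.path:
    sys.path.insert(0, lib_root)
```

After this runs, `import mylib` should resolve on the driver; depending on the setup, the module may also need to be made available to the workers (e.g. via `spark.sparkContext.addPyFile`).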