databricks/databricks-vscode

[BUG] Can't use modularized code as UDF

Closed this issue · 2 comments

Describe the bug
Cannot use code defined in other modules in UDFs.

To Reproduce

  1. Configure extension and attach to cluster.

  2. Create file mylib.py containing some code, e.g.:

def reverse(text):
    return text[::-1]
  3. Create file main.py with a UDF and call it, e.g.:
import logging
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col

import mylib

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    {'data': 'foo'},
])

reverse_udf = udf(mylib.reverse)

rows = df.select(reverse_udf(col('data'))).collect()
logging.error(rows)

Final folder looks like this:

.
├── main.py
└── mylib.py
  4. Select main.py and try to run it with Upload and Run File with Databricks.

  5. It fails with ModuleNotFoundError: No module named 'mylib'

System information:

  1. Paste the output of the Help: About command (CMD-Shift-P).
Version: 1.81.0
Commit: 6445d93c81ebe42c4cbd7a60712e0b17d9463e97
Date: 2023-08-02T12:36:11.334Z
Electron: 22.3.18
ElectronBuildId: 22689846
Chromium: 108.0.5359.215
Node.js: 16.17.1
V8: 10.8.168.25-electron.0
OS: Linux x64 5.19.0-46-generic
  2. Databricks Extension Version
    v1.1.1

Databricks Extension Logs
Please attach the databricks extension logs <--- (this link seems broken.)

Additional context
The problem seems to be the fact that sys.path does not include /Workspace/Users/....
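One way to check this diagnosis is to log where Python looks for a module, both at the top of main.py and from inside the function wrapped by the UDF. The helper below is a minimal sketch using only the standard library; the name `describe_module_lookup` is hypothetical and not part of the extension or of PySpark.

```python
import importlib.util
import sys


def describe_module_lookup(module_name):
    """Report whether `module_name` can be found and which sys.path is in effect."""
    # find_spec returns None when a top-level module cannot be located.
    spec = importlib.util.find_spec(module_name)
    return {
        "module": module_name,
        "found": spec is not None,
        # Path of the file the module would be loaded from, if any.
        "origin": getattr(spec, "origin", None),
        # Copy of the search path seen by the current Python process.
        "sys_path": list(sys.path),
    }


if __name__ == "__main__":
    # 'mylib' is the module from the reproduction steps above.
    print(describe_module_lookup("mylib"))
```

Comparing the output from the entrypoint with the output produced inside the UDF should show whether the two Python processes see different search paths.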

Hi @igorgatis. This is a limitation of Spark. I have reached out to the relevant internal teams for advice about best practices here. I will keep you posted here about updates.

Hey guys,
Did you try adding it to sys.path manually in the file that is the entrypoint of your code?

I managed to get tests working with a modularized library, which seems to be pretty much the same situation, if I'm correct.

Of course, this is not the cleanest way, but it can be a workaround for now.

EDIT: the code I have in my entrypoint file:

[image: screenshot of the entrypoint code]

You will probably have to modify the path for "lib_root", since it depends on how your project is configured.