databricks/databricks-vscode

I'd like to be able to import python modules from my project folder

KamodaP opened this issue · 7 comments

Hi,

I noticed that other Python files in my project are not visible from the Python file I'm executing. That's not the case if I go to the Databricks web UI, open the "IDE" repo from my repo branch, and run the same file from there - the modules are picked up because "Files in Repos" is enabled.

I tried searching for any possible solution, but it seems that the file executed from VS Code is separate from the specified sync directory, or runs in a different context, and there's nothing I can do to emulate the Databricks web behaviour from VS Code.


Hi,
As a workaround, you can check whether your current working directory points to the root of the project (os.getcwd()) and, if it does not, change the current working directory to the project root (os.chdir()).
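
A minimal sketch of that workaround, assuming the synced project root path below (it is a placeholder, not something the extension provides):

```python
import os
import sys

# Placeholder: replace with the actual root of your synced project.
PROJECT_ROOT = "/Workspace/Repos/<user>/<repo>"

# If the remote execution context starts elsewhere, move to the
# project root so relative paths and imports resolve as expected.
if os.getcwd() != PROJECT_ROOT:
    os.chdir(PROJECT_ROOT)

# Also make sure the root is on the module search path.
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)
```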

Hi @KamodaP. Are you running this file using the "Run Python File" option rather than one of the two Databricks run options? If that is the case, you can add /path/to/project to your python.extraPaths setting.
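
For reference, a minimal sketch of that setting in .vscode/settings.json (the path is a placeholder):

```json
{
  // Placeholder: point this at your local project root.
  "python.extraPaths": [
    "/path/to/project"
  ]
}
```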

For the two Databricks run options, they should emulate the Databricks web UI behaviour out of the box. If you can provide some reproduction steps, I can help you debug this.

Hi,

I am experiencing the same behaviour. I am trying to run the code using "Upload and Run File on Databricks", and it seems that the runtime is not using the current version of my files. E.g. I modified a referenced file, re-synced, and checked that file in the Databricks UI; while it looks fine there, running from VS Code still executes the previous version.
The stacktrace references the following path:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/

It seems to me that the previous versions of my files are cached/stored on the cluster's internal filesystem and the runtime does not refresh them.

This bug does not affect "Run File as Workflow on Databricks"; that mode works just fine.

Hi @mmark188, I am able to edit files and reference them using the "Upload and Run file on Databricks" option, and the updates are being picked up.

What cluster are you using? Are you syncing to a Repo or a Workspace folder? Can you provide the full error message?

Thanks @AndRosis, it actually worked!

@kartikgupta-db Of course I'm using the "run on databricks" option, but not as a workflow. And I'm using repo sync; as mentioned, I can access the repo on the web and it works from there. Based on the workaround from AndRosis, I presume the plugin is missing a step that adds the sync directory to PYTHONPATH (or chdirs there).

The steps to reproduce: create a project with two modules (.py files) where one imports the other, build it with setup.py bdist_wheel, and install the wheel on a cluster. Configure the plugin with az cli and a repo sync, and select the cluster where you installed the wheel. Then create a third module and try to import it from one of the others. Choosing the "run on databricks" option, I see blue Databricks logs in the output stating "module not found".
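
To make the repro concrete, a minimal sketch with hypothetical file and function names:

```python
# helper.py - lives in the synced repo directory (not in the installed wheel)
def greet(name: str) -> str:
    return f"Hello, {name}!"
```

```python
# main.py - launched with the "run on databricks" option.
# Raises ModuleNotFoundError when the sync directory is missing from
# sys.path in the remote execution context.
from helper import greet

print(greet("Databricks"))
```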

@mmark188 are you developing a newer version of a library that is already installed on clusters?

@kartikgupta-db I am using my single user interactive cluster and syncing to a Repo. I cannot provide an error message, because there's nothing specific.

Just to be a bit clearer about my problem and why I think it is connected to the one @KamodaP is struggling with:

My code ran into an unhandled None exception in one of my .py files. After fixing that, I noticed that when running the code using the "Run File on Databricks" option, the same exception is still thrown, and in the stack trace I can clearly see the unmodified version of that method. Checking that file in the web UI shows it in its corrected version.

All in all, there seems to be an inconsistency between what is launched from the VS Code extension and what actually resides in the sync location.
@KamodaP Your problem may be similar, except that your imported module does not exist at all in the runtime context.

@KamodaP

Based on the workaround from AndRosis, I presume the plugin is missing a step that adds the sync directory to PYTHONPATH (or chdirs there).

We actually do a chdir here

From what I understand, you are doing development work on a library that is already installed on the cluster? When you run code from the IDE, we append the cwd to the end of sys.path. This means that local code has lower priority than the installed library when Python resolves imports. When running the same code from the web UI (or using "Run File as Workflow on Databricks"), the cwd is automatically added to sys.path before the library paths. Hence your local changes take precedence.
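
A small sketch of the difference, assuming a placeholder path for the sync directory:

```python
import sys

# Placeholder for the synced project root on the cluster.
SYNC_ROOT = "/Workspace/Repos/<user>/<repo>"

# Appending puts local code AFTER installed libraries, so an installed
# wheel with the same module name shadows your synced sources.
sys.path.append(SYNC_ROOT)

# Inserting at the front puts local code FIRST, matching the web UI
# behaviour: your synced sources shadow the installed wheel.
sys.path.insert(0, SYNC_ROOT)
```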

I believe this also explains the error @mmark188 is seeing, where Python ends up resolving the import inside the library path /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/.

I will add this fix to the next release (happening sometime tomorrow). Please check that out and let me know if that fixes your issue.