GrainLearning/grainLearning

in Windows wandb doesn't generate latest-run

luisaforozco opened this issue · 3 comments

When trying to merge the RNN into GrainLearning, the CI/CD showed that all tests were passing on Linux and macOS but not on Windows.
I debugged it on a Windows machine and found that the issue comes from wandb (see reported issue).
The error occurs specifically in the unit test test_rnn_model.py/test_train, at:

assert Path("wandb/latest-run/files/model-best.h5").exists()

Indeed, on Windows, the symlink to latest-run is not automatically created by wandb.

Other options, provided by wandb, to access the files of the latest run include:

  • Syncing the runs to wandb and then retrieving the run with the closest creation date.
  • Creating a docker container to have dry-runs locally on your machine.

Both options are overkill for unit tests.

A dirty option is to manually search for the latest run folder, but this seems hard to generalize across platforms (unix and win32). Branching on a platform variable might be an option, but it comes at a cost: the code becomes more complex, and maintaining it is more error-prone.
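For reference, the manual search could look roughly like the sketch below. It is not what the tests currently do. The function name find_latest_run_dir is hypothetical, and the sketch assumes that wandb run folders have "run-" in their name and that the most recently modified one is the latest run.

```python
from pathlib import Path
from typing import Optional


def find_latest_run_dir(wandb_dir: Path = Path("wandb")) -> Optional[Path]:
    """Return the most recently modified run folder inside wandb_dir, or None."""
    if not wandb_dir.is_dir():
        return None
    # Assumption: run folders contain "run-" in their name.
    # The latest-run symlink itself (if present) is excluded explicitly.
    run_dirs = [
        p for p in wandb_dir.iterdir()
        if p.is_dir() and "run-" in p.name and p.name != "latest-run"
    ]
    # pathlib handles the path separators, so no platform switch is needed here.
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)
```

The assertion in test_train would then check find_latest_run_dir() / "files" / "model-best.h5" instead of going through latest-run, after verifying that the result is not None.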

I have added a decorator to test_train so that it is skipped on Windows (sys.platform=='win32').
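A minimal sketch of such a skip, using only the standard library. The class name TestRNNModel and the placeholder body are hypothetical, and the decorator actually used in the repository may differ (with pytest, the equivalent is pytest.mark.skipif with the same condition and a reason string).

```python
import sys
import unittest


class TestRNNModel(unittest.TestCase):
    # Skip on Windows, where wandb does not create the latest-run symlink.
    @unittest.skipIf(sys.platform == "win32",
                     "wandb does not create latest-run on Windows")
    def test_train(self):
        # Placeholder body: the real test trains the model and checks
        # that wandb/latest-run/files/model-best.h5 exists.
        self.assertTrue(True)
```

On Linux and macOS the test runs as usual; on Windows it is reported as skipped rather than failed.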
While debugging this, I also found that keras models created and saved on macOS cannot be loaded on Windows, whereas a model created and saved on Windows can be loaded on Windows.
Thus, I deactivated the check for loading a full model, which is of course not ideal...

I've created the branch test_windows_wandb_simlink to try a few things that the wandb maintainers suggested. Specifically:
wandb.init(settings=wandb.Settings(symlink=True)), but only on Windows, since on other platforms this setting is not necessary.
The latest-run folder now seems to exist, but I get a new error:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'wandb\\debug-cli.runneradmin.log'

This error is raised when trying to remove the wandb folder. I tried adding if sys.platform=='win32': wandb.finish(), but I still get the error.

After a lot of research on this issue, I have come to the conclusion that wandb does not properly close debug-cli.{user_name}.log until the Python process has finished. This is not a problem on unix, but on Windows it is not possible to delete a folder that contains a file which is still open. This is particularly annoying in the unit tests because the teardown (deleting created folders and files) cannot be completed.
A possible solution is to not delete the wandb folder on Windows, but this means that Windows users running the tests will be left with a stray wandb folder. For the GitHub runners that will not be a problem.
I also tried wandb sync --clean-force, but that throws an error if the user is not logged in, and it could pollute the wandb workspace.
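The option of leaving the wandb folder in place on Windows could be sketched as follows. The helper name remove_wandb_dir is hypothetical and is not part of the current test suite. It assumes that the only acceptable failure is a PermissionError on win32, such as the WinError 32 above.

```python
import shutil
import sys
from pathlib import Path


def remove_wandb_dir(wandb_dir: Path = Path("wandb")) -> bool:
    """Try to delete wandb_dir; return True if it no longer exists afterwards."""
    try:
        shutil.rmtree(wandb_dir)
    except FileNotFoundError:
        # Nothing to clean up.
        pass
    except PermissionError:
        # Only tolerated on Windows, where wandb keeps its debug log open
        # until the Python process exits; elsewhere this is a real error.
        if sys.platform != "win32":
            raise
    return not wandb_dir.exists()
```

The teardown would call remove_wandb_dir() and, on Windows, accept a False return value instead of failing the test.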