usage in multi-user environments -> automated weights download
Opened this issue · 20 comments
I'm trying to install membrain-seg in a multi-user environment. Without an automated weights download, each user has to be told to download the checkpoint and send it to the cluster before they can use the program, which is not ideal.
I'll outline my suggested implementation below - feedback welcome!
Zenodo recently added support for multiple curators so hosting the data on zenodo will work well.
- I would upload an entry for the weights with your own account, then invite my account (alisterburt) to also manage the entry.
- Add code for automatically downloading the weights using pooch
- make sure pooch is added as a dependency to pyproject.toml
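A minimal sketch of what the pooch step in the list above could look like. `pooch.retrieve` is a real pooch function; the helper name `fetch_weights`, the injectable `retrieve` argument, and the example URL in the comment are assumptions made for illustration, since the actual file name on Zenodo is not given here.

```python
from pathlib import Path
from typing import Callable, Optional


def fetch_weights(
    url: str,
    known_hash: Optional[str] = None,
    cache_dir: Optional[Path] = None,
    retrieve: Optional[Callable[..., str]] = None,
) -> Path:
    """Download a checkpoint once and reuse the cached copy on later calls.

    `retrieve` is a hypothetical hook so the function can be exercised
    without network access; by default it is `pooch.retrieve`.
    """
    if retrieve is None:
        import pooch  # imported lazily; pooch must be listed in pyproject.toml

        retrieve = pooch.retrieve
    # known_hash=None skips the integrity check; a real release should pin
    # the file's checksum. path=None lets pooch pick its default OS cache.
    local_file = retrieve(url=url, known_hash=known_hash, path=cache_dir)
    return Path(local_file)


# Hypothetical usage (the file URL is a placeholder, not the real one):
# fetch_weights("https://example.org/path/to/checkpoint.ckpt")
```

pooch only downloads when the file is missing from the cache directory, so repeated calls return the existing local path.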
@LorenzLamm what do you think?
Absolutely agreed! I'll look into it today.
Thanks a lot for your implementation suggestion. @rdrighetto and I also looked into some options other than Zenodo, but couldn't find a more comfortable solution.
So let's go for it :)
Sorry for this feature taking so long.
A couple comments on this:
- The problem with a multi-user environment like this seems to be more that there is no shared location where one person could upload the weights and everybody can read them from there? This is how our lab uses membrain-seg on our cluster. This is not a membrain-seg issue IMO.
Having said that:
- Given the weights are heavy (~700 MB) I would never have them downloaded automatically without explicit consent from the user.
Having said that [2]:
- I totally agree it's great and scientifically a good practice to share the weights on Zenodo and include a feature in membrain-seg to fetch it from there :-)
Yes, I agree. It should not be downloaded automatically.
I guess it's best to add the downloading functionality to the CLI?
Maybe we could also ask the user whether an automatic download should be performed in case no path to a checkpoint is provided for segmentation.
Yes, I think a CLI option to download the weights (with an optional path argument) is the way to go.
And the check in case a checkpoint has not been provided is nice to have, but not essential IMO.
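For illustration, a rough sketch of such a download command with an optional path argument and an explicit consent prompt. It uses `argparse` from the standard library as a stand-in and is not membrain-seg's actual CLI; the subcommand name, the `--out-folder` and `--yes` flags, and the 700 MB figure in the prompt (taken from the estimate above) are assumptions.

```python
import argparse
from pathlib import Path
from typing import Callable, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser with a single hypothetical `download` subcommand."""
    parser = argparse.ArgumentParser(prog="membrain-download-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    download = sub.add_parser("download", help="fetch the pretrained weights")
    download.add_argument(
        "--out-folder", type=Path, default=None,
        help="where to store the weights (default: a user-space cache)",
    )
    download.add_argument(
        "--yes", action="store_true", help="skip the confirmation prompt",
    )
    return parser


def confirm_download(size_mb: int = 700, ask: Callable[[str], str] = input) -> bool:
    """Ask for explicit consent before a large download; default is 'no'."""
    answer = ask(f"Download ~{size_mb} MB of model weights? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


def main(argv: Optional[List[str]] = None, ask: Callable[[str], str] = input) -> int:
    args = build_parser().parse_args(argv)
    if not (args.yes or confirm_download(ask=ask)):
        print("Download cancelled.")
        return 1
    target = args.out_folder if args.out_folder is not None else "the default cache"
    # The actual download call would go here; this sketch only reports it.
    print(f"Would download weights to {target}")
    return 0
```

The prompt defaults to "no", so nothing heavy is fetched unless the user types `y` or passes `--yes`.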
MemBrain-seg models now available on Zenodo: https://zenodo.org/records/10633840
However, I was not able to add you as data curator, @alisterburt . As far as I understand, this feature works only for Zenodo communities, where multiple people can take various roles.
Submissions can then be added to this community and community data curators can edit the submissions associated to the community.
So maybe it makes sense if you create a teamtomo community on Zenodo, and we add the MemBrain-seg weights submission to this community?
Is it possible to have the weights available by default rather than needing their path to be specified every time if they are in a shared location with the proposed setup? This would be ideal - otherwise agree with what you've all said!
We could also have a prompt at the CLI which asks if you want to download/where you want to download to and this could run by default on the first run of the program, rather than a separate download program?
Basically, can we persistently set the default location for the checkpoint from the download code
Yes, would be cool if the checkpoint argument doesn't need to be passed every time membrain-seg is running.
Would it make sense to have a config file with the location to the checkpoint file?
If the config file is empty, we can ask whether the automatic download should be run and then add the download path to the config file.
Or is there a more elegant way to do this?
I don't love config files for such a simple application, they introduce extra state and an extra step to running the program
Thanks for looking at Zenodo, I'll look into a community today :-)
Agree with @alisterburt that config files should be avoided, especially for simple things like this. My suggestion is to specify the checkpoint path through an environment variable like `MEMBRAIN_SEG_MODEL` (the approach I took for the Scipion plugin).
In a multi-user setup this environment variable can be set when loading the MemBrain-seg module, for example. I don't know if it's possible to set arbitrary environment variables when activating a conda environment, but that would be an alternative as well. In any case, I wouldn't worry so much about how this environment variable is set because it's something highly dependent on the computational environment.
If the environment variable is empty/unset then the model must be specified via the CLI argument (as is currently).
If the model is neither provided by the environment variable nor by the CLI, then we ask if it should be downloaded.
Alternatively, we don't ask anything but just add an extra option to `membrain segment` for downloading the model.
@rdrighetto `MEMBRAIN_SEG_WEIGHTS` works nicely!
You can set arbitrary environment variables when activating a conda environment on a Unix system by adding shell scripts to the `activate.d` and `deactivate.d` directories inside `<env_path>/etc/conda`
(warp_build) [burta2@ec-hub1-sc1 ~]$ ls /home/burta2/mambaforge/envs/warp_build/etc/conda/
activate.d/ deactivate.d/
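As a sketch of how an admin could populate those two directories, the helper below writes a small `export` script and a matching `unset` script. The function name, the script file name `membrain_seg_weights.sh`, and the example paths are assumptions for illustration; only the `activate.d`/`deactivate.d` layout comes from conda itself.

```python
from pathlib import Path


def write_conda_env_var_scripts(env_path: Path, name: str, value: str) -> None:
    """Write activate/deactivate scripts that set and unset one variable.

    conda sources every *.sh file in <env_path>/etc/conda/activate.d on
    activation and in deactivate.d on deactivation.
    """
    conda_dir = Path(env_path) / "etc" / "conda"
    scripts = {
        "activate.d": f'export {name}="{value}"\n',
        "deactivate.d": f"unset {name}\n",
    }
    for subdir, content in scripts.items():
        folder = conda_dir / subdir
        folder.mkdir(parents=True, exist_ok=True)
        # Hypothetical file name; any *.sh name works. An existing file
        # with the same name is overwritten.
        (folder / "membrain_seg_weights.sh").write_text(content)


# Hypothetical usage with placeholder paths:
# write_conda_env_var_scripts(
#     Path("/path/to/envs/membrain"), "MEMBRAIN_SEG_WEIGHTS", "/shared/weights.ckpt"
# )
```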
I would still add a code path for being prompted to download the weights to a user-space cache if they are not provided at the CLI or in the environment variable.
okay: to summarise
Ideal version of the future
- model weights can be pointed to from an environment variable called `MEMBRAIN_SEG_WEIGHTS`
- if a checkpoint file is provided at the CLI it will override the env variable
- if no checkpoint file is provided at the CLI and no env variable is set then the cache is checked with pooch; if the file is not present the user is prompted to download the weights to the user-space cache
sound good?!
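The lookup order summarised above could be sketched roughly like this. It is an illustration, not the membrain-seg implementation: the function name, the injectable `env`, `ask`, and `download` arguments, and the plain `exists()` check standing in for pooch's cache lookup are all assumptions.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional

ENV_VAR = "MEMBRAIN_SEG_WEIGHTS"


def resolve_checkpoint(
    cli_path: Optional[str],
    cached_file: Path,
    download: Callable[[Path], None],
    env: Mapping[str, str] = os.environ,
    ask: Callable[[str], str] = input,
) -> Optional[Path]:
    """Return the checkpoint to use, or None if the user declines a download."""
    # 1. An explicit CLI argument always wins.
    if cli_path:
        return Path(cli_path)
    # 2. Otherwise fall back to the environment variable, if set and non-empty.
    env_path = env.get(ENV_VAR)
    if env_path:
        return Path(env_path)
    # 3. Otherwise look in the user-space cache (pooch would do this check).
    if cached_file.exists():
        return cached_file
    # 4. Nothing found: download only with explicit consent.
    answer = ask(f"No checkpoint found. Download weights to {cached_file}? [y/N] ")
    if answer.strip().lower() in {"y", "yes"}:
        download(cached_file)
        return cached_file
    return None
```

Passing the checkpoint on the command line keeps working exactly as before, because the CLI value is checked first.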
I created a "community" on zenodo - could I get your zenodo usernames so we can all curate the record? I need to figure out how to add existing records to a community too :-)
Great, thanks for creating the community! My username is Lorenz_Lamm.
As far as I understand anyone can apply for adding their existing Zenodo entries to a community, so existing records shouldn't be a problem :)
I like the setting of environment variables, but is there a clever way to edit the activation scripts? (I guess also depends on conda vs virtualenv vs ...). So probably the user would need to do that manually, right? I feel this makes the threshold to use it a bit higher.
Would something like python-dotenv (https://pypi.org/project/python-dotenv/) be an option? We could create an .env file (I know, it's basically a config file :D), but fill it automatically when running membrain-seg and downloading a model to a specified location. This could reduce user input and make it maybe easier to use.
okay: to summarise
Ideal version of the future
- model weights can be pointed to from an environment variable called `MEMBRAIN_SEG_WEIGHTS`
- if a checkpoint file is provided at the CLI it will override the env variable
- if no checkpoint file is provided at the CLI and no env variable is set then the cache is checked with pooch; if the file is not present the user is prompted to download the weights to the user-space cache
sound good?!
Sounds great to me!
I agree that python-dotenv could be a good solution for going this way, but you have to think about how to ensure that, in a multi-user environment like the one @alisterburt describes, the weights are downloaded only once and the same environment remains available to everyone. This would be achieved by ensuring the path where the model is downloaded is visible to all users, but how will you know what this path is? For a single user it sounds perfectly fine, if that's what you are thinking.
Adding/editing conda activation scripts does add more complexity and it should not be required, but we could still provide instructions for users that want to go this way. I'm afraid we are trying to guess too much about everyone's computational setup, and there's no solution that is gonna work for everyone. That's why I ultimately don't care how this environment variable will be set. For a start I would just implement the ability to read the weights from `MEMBRAIN_SEG_WEIGHTS`.
Most importantly, I think the current way of using membrain-seg (explicit CLI option with the checkpoint) should remain unchanged to minimize disruption to users' workflows.
@rdrighetto do you have a zenodo account?
@LorenzLamm you should have an invite to the zenodo community
I'm with @rdrighetto here: the people setting up the installation in the multi-user environment will be comfortable with setting an environment variable in the activation script. For single-user situations, the weights downloaded/cached with pooch will work transparently. python-dotenv looks cool, but the introduction of extra state to manage feels a little unnecessary in this case, and the dotenv looks to be user-specific by default, so not quite right for a multi-user install solution :-)
Okay, sounds good. Then let's add that as some advanced installation instruction somewhere, but keep the basic functionalities as they are. :)
Thanks for adding me to the Zenodo community. I just accepted the invitation, and submitted the previous upload to the community.
I cannot accept at the moment, though. Maybe it takes a while until my membership status is updated?
Yeah probably takes some time to sync, I accepted the entry in any case :-)