huggingface/huggingface_hub

Developer mode requirement on Windows

LysandreJik opened this issue · 27 comments

The snapshot_download and hf_hub_download methods currently use symlinks for efficient storage management. However, symlinks are not properly supported on Windows, where administrator privileges or Developer Mode need to be enabled for them to be used.
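
To illustrate the limitation, a minimal probe can detect whether symlink creation is allowed in the current environment. This is a hypothetical helper for illustration, not the library's actual implementation:

```python
import os
import tempfile


def symlinks_supported(cache_dir=None):
    """Return True if this process can create symlinks.

    On Windows without admin rights or Developer Mode enabled,
    os.symlink raises OSError. Hypothetical helper, sketch only.
    """
    with tempfile.TemporaryDirectory(dir=cache_dir) as tmp:
        src = os.path.join(tmp, "source")
        dst = os.path.join(tmp, "link")
        open(src, "w").close()  # create an empty file to link to
        try:
            os.symlink(src, dst)
            return True
        except OSError:
            return False
```

On Linux/macOS this returns True for virtually any writable directory; on a stock Windows install it returns False unless Developer Mode is on.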

We chose this approach so that it mirrors the Linux/macOS behavior.

Opening an issue here to track issues encountered by users in the ecosystem:

Thanks for making this issue. Responding to Lysandre's comment here huggingface/transformers#19048 (comment):

No, since 2017 you have not needed to enable Developer Mode to use WSL. I have been running WSL2 with CUDA and never had Developer Mode activated. I also only use WSL2 if I really cannot get software to run natively on Windows. WSL is great, but it has its pitfalls/quirks: the interaction with PyCharm when setting up/managing different environments, but also IO, which is really slow between /mnt/ and the distro's native drive. I prefer native Windows when I can.

So for most of my transformers projects I've used Windows, mostly without issue, or I posted a PR to fix the problem. Windows users seem plentiful. And in my case, I am especially worried about the classroom. At the start I use Colab because it's easy for beginners, but after a while we switch to a local installation and running scripts instead of notebooks, so they at least learn about those things. I cannot require them to enable Developer Mode (liability), and WSL would take too much class time to get working for everyone.

Outside of the classroom, think of anyone who is just getting started with transformers and wants to play around with it on their own system: gamers, hobbyists, enthusiasts, or anyone with an interest in AI-related things. Windows users are still the majority (even among developers) and they are not all hardcore PC enthusiasts. Even those who are (like me) may not be eager to take this additional step to gain access to transformers. So my issue is that instead of making access to AI easier (as Hugging Face has done in the past, for which I am of course grateful), this change actually makes it harder for a portion of users. I am not sure the advantages outweigh the disadvantages, because everything worked fine before.

For our frontend, and many others as well, this will be a big issue: it is very hard to explain to users why they need Developer Mode or administrative permissions just to download a model. This seems very suspicious to a user who is not aware of the technical limitation. It would be much nicer if we had control over this, such as being able to use the old cache system if desired.

Posting here to keep an eye on this. Fully agree with the sentiment. Requiring developer mode will limit the ability to use transformers in programs that are given to non-developers. An average user will not want to enable developer mode when it's full of warnings.

I'm curious whether this affects many Windows users or not. Let's see if more users post about this in the coming days (to see how we could fix it). Will also ask on Twitter =)

As @henk717's, @ebolam's and my posts suggest, the problem is not just with developers. If they really need to and you do not give them a choice, I guess they will enable Developer Mode or be forced to switch to WSL (but again: I was not eager to enable dev mode on my machine either; it comes with its own risks).

A major problem is the consequence for non-developers: hobbyists, students, end-users who try to run products/demos/tools that use transformers under the hood. They won't know what to do, or will not want to start enabling things in Windows that give a big security warning. Those are exactly the new people you'd want to get interested, not scare away. The results of both your Twitter poll and the comments on GitHub will therefore be heavily biased towards developers.

As @henk717 suggests, keeping the option to use the old cache mechanism would be great.

The results of both your Twitter poll and the comments on GitHub will therefore be heavily biased towards developers.

Yes you're right @BramVanroy

Adding my comment here (as suggested by @julien-c). Activating developer mode is asking for trouble. On top of that, I had to drop support for WSL because my developer PC simply cannot handle it (running it takes 75% of available RAM and it is impossible to shut down once activated, not to mention that I cannot even run Dropbox on my Windows machine without constant hanging). Turning on WSL and developer mode not only requires users to run a beefy PC, it also pushes them towards security warnings and dialogs that I have Pavlov'ed normal users into pressing "no" on for security reasons.

Following this closely as I faced the same issue I posted in the HF forum here:

https://discuss.huggingface.co/t/symlink-error-when-importing-ortseqclass-model-via-pipeline/20706

I am not a developer and was not too comfortable running my PC in dev mode. I tried experimenting to see if that would solve it, but got the same error after switching dev mode on and doing a full restart.

A major problem is the consequence for non-developers: hobbyists, students, end-users who try to run products/demos/tools that use transformers under the hood. They won't know what to do, or will not want to start enabling things in Windows that give a big security warning. Those are exactly the new people you'd want to get interested, not scare away. The results of both your Twitter poll and the comments on GitHub will therefore be heavily biased towards developers.

Very much this. transformers is a dependency of multiple user-friendly or end-user experiences: KoboldAI in our case, but on Steam there are games like AI Roguelite and AIDventure that also use transformers as a dependency. Explaining to a user who buys an end-user product in a storefront that they need to manually change Windows settings for the program to work correctly is a step too far, and it should not be automated away either, since that would mean changing security settings behind the user's back.

So a solution that does not require system-wide changes is very much desired for applications that are either portable or installed through automated means.

Another thing: it's Windows now, but what if you forced the user to use root to install the package under Linux? Would you force people to use "sudo" to run transformers, knowing that it uses PyTorch, which has an exploit in its ecosystem that you are currently scanning for?

Hey, thanks all for your comments. We agree that it shouldn't behave this way under Windows and we're working on a solution.

Hi everyone and thank you for the feedback.

Here is a PR implementing a cache-system variant that does not need symlinks when they are not available: #1067. See the description for more details. Any feedback is very welcome! :)

#1067 is now merged and will be available in the next release of huggingface_hub. For now, one can pip install from the main branch to run on Windows.
I'll keep this thread updated once the release is made.

@BramVanroy and others in this thread, would you mind please updating your install to use the updated main branch and let us know if it works out for you?

You can update with the following:

pip install git+https://github.com/huggingface/huggingface_hub

Using any from_pretrained method should work after that.

Thank you!

@BramVanroy and others in this thread, would you mind please updating your install to use the updated main branch and let us know if it works out for you?

The implementation works within our software. I did notice a performance regression compared to the old cache, however, since it takes time (and space) to copy the files to the snapshot folder. Moving the files, or referencing the blob location through text files (effectively fake symlinks), would be a leaner solution with less impact on users with slow storage.

I did the pip install but I am now getting a different error:

[screenshot]

Error message:

[screenshot]

I have the following env variables set:

HF_HOME: D:\nickm\Transformers
TRANSFORMERS_CACHE: D:\nickm\Transformers

Hi @nickmuchi87, thanks for reporting this bug. I created a PR (#1077) to make the are_symlinks_supported() check more reliable. This should solve your problem. Could you try installing from the branch and confirm?

pip install git+https://github.com/huggingface/huggingface_hub@1062-check-symlink-supported-in-cache-dir

@henk717 We understand your concern. To be fair, we hesitated between two workarounds for when symlinks are not supported. Both solutions (fake symlinks or duplicated files) have their own pros and cons. Please let me share the elements on which we based the decision to go for duplicated files:

Fake symlinks

With fake symlinks, a text file containing only the path to the actual blob file is stored under the snapshot path. The user then has to read this file first to find the blob path before getting the actual content of the file.

Pros:

  • minimal impact on storage
  • minimal impact when downloading (no files copied)

Cons:

  • extra logic when reading the file content
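
The fake-symlink idea can be sketched as follows. This is an illustration of the concept described above, not the implementation that was merged; the function names are hypothetical:

```python
import os


def write_fake_symlink(link_path, blob_path):
    # The snapshot entry is a plain text file containing the blob path,
    # standing in for a real symlink.
    with open(link_path, "w", encoding="utf-8") as f:
        f.write(blob_path)


def read_through_fake_symlink(link_path):
    # The extra logic mentioned in the cons: first read the pointer
    # file, then open the blob it points to.
    with open(link_path, "r", encoding="utf-8") as f:
        blob_path = f.read()
    with open(blob_path, "rb") as f:
        return f.read()
```

Every consumer of the cache would need the second function, which is exactly the indirection the team wanted to avoid pushing into downstream libraries.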

Duplicate files

In this approach, we do not use the blob directory anymore but place the files directly in the snapshot directory.

Pros:

  • no extra logic required once the file is cached
    • this is particularly important for us, as huggingface_hub is a package meant to be re-used by a lot of external libraries. We currently have a helper snapshot_download() that downloads a snapshot of a full repo and returns the path of the directory in the cache. Once this path is returned, the user expects all the files to be under that directory; in particular, one can do snapshot_path.glob("*") to iterate over all the files of the snapshot. With fake symlinks, we could not guarantee that the files are "actual" files and not symlinks.
    • having to handle fake symlinks in downstream libraries is really something we want to avoid, as we don't have a hand in all of them. Besides, it would be very confusing/frustrating for maintainers to have to debug a fake-symlink issue for one of their users if they are not even aware of the fake-symlink specificity (I don't expect everyone to carefully read the documentation on this topic).
  • compatibility: if a Windows user starts using huggingface_hub without developer mode and then enables it after a while, the cache structure and usage will be exactly the same. In particular, there will be no need to handle both real symlinks and fake symlinks on the same machine/cache folder. Caching will be optimized for newly downloaded models, but previously downloaded files will still work fine.

Cons:

  • more disk usage when downloading several revisions of the same model
  • more execution time when downloading if a file is copied locally
    • however, this is only a corner case: it happens when a blob file already existed in the blob folder (and is then copied to the snapshot folder), which can only occur if the same model was previously downloaded using dev mode/admin rights. If the cache is clean (never used symlinks before), the downloaded file is moved to the snapshot folder using os.replace(), which should be instant.
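
The move-vs-copy distinction in the cons above can be sketched like this. A simplified, hypothetical helper, not the merged code from #1067:

```python
import os
import shutil


def place_file_without_symlinks(src, dst, new_blob):
    """Sketch of the fallback used when symlinks are unavailable.

    A freshly downloaded file (new_blob=True) is *moved* into the
    snapshot folder, which is effectively instant. Only a blob that
    already existed (downloaded earlier with symlinks enabled) needs
    a real *copy*, which costs time and disk space.
    """
    if new_blob:
        os.replace(src, dst)       # rename: no duplication, no copy time
    else:
        shutil.copyfile(src, dst)  # duplicate: the corner case above
```

Since os.replace() is a rename within the same filesystem, the common path (a clean cache) pays no copying cost at all.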

Conclusion

As you may have understood, we favored an approach that minimizes the friction of integrating huggingface_hub everywhere over a fully optimized cache system on all platforms/machines. This decision is also motivated by the fact that it is always possible to activate Developer Mode on Windows to support symlinks. The comments in this thread frequently mentioned the need to support any Windows machine, especially to allow anyone to use huggingface_hub/transformers for teaching, experimenting, studying, in games, ... All those use cases are legit and now supported with no downside. Only users with more intensive usage (e.g. iterating over revisions of the same model) will be affected, and that's where we advise enabling Developer Mode. (EDIT: see #1062 (comment) for clarification)

Please also note that we provide a delete-cache command to make it easy for users to clean up their disk.

I hope this message helps everyone here (and in the future) understand the workaround we chose to implement and the reasons behind it. Please let me know if you have any further questions.

(Note: for any questions on how the cache-system is working, you can refer to this documentation page)

Hi @nickmuchi87, thanks for reporting this bug. I created a PR (#1077) to make the are_symlinks_supported() check more reliable. This should solve your problem. Could you try installing from the branch and confirm?

pip install git+https://github.com/huggingface/huggingface_hub@1062-check-symlink-supported-in-cache-dir

I tried installing the above:

[screenshot]

then restarted my JupyterLab, but I'm getting the same error; not sure what I am doing wrong here:

[screenshot]

@Wauplin Thanks for the elaborate write-up! While I do not have time now to investigate the details (and noting that @nickmuchi87 is already testing your changes), I am extremely satisfied with the quick PR to make transformers tools fully and openly accessible to Windows users again. While I can understand @henk717's comment about disk usage, I am thankful for your elaborate motivation. The decision seems sensible to me.

One question. You write the following:

Only users with more intensive usage (e.g. iterating over revisions of the same model) will be affected, and that's where we advise enabling Developer Mode.

Does this mean that only revisions of the same model will be duplicated, or does this also affect regular models? While I do not often use revisions of the same model, I do switch between a lot of different models. Does that mean they will all be duplicated?

(This is more of a hypothetical question, because I ended up enabling developer mode to continue my current work with transformers.)

@nickmuchi87: I suspect the huggingface_hub lib did not install correctly, as the last chunk of code in the traceback should look like this:

    cache_dir = os.path.dirname(os.path.commonpath([src, dst]))
    if are_symlinks_supported(cache_dir=cache_dir): # <- this line changed in the new branch !
        os.symlink(relative_src, dst)
    elif new_blob:
        os.replace(src, dst)
    else:
        shutil.copyfile(src, dst)

Could you try uninstalling/reinstalling in your env? Thanks in advance, and sorry for the inconvenience.

pip uninstall huggingface_hub
pip install git+https://github.com/huggingface/huggingface_hub@1062-check-symlink-supported-in-cache-dir

@BramVanroy Thank you for your feedback. Here are some more detailed explanations:

Does this mean that specifically only revisions of the same model will be duplicated?

What I mean by "revisions of the same model" is downloading multiple versions of a single repo.
Let's consider the repo distilgpt2. If you download the model from main, you'll have the model only once in your cache, the same as any other user. If you then want to test the model from branch refs/pr/6 (say, to review the PR), you will end up with duplicate files on your system. That's what I meant by having "a more intensive usage" than just experimenting around.

(For the record, I often mix up "repo" and "model" in the wording. In transformers the user only sees "models", but in the background all models are stored as git repositories on the HF Hub.)

Or does this also affect regular models? While I do not use revisions of the same model often, I do switch between a lot of different models often. Does that mean that they will all be duplicated?

I'm not sure what you mean by "regular models". If you switch between different models (e.g. gpt2, distilgpt2, gpt2-large, xlnet-base-cased, ...), each model will be individually downloaded into your cache, as for any user. Blob files are never shared between repositories in the cache, no matter the platform. Once they have been downloaded, you can switch between them as many times as you want without re-downloading anything.
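
To make the per-repo isolation concrete: each repo gets its own cache folder (named like models--org--name in the current layout), with its own blobs directory inside, so blobs from different repos can never collide. A small sketch of the naming scheme, not the library's exact helper:

```python
def repo_folder_name(repo_id, repo_type="model"):
    # Mirrors the cache layout where "/" in a repo id is replaced by
    # "--", e.g. "distilbert/distilgpt2" -> "models--distilbert--distilgpt2".
    # Illustrative sketch of the naming scheme only.
    return f"{repo_type}s--" + repo_id.replace("/", "--")


# Each repo resolves to a distinct folder, so their blobs never mix:
repo_folder_name("gpt2")                   # "models--gpt2"
repo_folder_name("distilbert/distilgpt2")  # "models--distilbert--distilgpt2"
```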

Hope this makes it clearer for you now :)

@Wauplin that worked: I uninstalled and re-installed using the Anaconda Prompt this time instead of Jupyter. Thanks for that, appreciate the efforts!

@nickmuchi87 Perfect! 🎉

@Wauplin If I understand correctly, this means that downloading many different repos won't give me duplicated data, and that duplication only happens when downloading different commits/revisions of the same model repo. So that's a non-issue for me.

Thanks again for your work!

@BramVanroy exactly :)

Fixed by #1067 and #1077 and released in huggingface_hub v0.10. I'm closing this issue.