huggingface/huggingface_hub

snapshot_download edge case exception

pcuenca opened this issue · 4 comments

Describe the bug

In some cases (I haven't fully debugged yet, but I have a repro), downloading a repo with subdirectories to a "local directory" raises an exception when the local snapshot is partially complete, because the parent subdirectories don't exist. This only seems to happen on subsequent downloads in the same process. I think the repro makes it clearer; please consider the file structure of the repo being downloaded in the script below:

from huggingface_hub import HfApi
import shutil

api = HfApi()
model_id = "pcuenq/t"  # repo containing a subdirectory "b"
local_dir = "t"

def download():
    try:
        api.snapshot_download(repo_id=model_id, local_dir=local_dir)
    except Exception as e:
        print(e)

download()            # first download succeeds
shutil.rmtree("t/b")  # remove the subdirectory from the local snapshot
download()            # second download in the same process raises

The first download succeeds, but the second one fails (the exception is raised) when attempting to move this file to the b directory.

I have verified that creating the directory before this line "fixes" the problem, but there's surely a better way to deal with it, since the subdirectory is created during the first download but not during the second one.

    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst), copy_function=_copy_no_matter_what)

This is occasionally breaking gguf-my-repo (cc @Vaibhavs10) when quantizing models that contain subfolders (there's another bug there where the download patterns are not correctly defined, but I found this issue while debugging it).

Reproduction

As explained above.

Logs

No response

System info

hf_transfer enabled

- huggingface_hub version: 0.25.2
- Platform: Linux-6.8.0-45-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /home/pedro/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: pcuenq
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.4.0
- Jinja2: 3.1.2
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.2.0
- hf_transfer: 0.1.5
- gradio: 4.44.1
- tensorboard: 2.6.2.2
- numpy: 1.26.4
- pydantic: 2.0.3
- aiohttp: 3.8.4
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/pedro/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/pedro/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/pedro/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: True
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

Happy to submit the PR with the workaround above or dig deeper.

Wow, thanks @pcuenca for providing such a reproducible example (would have taken a lot of time to reproduce^^).

Looks like the error happens because, when getting the local file paths (both metadata and destination), we create the directories here:

file_path.parent.mkdir(parents=True, exist_ok=True)

But as an optimization to avoid hitting the disk multiple times when downloading a file, I used an lru_cache here. Therefore, if the same script downloads the same repo twice AND the subfolder has been deleted in between, the mkdir call is made only once, which leads to the error.
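Here is a minimal sketch of that interaction with made-up names (not the actual huggingface_hub internals): a cached helper that both computes the path and creates the parent directory will skip the mkdir on a cache hit, even if the directory has been deleted in the meantime.

from functools import lru_cache
from pathlib import Path
import shutil

@lru_cache(maxsize=128)
def cached_path_with_side_effect(local_dir: str, filename: str) -> Path:
    path = Path(local_dir) / filename
    path.parent.mkdir(parents=True, exist_ok=True)  # side effect gets cached away
    return path

cached_path_with_side_effect("t", "b/file.bin")  # creates "t/b"
shutil.rmtree("t/b")                             # subfolder deleted externally
cached_path_with_side_effect("t", "b/file.bin")  # cache hit: mkdir is NOT re-run
# a later shutil.move() into "t/b" now fails because the directory is missing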

This is a corner case to fix but let's fix it right:

  • either we say that get_local_download_paths shouldn't ensure the paths exist => the rest of the code has to deal with it (such as the suggested dst.parent.mkdir call); see the sketch after this list
  • or we drop the lru_cache => more filesystem calls, but is that really a problem?
  • or we add the dst.parent.mkdir call in all cases as a double check => feels clunky
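For illustration, a rough sketch of what the first option could look like (simplified, hypothetical signatures rather than the actual huggingface_hub code): the cached helper only computes paths, and the caller ensures the parent directory exists right before moving the file into place.

from functools import lru_cache
from pathlib import Path
import shutil

@lru_cache(maxsize=128)
def get_local_download_paths(local_dir: Path, filename: str) -> Path:
    # Pure: no filesystem side effects, so caching the result stays safe.
    return local_dir / filename

def finalize_download(src: Path, local_dir: Path, filename: str) -> None:
    dst = get_local_download_paths(local_dir, filename)
    # The caller re-checks the parent directory on every download.
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))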

cc @hanouticelina if you see another option

thanks a lot @pcuenca for reporting this with a repro!
@Wauplin in my opinion, the first option is the best, i.e. get_local_download_paths should not create directories or modify the filesystem, as that can lead to caching side effects like this one. as you said, we can keep the @lru_cache while making the get_local_download_paths code as "pure" as possible.

My instinct would be to remove the lru cache, as it reduces complexity overall.

For additional context, I'm still not sure why this problem is affecting gguf-my-repo. I think it's because the Space is a long-lived process that downloads and converts models non-stop, and downloaded snapshots are removed upon completion. It could be that one person converts a model and later another person tries to convert the same one, and we hit this issue, but I haven't been able to verify that. So the TL;DR is: caching interacts badly with long-lived processes.

A completely different solution would be to provide a context manager that downloads a repo to a random temporary location, lets people do whatever they need to do, and then removes the downloaded snapshot on exit. This would address this use case, but I feel it's out of scope for huggingface_hub.
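A rough sketch of that idea, using only the public snapshot_download function and the standard library (the temporary_snapshot name and the quantize call are made up for illustration):

import shutil
import tempfile
from contextlib import contextmanager
from huggingface_hub import snapshot_download

@contextmanager
def temporary_snapshot(repo_id: str, **kwargs):
    # Download into a fresh directory so no stale local state can be reused.
    tmp_dir = tempfile.mkdtemp(prefix="hf-snapshot-")
    try:
        yield snapshot_download(repo_id=repo_id, local_dir=tmp_dir, **kwargs)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)

# with temporary_snapshot("pcuenq/t") as local_path:
#     quantize(local_path)  # do the work; the snapshot is removed on exit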