huggingface/huggingface_hub

HF Upload file doesn't work over 10MB

orionw opened this issue · 9 comments

Describe the bug

I can't push a file to the Hub that's over 10MB, even though I am using the HF API, which should automatically track it as LFS. I thought this had been addressed, but for some reason it is still happening to my repo.

The file can't be attached here, but it can be downloaded here: https://drive.google.com/file/d/1hRIZZHc1pHklobM-wSbH7UPrBJg5pPwy/view?usp=sharing

I have about 2000 of these...

Reproduction

api.upload_file(
    path_or_fileobj=PATH_TO_FILE,
    path_in_repo="train/starcoder/starcoder_0040-tokenized-chunked-1024-512-128-backfill-nodups/shard.00056.mds",
    repo_id="orionweller/generic_data_v2",
    commit_message=f"Upload missing file: {file}",
    repo_type="dataset",
)

Logs

Bad request for commit endpoint:
Your push was rejected because it contains files larger than 10 MiB. Please use https://git-lfs.github.com/ to store large files. See also: https://hf.co/docs/hub/repositories-getting-started#terminal
Offending files:
- .gitattributes (ref: refs/heads/main)

System info

- huggingface_hub version: 0.26.2
- Platform: Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.28
- Python version: 3.11.9
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /home/oweller2/my_scratch/cache/token
- Has saved token ?: True
- Who am I ?: orionweller
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.4.0
- Jinja2: 3.1.4
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.3.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: N/A
- aiohttp: 3.10.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/oweller2/my_scratch/cache/hub
- HF_ASSETS_CACHE: /home/oweller2/my_scratch/cache/assets
- HF_TOKEN_PATH: /home/oweller2/my_scratch/cache/token
- HF_STORED_TOKENS_PATH: /home/oweller2/my_scratch/cache/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

Hmm, I had manually updated the .gitattributes to add those files individually (which pushed it over 10MB), but then I consolidated the entries to cover directories instead of individual paths and now it works.

This is no longer reproducible since I'm finishing the upload, but that seems to be the bug: once the .gitattributes is close to 10MB, you can't use this API anymore, because it tries to append new entries to the .gitattributes, which then exceeds the 10MB limit and can neither be updated itself nor let any other files be uploaded.
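For anyone else who hits this, here's a minimal sketch of how I could have checked how close the repo's .gitattributes was to that limit (this only uses the public HfApi.get_paths_info; the repo id is mine, swap in your own):

from huggingface_hub import HfApi

api = HfApi()

# Look up the current size of .gitattributes in the dataset repo.
info = api.get_paths_info(
    repo_id="orionweller/generic_data_v2",
    paths=[".gitattributes"],
    repo_type="dataset",
)[0]

# Regular (non-LFS) files in a commit are capped at 10 MiB on the Hub.
print(f".gitattributes is {info.size / 2**20:.2f} MiB out of the 10 MiB limit")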

Hi @orionw, sorry for the inconvenience. Usually binary files are already tracked as LFS by default, so this error is very rare. It also only seems to happen when a repo has over 60k files spread across folders and subfolders with very long names (i.e. really not often 😄).

Anyway, editing the .gitattributes file was indeed the way to go here! The best solution is to track all .mds files as LFS instead of doing it folder by folder. I have opened a PR that should definitely solve your issue: https://huggingface.co/datasets/orionweller/generic_data_v2/discussions/55

# MDS files
*.mds filter=lfs diff=lfs merge=lfs -text
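If you ever need to apply this kind of rule yourself on another repo, here is a rough sketch of how to append it to the existing .gitattributes through the API (the repo id is just the one from this issue; adapt it to your own repo):

from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
repo_id = "orionweller/generic_data_v2"  # example repo, use your own

# Download the current .gitattributes, append the LFS rule if missing, and re-upload it.
local_path = hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=".gitattributes")
with open(local_path) as f:
    content = f.read()

if "*.mds" not in content:
    content += "\n# MDS files\n*.mds filter=lfs diff=lfs merge=lfs -text\n"
    api.upload_file(
        path_or_fileobj=content.encode(),
        path_in_repo=".gitattributes",
        repo_id=repo_id,
        repo_type="dataset",
        commit_message="Track *.mds files with LFS",
    )

Note that .gitattributes itself must stay under 10 MiB for this commit to go through, which is exactly the limit discussed above.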

Given that the hard limit for a repo is 100k files, hitting the 10MB limit on the .gitattributes file is very unlikely (and sorry it happened to you ^^). Therefore, we will most probably not optimize the Hub for this use case.

Btw, I saw in https://huggingface.co/datasets/orionweller/generic_data_v2/discussions that you use the "multi_commit" flag a lot to upload large folders. This was an experimental feature designed to test things out in the past. We've recently reworked that workflow into a similar feature that is easier to work with. I'd recommend using the new hf_api.upload_large_folder method or the huggingface-cli upload-large-folder command instead. More details here: https://huggingface.co/docs/huggingface_hub/guides/upload#upload-a-large-folder (you'll need to update your huggingface_hub version as well).
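As a rough sketch (the folder path is a placeholder and the repo id is the one from this issue), the new method looks like this:

from huggingface_hub import HfApi

api = HfApi()

# Resumable, multi-worker upload of a whole local folder (replaces the old multi_commit flow).
api.upload_large_folder(
    repo_id="orionweller/generic_data_v2",
    repo_type="dataset",
    folder_path="/path/to/local/data",  # placeholder path
)

# The CLI equivalent is roughly:
#   huggingface-cli upload-large-folder orionweller/generic_data_v2 /path/to/local/data --repo-type=dataset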

I'm closing this issue now but feel free to let me know if you have any further questions 🤗

Thanks @Wauplin! I had tried the large folder upload but it failed with this error, so I started doing more granular operations to figure it out. It’s a great new feature!

Do you think it would be possible to add this line to the default HF .gitattributes file? MDS files are seeing increasing usage these days, and there is even an issue open to add MDS as a file type supported by the dataset viewer.

To be honest, I'm not so familiar with MDS format. What is it used for in the ML community?

Sorry, here's the issue link: huggingface/datasets#6736. Should I raise the request for a default .gitattributes line in the datasets repo, or is it a Hub issue?

It's MosaicML's (now part of Databricks) format for data loading; it has a bunch of useful features for pre-training. Quite a few models are trained with it these days, at least in open source.

Thanks for the context @orionw! I've asked around internally and we decided to automatically track .mds files as LFS in the .gitattributes for all new datasets. This should be effective within a few days :)

We could do some light joint comms with Databricks announcing that MDS is auto-tracked by default on the HF Hub!