huggingface/huggingface_hub

OSError: Consistency check failed: file should be of size 18612 but has size 18605 (datasets/tau/scrolls@main/scrolls.py).


Describe the bug

```python
from datasets import load_dataset, DownloadConfig
from datasets import Dataset

# Note: cleanup_cache_files is an instance method; referencing it on the
# class without calling it, as here, has no effect.
Dataset.cleanup_cache_files

scrolls_datasets = ["quality"]
download_config = DownloadConfig(force_download=True)
data = [load_dataset("tau/scrolls", dataset, force_download=True, download_config=download_config) for dataset in scrolls_datasets]
```

Reproduction

No response

Logs

```
$python main.py
Downloading builder script:  20%|██████████████████▌                                                                           | 3.68k/18.6k [00:00<00:01, 11.4kB/s]
Traceback (most recent call last):
  File "main.py", line 12, in <module>
    data = [load_dataset("tau/scrolls", dataset, force_download=True, download_config=download_config) for dataset in scrolls_datasets]
  File "main.py", line 12, in <listcomp>
    data = [load_dataset("tau/scrolls", dataset, force_download=True, download_config=download_config) for dataset in scrolls_datasets]
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 2606, in load_dataset
    builder_instance = load_dataset_builder(
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 2277, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1923, in dataset_module_factory
    raise e1 from None
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1889, in dataset_module_factory
    return HubDatasetModuleFactoryWithScript(
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1507, in get_module
    local_path = self.download_loading_script()
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1467, in download_loading_script
    return cached_path(file_path, download_config=download_config)
  File "/opt/conda/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 211, in cached_path
    output_path = get_from_cache(
  File "/opt/conda/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 690, in get_from_cache
    fsspec_get(
  File "/opt/conda/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 396, in fsspec_get
    fs.get_file(path, temp_file.name, callback=callback)
  File "/opt/conda/lib/python3.8/site-packages/huggingface_hub/hf_file_system.py", line 640, in get_file
    http_get(
  File "/opt/conda/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 570, in http_get
    raise EnvironmentError(
OSError: Consistency check failed: file should be of size 18612 but has size 18605 (datasets/tau/scrolls@main/scrolls.py).
We are sorry for the inconvenience. Please retry with `force_download=True`.
If the issue persists, please let us know by opening an issue on https://github.com/huggingface/huggingface_hub.
Downloading builder script: 100%|█████████████████████████████████████████████████████████████████████████████████████████████▉| 18.6k/18.6k [00:00<00:00, 36.0kB/s]
```
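For what it's worth, the failing check is the size comparison in `huggingface_hub`'s `http_get` (the last frame of the traceback): the file is streamed to disk and the number of bytes written is compared against the size the server reported. A simplified sketch of that logic, illustrative only and not the actual library code:

```python
from typing import Optional

import requests


def http_get_sketch(url: str, dest: str, expected_size: Optional[int]) -> None:
    """Stream `url` to `dest` and verify the byte count (simplified sketch)."""
    written = 0
    with requests.get(url, stream=True, timeout=10) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
                written += len(chunk)
    # A connection dropped mid-stream leaves `written` short of
    # `expected_size`, which is exactly the failure in the log above
    # (18605 bytes written vs. 18612 expected).
    if expected_size is not None and written != expected_size:
        raise OSError(
            f"Consistency check failed: file should be of size "
            f"{expected_size} but has size {written}"
        )
```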

System info

- huggingface_hub version: 0.25.1
- Platform: Linux-4.9.151-015.ali3000.alios7.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.18
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /ossfs/workspace/hf_hub/token
- Has saved token ?: True
- Who am I ?: hukaiqin
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.3.0
- Jinja2: 3.1.4
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 9.3.0
- hf_transfer: N/A
- gradio: 4.13.0
- tensorboard: 2.6
- numpy: 1.23.5
- pydantic: 2.5.3
- aiohttp: 3.9.1
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /ossfs/workspace/hf_hub/hub
- HF_ASSETS_CACHE: /ossfs/workspace/hf_hub/assets
- HF_TOKEN_PATH: /ossfs/workspace/hf_hub/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

Hi @kaiqinhu, sorry for the inconvenience. This is usually caused by a network issue while downloading. Can you retry with `force_download=True` or on a different network and let us know if the same error happens again (on the same file)? Thanks in advance.
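For example, a small retry loop along these lines (just a sketch, assuming the failure is a transient truncated transfer):

```python
import time

from datasets import DownloadConfig, load_dataset

download_config = DownloadConfig(force_download=True)
for attempt in range(5):
    try:
        # force_download=True re-fetches the script from the Hub on each attempt
        data = load_dataset("tau/scrolls", "quality", download_config=download_config)
        break
    except OSError as err:
        print(f"Attempt {attempt + 1} failed: {err}")
        time.sleep(2)
```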

Thanks for responding, but I already set `force_download=True` in `load_dataset()`, and I can't change the network because of the server-cluster settings.

Can you try running

```python
from huggingface_hub import hf_hub_download

hf_hub_download("tau/scrolls", filename="scrolls.py", repo_type="dataset", force_download=True)
```

to check if it does the same? (It has less hidden logic.)
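If the download succeeds, the file should be exactly 18612 bytes (the expected size from the error message), which you can check with something like:

```python
import os

from huggingface_hub import hf_hub_download

path = hf_hub_download("tau/scrolls", filename="scrolls.py", repo_type="dataset", force_download=True)
print(path, os.path.getsize(path))  # expected per the error message: 18612 bytes
```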

Closing this issue due to inactivity. Please feel free to reopen or create a new issue if needed.