OSError: Consistency check failed: file should be of size 18612 but has size 18605 (datasets/tau/scrolls@main/scrolls.py).
Describe the bug
```python
from datasets import load_dataset, DownloadConfig
from datasets import Dataset

Dataset.cleanup_cache_files  # note: referenced but never called, so this line is a no-op
scrolls_datasets = ["quality"]
download_config = DownloadConfig(force_download=True)
data = [load_dataset("tau/scrolls", dataset, force_download=True, download_config=download_config) for dataset in scrolls_datasets]
```
Reproduction
No response
Logs
```
$ python main.py
Downloading builder script:  20%|██████████████████▌  | 3.68k/18.6k [00:00<00:01, 11.4kB/s]Traceback (most recent call last):
  File "main.py", line 12, in <module>
    data = [load_dataset("tau/scrolls", dataset, force_download=True, download_config=download_config) for dataset in scrolls_datasets]
  File "main.py", line 12, in <listcomp>
    data = [load_dataset("tau/scrolls", dataset, force_download=True, download_config=download_config) for dataset in scrolls_datasets]
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 2606, in load_dataset
    builder_instance = load_dataset_builder(
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 2277, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1923, in dataset_module_factory
    raise e1 from None
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1889, in dataset_module_factory
    return HubDatasetModuleFactoryWithScript(
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1507, in get_module
    local_path = self.download_loading_script()
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1467, in download_loading_script
    return cached_path(file_path, download_config=download_config)
  File "/opt/conda/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 211, in cached_path
    output_path = get_from_cache(
  File "/opt/conda/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 690, in get_from_cache
    fsspec_get(
  File "/opt/conda/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 396, in fsspec_get
    fs.get_file(path, temp_file.name, callback=callback)
  File "/opt/conda/lib/python3.8/site-packages/huggingface_hub/hf_file_system.py", line 640, in get_file
    http_get(
  File "/opt/conda/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 570, in http_get
    raise EnvironmentError(
OSError: Consistency check failed: file should be of size 18612 but has size 18605 (datasets/tau/scrolls@main/scrolls.py).
We are sorry for the inconvenience. Please retry with `force_download=True`.
If the issue persists, please let us know by opening an issue on https://github.com/huggingface/huggingface_hub.
Downloading builder script: 100%|█████████████████████████████████████████████████████████████████████████████████████████████▉| 18.6k/18.6k [00:00<00:00, 36.0kB/s]
```
System info
- huggingface_hub version: 0.25.1
- Platform: Linux-4.9.151-015.ali3000.alios7.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.18
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /ossfs/workspace/hf_hub/token
- Has saved token ?: True
- Who am I ?: hukaiqin
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.3.0
- Jinja2: 3.1.4
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 9.3.0
- hf_transfer: N/A
- gradio: 4.13.0
- tensorboard: 2.6
- numpy: 1.23.5
- pydantic: 2.5.3
- aiohttp: 3.9.1
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /ossfs/workspace/hf_hub/hub
- HF_ASSETS_CACHE: /ossfs/workspace/hf_hub/assets
- HF_TOKEN_PATH: /ossfs/workspace/hf_hub/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
Hi @kaiqinhu, sorry for the inconvenience. This is usually caused by a network issue while downloading. Can you retry with `force_download=True`, or on a different network, and let us know if the same error happens again (on the same file)? Thanks in advance!
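For reference, forcing a fresh download at the `datasets` level looks like the sketch below. Note that `force_download=True` is not itself a `load_dataset` parameter (it is a `DownloadConfig` field); the `load_dataset`-level switch is `download_mode="force_redownload"`. This is a minimal sketch and won't avoid the consistency error if the network keeps truncating the file:

```python
from datasets import load_dataset, DownloadConfig

# Redownload both the loading script and the data,
# ignoring any cached (possibly truncated) copies.
data = load_dataset(
    "tau/scrolls",
    "quality",
    download_mode="force_redownload",
    download_config=DownloadConfig(force_download=True),
)
```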
Thanks for responding, but I already set `force_download=True` in `load_dataset()`, and I can't change the network because of the server-cluster settings.
Can you try running

```python
from huggingface_hub import hf_hub_download

hf_hub_download("tau/scrolls", filename="scrolls.py", repo_type="dataset", force_download=True)
```

to check whether it fails the same way? (It has less hidden logic.)
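If the failure is a transient network truncation, wrapping that call in a simple retry loop can help tell a flaky connection apart from a proxy that consistently rewrites the file. A minimal sketch (the retry count and back-off are illustrative, not library behavior):

```python
import time

from huggingface_hub import hf_hub_download

for attempt in range(5):
    try:
        path = hf_hub_download(
            "tau/scrolls",
            filename="scrolls.py",
            repo_type="dataset",
            force_download=True,
        )
        print(f"Downloaded to {path}")
        break
    except OSError as err:  # the consistency check raises EnvironmentError/OSError
        print(f"Attempt {attempt + 1} failed: {err}")
        time.sleep(2 ** attempt)  # illustrative back-off
```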
Closing this issue due to inactivity. Please feel free to reopen or create a new issue if needed.