embeddings-benchmark/mteb

Error on MTOPDomainClassification task

ZiyiXia opened this issue · 5 comments

When running evaluation on the MTEB English benchmark, I got the following error during the MTOPDomainClassification task:

ERROR:mteb.evaluation.MTEB:Error while evaluating MTOPDomainClassification: Consistency check failed: file should be of size 2191 but has size 2190 ((…)62165c59d59d0034df9fff0bf/mtop_domain.py).
We are sorry for the inconvenience. Please retry with `force_download=True`.
If the issue persists, please let us know by opening an issue on https://github.com/huggingface/huggingface_hub.
Traceback (most recent call last):
  File "/share/project/xzy/test/mteb_eval.py", line 56, in <module>
    evaluation.run(
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/mteb/evaluation/MTEB.py", line 422, in run
    raise e
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/mteb/evaluation/MTEB.py", line 352, in run
    task.load_data(eval_splits=task_eval_splits, **kwargs)
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/mteb/abstasks/MultiSubsetLoader.py", line 15, in load_data
    self.slow_load()
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/mteb/abstasks/MultiSubsetLoader.py", line 44, in slow_load
    self.dataset[lang] = datasets.load_dataset(
                         ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/datasets/load.py", line 2606, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/datasets/load.py", line 2277, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/datasets/load.py", line 1923, in dataset_module_factory
    raise e1 from None
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/datasets/load.py", line 1896, in dataset_module_factory
    ).get_module()
      ^^^^^^^^^^^^
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/datasets/load.py", line 1507, in get_module
    local_path = self.download_loading_script()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/datasets/load.py", line 1467, in download_loading_script
    return cached_path(file_path, download_config=download_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 211, in cached_path
    output_path = get_from_cache(
                  ^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 689, in get_from_cache
    fsspec_get(
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 395, in fsspec_get
    fs.get_file(path, temp_file.name, callback=callback)
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/huggingface_hub/hf_file_system.py", line 648, in get_file
    http_get(
  File "/root/anaconda3/envs/faiss/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 578, in http_get
    raise EnvironmentError(
OSError: Consistency check failed: file should be of size 2191 but has size 2190 ((…)62165c59d59d0034df9fff0bf/mtop_domain.py).
We are sorry for the inconvenience. Please retry with `force_download=True`.
If the issue persists, please let us know by opening an issue on https://github.com/huggingface/huggingface_hub.

I tried to download the dataset directly through HF datasets but got the same error as above:

from datasets import load_dataset

data = load_dataset("mteb/mtop_domain", "en", force_download=True)

Any idea how to get past this? Appreciate your help!

Thanks for reporting this. I can't reproduce the error locally:

data = load_dataset("mteb/mtop_domain", "en", trust_remote_code=True)
# runs without issues

but I do get an error from using the force_download flag, which makes me believe that you are using another version of datasets. If you let me know which one, I will make sure we specify it in the requirements.
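In the meantime, you could try forcing a fresh download via download_mode instead of force_download; a minimal sketch (untested against your exact setup, but "force_redownload" is the documented way to bypass cached files in recent datasets releases):

from datasets import load_dataset

# re-fetch the loading script and data files instead of reusing whatever
# partially downloaded copy is failing the consistency check
data = load_dataset(
    "mteb/mtop_domain",
    "en",
    trust_remote_code=True,
    download_mode="force_redownload",
)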

I am using the version:

import datasets
datasets.__version__
# 2.21.0

Thanks for your response. I just checked my datasets version, which is also 2.21.0.
Could this be an error during the downloading process? I haven't downloaded the dataset in my environment, but you might already have it cached, so loading the dataset locally wouldn't hit the error.
I will also open an issue on the datasets repo to see if people there have any idea what's going on.
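For reference, the local cache location (assuming the defaults) can be checked with:

from datasets import config

# default datasets cache, typically ~/.cache/huggingface/datasets
# (overridden by HF_DATASETS_CACHE / HF_HOME if set)
print(config.HF_DATASETS_CACHE)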

Tried in a colab notebook, could not reproduce it:
https://colab.research.google.com/drive/1U6_tvysJdH-hiWUEMXN6p9COJoZi8CjG?usp=sharing

It might be worth resetting the Hugging Face cache.
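Something along these lines should do it, assuming the default cache locations (adjust if HF_HOME or HF_DATASETS_CACHE is set):

import shutil
from pathlib import Path

# default Hugging Face cache directories; note that removing the hub cache
# also drops any cached models, not just datasets
shutil.rmtree(Path.home() / ".cache" / "huggingface" / "datasets", ignore_errors=True)
shutil.rmtree(Path.home() / ".cache" / "huggingface" / "hub", ignore_errors=True)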

Just solved this by reinstalling huggingface_hub and datasets. Thanks for your help!
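For anyone who hits the same thing, the reinstall was presumably along the lines of:

pip install --force-reinstall --upgrade huggingface_hub datasets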

Wonderful, great to hear!