embeddings-benchmark/mteb

[MIEB] image file is truncated

Muennighoff opened this issue · 1 comments

I'm on the latest commit of the mieb branch, running python mteb/scripts/run_mieb.py with only CLIP, and I'm getting the error below 🤔

/env/lib/conda/gritkto4/lib/python3.10/site-packages/PIL/TiffImagePlugin.py:935: UserWarning: Truncated File Read
  warnings.warn(str(msg))

ERROR:mteb.evaluation.MTEB:Error while evaluating Birdsnap: image file is truncated (45 bytes not processed)
Traceback (most recent call last):
  File "/data/niklas/mieb/mteb/scripts/run_mieb.py", line 23, in <module>
    results = evaluation.run(model, output_folder="results-mieb-final")
  File "/data/niklas/mieb/mteb/mteb/evaluation/MTEB.py", line 422, in run
    raise e
  File "/data/niklas/mieb/mteb/mteb/evaluation/MTEB.py", line 383, in run
    results, tick, tock = self._run_eval(
  File "/data/niklas/mieb/mteb/mteb/evaluation/MTEB.py", line 260, in _run_eval
    results = task.evaluate(
  File "/data/niklas/mieb/mteb/mteb/abstasks/Image/AbsTaskImageClassification.py", line 99, in evaluate
    scores[hf_subset] = self._evaluate_subset(
  File "/data/niklas/mieb/mteb/mteb/abstasks/Image/AbsTaskImageClassification.py", line 135, in _evaluate_subset
    X_sampled, y_sampled, idxs = self._undersample_data(
  File "/data/niklas/mieb/mteb/mteb/abstasks/Image/AbsTaskImageClassification.py", line 202, in _undersample_data
    label = dataset_split[i][label_column_name]
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2762, in __getitem__
    return self._getitem(key)
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2747, in _getitem
    formatted_output = format_table(
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 639, in format_table
    return formatter(pa_table, query_type=query_type)
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 403, in __call__
    return self.format_row(pa_table)
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 444, in format_row
    row = self.python_features_decoder.decode_row(row)
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 222, in decode_row
    return self.features.decode_example(row) if self.features else row
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/datasets/features/features.py", line 2041, in decode_example
    return {
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/datasets/features/features.py", line 2042, in <dictcomp>
    column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/datasets/features/features.py", line 1403, in decode_nested_example
    return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/datasets/features/image.py", line 188, in decode_example
    image.load()  # to avoid "Too many open files" errors
  File "/env/lib/conda/gritkto4/lib/python3.10/site-packages/PIL/ImageFile.py", line 297, in load
    raise OSError(msg)
OSError: image file is truncated (45 bytes not processed)
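
The traceback ends in Pillow's `load()`, which raises `OSError` when the image bytes stop before decoding finishes. A minimal sketch of how one might check whether a given image payload decodes fully is below. The helper names `is_decodable` and `make_jpeg_bytes` are hypothetical and not part of mteb or datasets, and the truncated JPEG is synthetic, built only to reproduce the same class of error.

```python
import io

from PIL import Image


def is_decodable(data: bytes) -> bool:
    """Return True if Pillow can fully decode the given image bytes."""
    try:
        with Image.open(io.BytesIO(data)) as img:
            img.load()  # force a full decode, the same call that fails in the traceback
        return True
    except OSError:
        # Covers truncated files as well as data Pillow cannot identify.
        return False


def make_jpeg_bytes(size: int = 128) -> bytes:
    """Build a small synthetic grayscale JPEG in memory (illustration only)."""
    pixels = bytes((x * y) % 256 for y in range(size) for x in range(size))
    img = Image.frombytes("L", (size, size), pixels)
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return buf.getvalue()


full = make_jpeg_bytes()
truncated = full[: len(full) // 2]  # drop the second half to simulate a cut-off file

print(is_decodable(full))
print(is_decodable(truncated))
```

Running a check like this over the raw image bytes of a split would be one way to locate the offending Birdsnap samples. Pillow also has a global switch, `ImageFile.LOAD_TRUNCATED_IMAGES = True`, that lets `load()` return a partially decoded image instead of raising; whether partially decoded images are acceptable for evaluation is a separate question.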

I'm going to upload a downsampled version of the train split with ~32 images for each of the 500 bird species. cc @gowitheflow-1998
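
For illustration, per-class downsampling of this kind could be sketched as follows. This is not necessarily how the uploaded split was produced: the function name `downsample_indices`, the seed, and the toy labels are assumptions. It works on the label list alone, so no image needs to be decoded while choosing which rows to keep.

```python
import random
from collections import defaultdict


def downsample_indices(labels, per_class=32, seed=42):
    """Pick at most `per_class` row indices for each distinct label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    keep = []
    for label in sorted(by_label):
        idxs = by_label[label]
        if len(idxs) > per_class:
            idxs = rng.sample(idxs, per_class)
        keep.extend(idxs)  # classes with fewer rows are kept whole
    return sorted(keep)


# Toy labels standing in for the real species column: 5 classes, 100 rows each.
labels = [i % 5 for i in range(500)]
keep = downsample_indices(labels, per_class=32)
print(len(keep))  # 5 classes * 32 rows = 160
```

With a Hugging Face `datasets.Dataset`, the resulting indices could then be passed to `Dataset.select` to materialize the smaller split.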