ValueError: BuilderConfig 'rte' not found. Available: ['default']

Question

Synnai opened this issue 8 months ago · 2 comments

I had to download the GLUE dataset manually from the GLUE-baselines git repo, and I placed it in the directory that cache_dir points to in utils/load_config.py. However, when I ran python train_plms_glue.py --language_model_name roberta-base --dataset_name cola --multitask_training --auxiliary_dataset_name rte --learning_rate 1e-5 --num_runs 5, an error occurred.

The error log is as follows:

INFO:root:********** Run starts. **********
INFO:root:configuration is Namespace(dataset_name='rte_cola', auxiliary_dataset_name='rte', language_model_name='roberta-base', multitask_training=True, batch_size=16, num_epochs=10, learning_rate=1e-05, gpu=0, num_runs=5, device='cuda:0', target_dataset_name='cola', save_model_dir='./save_models/rte_cola/roberta-base_lr1e-05')
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /roberta-base/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 157614.76it/s]
Traceback (most recent call last):
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/train_plms_glue.py", line 93, in <module>
    glue_data_loader.load_multitask_datasets(dataset_names=dataset_names, train_split_ratio_for_val=0.1, max_seq_length=128)
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/utils/glue_data_loader.py", line 110, in load_multitask_datasets
    multiple_datasets = [self.load_dataset(dataset_name=dataset_name, train_split_ratio_for_val=train_split_ratio_for_val,
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/utils/glue_data_loader.py", line 110, in <listcomp>
    multiple_datasets = [self.load_dataset(dataset_name=dataset_name, train_split_ratio_for_val=train_split_ratio_for_val,
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/utils/glue_data_loader.py", line 76, in load_dataset
    dataset = load_dataset(path=os.path.join(cache_dir, "glue"), name=dataset_name)
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/builder.py", line 592, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'rte' not found. Available: ['default']

Similarly, when I ran python train_plms_glue.py --language_model_name roberta-base --dataset_name cola --learning_rate 1e-5 --num_runs 5 (without multitask training), another error occurred:

INFO:root:********** Run starts. **********
INFO:root:configuration is Namespace(dataset_name='cola', auxiliary_dataset_name='cola', language_model_name='roberta-base', multitask_training=False, batch_size=16, num_epochs=10, learning_rate=1e-05, gpu=0, num_runs=5, device='cuda:0', save_model_dir='./save_models/cola/roberta-base_lr1e-05')
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /roberta-base/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 157614.76it/s]
Traceback (most recent call last):
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/train_plms_glue.py", line 154, in <module>
    train_dataset, val_dataset, test_dataset, num_labels = glue_data_loader.load_dataset(dataset_name=args.dataset_name,
  File "/home/dell7960/PycharmProjects/DARE/MergeLM/utils/glue_data_loader.py", line 76, in load_dataset
    dataset = load_dataset(path=os.path.join(cache_dir, "glue"), name=dataset_name)
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/load.py", line 2265, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/builder.py", line 371, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/datasets/builder.py", line 592, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'cola' not found. Available: ['default']

Could you please advise on how to fix it?

Answer 1 · 2024-03-31T12:58:16.000Z

Hi,

For the GLUE dataset, we first download the dataset to the cache and then load the cached copy with this line. I think your issue occurs because there is a problem with the dataset you downloaded.

To fix this, uncomment this line and comment out the previous one. This will automatically download the GLUE dataset to the cache_dir.
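
As a rough sketch of the difference between the two variants (not the exact lines in utils/glue_data_loader.py, which may differ): the local-path call is taken from the traceback, while the Hub call is an assumption about what the commented-out line does. The helper names glue_load_kwargs and load_glue are hypothetical, and the Hub identifier "glue" is assumed (newer versions of datasets may expect "nyu-mll/glue" instead). A likely reason for Available: ['default'] is that a local folder of raw files is read with a generic builder that has no named configs such as rte or cola.

```python
import os


def glue_load_kwargs(dataset_name, cache_dir, download_from_hub=True):
    """Build keyword arguments for datasets.load_dataset (hypothetical helper).

    download_from_hub=True: request the "glue" dataset from the Hub and store
    it under cache_dir, so named configs such as "rte" and "cola" exist.
    download_from_hub=False: point at a local folder, as in the traceback.
    """
    if download_from_hub:
        # Assumed identifier; recent datasets releases may need "nyu-mll/glue".
        return {"path": "glue", "name": dataset_name, "cache_dir": cache_dir}
    # Call shape shown in the traceback: a local directory is passed as path.
    return {"path": os.path.join(cache_dir, "glue"), "name": dataset_name}


def load_glue(dataset_name, cache_dir, download_from_hub=True):
    # Imported lazily so glue_load_kwargs can be inspected without `datasets`.
    from datasets import load_dataset

    kwargs = glue_load_kwargs(dataset_name, cache_dir, download_from_hub)
    return load_dataset(**kwargs)
```

Calling load_glue("rte", cache_dir) with the default download_from_hub=True needs network access on the first run; afterwards the files under cache_dir are reused.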

Hope this helps! Feel free to ask if there are any further questions.

Answer 2 · 2024-04-01T06:56:44.000Z

It works. Thanks a lot!