AssertionError: cannot find file trainer_state.json!
Synnai opened this issue · 10 comments
Thanks for your help! However, I encountered another problem when I tried to do inference and model merging. The errors were similar.
For model inference, I ran `python inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --use_weight_rescale` and got:
Traceback (most recent call last):
File "/home/dell7960/PycharmProjects/DARE/MergeLM/inference_plms_glue.py", line 107, in <module>
assert os.path.exists(os.path.join(training_args.output_dir, "trainer_state.json")), "cannot find file trainer_state.json!"
AssertionError: cannot find file trainer_state.json!
wandb: \ 0.019 MB of 0.030 MB uploaded
wandb: Run history:
wandb: eval/loss ▁
wandb: eval/matthews_correlation ▁
wandb: eval/runtime ▁
wandb: eval/samples_per_second ▁
wandb: eval/steps_per_second ▁
wandb: train/global_step ▁
wandb:
wandb: Run summary:
wandb: eval/loss 0.66263
wandb: eval/matthews_correlation 0.577
wandb: eval/runtime 4.0943
wandb: eval/samples_per_second 254.743
wandb: eval/steps_per_second 8.06
wandb: train/global_step 0
wandb:
wandb: View run earthy-morning-5 at: https://wandb.ai/seiunskye/huggingface/runs/741fv3j8
wandb: ️⚡ View job at https://wandb.ai/seiunskye/huggingface/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjE1NDczNzUwMw==/version_details/v0
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240401_145610-741fv3j8/logs
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): o151352.ingest.sentry.io:443
For model merging, I ran `python merge_plms_glue.py --merging_method_name average_merging --language_model_name roberta-base` and got:
/home/dell7960/PycharmProjects/VisionLaSeR/.venv/lib/python3.10/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 872/872 [00:00<00:00, 22382.63 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 872/872 [00:00<00:00, 19705.15 examples/s]
Traceback (most recent call last):
File "/home/dell7960/PycharmProjects/DARE/MergeLM/merge_plms_glue.py", line 165, in <module>
assert os.path.exists(os.path.join(training_args.output_dir, "trainer_state.json")), "cannot find file trainer_state.json!"
AssertionError: cannot find file trainer_state.json!
Could you please give me any advice on how to fix this?
Can you provide the training command you used?
This issue arises because the model you loaded for inference is mismatched with the model you trained, and thus it cannot be found.
I just executed `python train_plms_glue.py --language_model_name roberta-base --dataset_name cola --learning_rate 1e-5 --num_runs 5`, as in the example given in README.md.
Maybe your issue can be solved by the following:
- Make sure that your training process has completed. After executing the training command, you should find the saved files such as `pytorch_model.bin`, `trainer_state.json`, etc.
- In our implementation, the prerequisite for using `inference_plms_glue.py` is that you have already trained the language model (e.g., roberta-base) on all eight GLUE datasets, since we use a loop to run over the eight datasets in one execution. If you only want to run inference on part of the datasets (e.g., cola), you can simply change the `dataset_names` here to `["cola"]` (see the sketch after this list).
- As we found that the models achieve their best performance with different learning rates on different datasets, please make sure the learning rate you use to train the language model matches the mapping relations we found.
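For example, a rough sketch of that change, assuming `inference_plms_glue.py` defines a `dataset_names` list that the inference loop iterates over (the exact variable name and location may differ in your copy):

```bash
# Sketch only: restrict inference to the task you actually trained on, so the
# script does not look for trainer_state.json of the other seven tasks.
#
# In inference_plms_glue.py, change the list that the inference loop iterates over:
#   dataset_names = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]
# to:
#   dataset_names = ["cola"]
#
# Then rerun the inference command:
python inference_plms_glue.py --language_model_name roberta-base --weight_mask_rate 0.9 --use_weight_rescale
```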
Hope this answer helps. Feel free to ask if there are still any further questions.
Thanks. But does "trained the language model (e.g., roberta-base) on all the eight GLUE datasets" mean continuously executing `python train_plms_glue.py --language_model_name roberta-base --dataset_name ${data} --learning_rate 1e-5 --num_runs 5` for all `${data}` in `["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]`,
or `python train_plms_glue.py --language_model_name roberta-base --dataset_name cola --multitask_training --auxiliary_dataset_name ${data} --learning_rate 1e-5 --num_runs 5` for all `${data}` in `["sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]`,
or something else?
You can run `python train_plms_glue.py --language_model_name roberta-base --dataset_name ${data} --learning_rate ${lr} --num_runs 5` for `${data}` in `["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]`. Remember that you need to set the learning rate `${lr}` according to the mapping relations to reproduce the results in our paper.
Moreover, please note that `--multitask_training` is for training the model in the multi-task learning setting, which is only used for reporting the results for encoder-based LMs in Figure 5. The inference and merging experiments do not need this setting.
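For concreteness, a shell sketch of that loop. The learning rates in `lr_map` below are placeholders only, not the actual mapping; substitute the dataset-specific values before running:

```bash
# Sketch only: fine-tune roberta-base on all eight GLUE tasks in sequence.
# The learning rates below are placeholders -- replace them with the
# dataset-specific values from the learning-rate mapping.
declare -A lr_map=(
  [cola]=1e-5 [sst2]=1e-5 [mrpc]=1e-5 [stsb]=1e-5
  [qqp]=1e-5  [mnli]=1e-5 [qnli]=1e-5 [rte]=1e-5
)
for data in cola sst2 mrpc stsb qqp mnli qnli rte; do
  python train_plms_glue.py \
    --language_model_name roberta-base \
    --dataset_name "${data}" \
    --learning_rate "${lr_map[$data]}" \
    --num_runs 5
done
```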
Thanks for your patience. I'm currently training the roberta-base language model on all eight GLUE datasets, but my disk space is running out. Could you please tell me how much disk space and RAM the whole training, inference, and merging process required? I'm using an Nvidia RTX 3090 with about 24 GB of graphics memory and about 125 GB of RAM, so I'm worried about whether it's possible to successfully reproduce the results in your paper.
Thanks again.
- I think an Nvidia RTX 3090 with 24 GB of GPU memory and about 125 GB of CPU RAM satisfies the training/inference/merging requirements of roberta-base.
- Regarding disk space: since we save a checkpoint at each epoch, 10 checkpoints plus the final best checkpoint will be saved after training, which takes approximately 15 GB of disk space for roberta-base on each dataset, i.e., 15 * 8 = 120 GB for all eight datasets. If your disk space is not enough, you can manually delete the intermediate checkpoints and keep only the best checkpoint, which takes only about 1 GB of disk space per dataset (see the sketch below).
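Something like the following can reclaim that space. This is a sketch under the assumption that the Trainer writes the per-epoch checkpoints as `checkpoint-<step>` subdirectories inside each run's output directory; the path is an example, so adjust it to your actual output_dir and double-check that the best/final model files (including `trainer_state.json`) live directly in the output directory before deleting anything:

```bash
# Sketch only: free disk space by deleting the per-epoch checkpoint folders.
# OUTPUT_DIR is a hypothetical example -- point it at the output_dir you used
# for training, and make sure the final/best model files (pytorch_model.bin,
# trainer_state.json, etc.) are kept outside the checkpoint-* subdirectories.
OUTPUT_DIR=./save_models/cola/roberta-base   # example path, adjust to yours
ls "${OUTPUT_DIR}"                           # inspect before deleting
rm -rf "${OUTPUT_DIR}"/checkpoint-*          # remove intermediate checkpoints
```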
Got it. But according to issue 6 here, my hardware environment may not support further experiments on the WizardLM series, right?
For experiments on decoder-based LLMs, I think your device can support the inference process of 7B LLMs.
However, our current implementation of model merging on LLMs requires more RAM than your device can provide, making it infeasible for you to conduct the merging experiments. But there is still another solution. The mergekit toolkit has integrated our work with a much more memory-efficient implementation. You can first merge the LLMs based on this toolkit and then evaluate the merged LLM based on our evaluation code. I think this can work as well.
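A rough sketch of that alternative workflow (this is not from our repository: the YAML below is an assumed example of a mergekit DARE-based merge config with placeholder model names, so please check mergekit's documentation for the exact schema and supported options):

```bash
# Sketch only: merge with mergekit (memory-efficient), then evaluate the merged
# model with this repository's evaluation code. Model names are placeholders.
pip install mergekit
cat > dare_merge.yml << 'EOF'
merge_method: dare_ties          # DARE drop-and-rescale combined with TIES-style merging
base_model: <hub-id-or-path-of-the-pretrained-base-LLM>
models:
  - model: <hub-id-or-path-of-finetuned-LLM-1>
    parameters:
      weight: 0.5
      density: 0.5               # fraction of delta parameters kept (1 - drop rate)
  - model: <hub-id-or-path-of-finetuned-LLM-2>
    parameters:
      weight: 0.5
      density: 0.5
dtype: float16
EOF
mergekit-yaml dare_merge.yml ./merged_model
```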
Closing this issue now.
Please feel free to reopen it if there are any further questions.