LUMIA-Group/rasat

Error message "size mismatch for relation_k_emb.weight" when trying to load a trained model using t5-small

kanseaveg opened this issue · 0 comments

I am running RASAT on two consumer-grade graphics cards. The pre-trained model I am using is t5-small. Training ran successfully with the following command:

CUDA_VISIBLE_DEVICES="0,1" python3 -m torch.distributed.launch --nnodes=1 --nproc_per_node=2 seq2seq/run_seq2seq.py configs/spider/train_spider_rasat_small.json

***** eval metrics *****
  epoch                   =    3071.95
  eval_exact_match        =     0.5348
  eval_exec               =     0.5387
  eval_loss               =     0.7128
  eval_runtime            = 0:02:24.19
  eval_samples            =       1034
  eval_samples_per_second =      7.171
100% 65/65 [02:22<00:00,  2.20s/it]
<__array_function__ internals>:5: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

However, for evaluation I set the evaluation model path to "./experiment/train_spider_rasat_small", as mentioned in the train configuration file.

I then encountered an error when executing the evaluation command:
python3 seq2seq/eval_run_seq2seq.py configs/spider/eval_spider_rasat_4160.json

The error message is:

Dataset name: spider
Mode: dev
Databases has been preprocessed. Use cache.
Dataset has been preprocessed. Use cache.
Dataset: spider
Mode: dev
Match Questions...
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1034/1034 [00:01<00:00, 606.60it/s]
Question match errors: 0/1034
Match Table, Columns, DB Contents...
1034it [00:01, 614.75it/s]
DB match errors: 0/1034
Generate Relations...
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1034/1034 [00:10<00:00, 95.10it/s]
Edge match errors: 0/2340638
06/28/2023 20:30:11 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at ./transformers_cache/spider/spider/1.0.0/a9000e8b37ea883ad113d628d95c9067385cc1105e2641a44bfa3090483dbb9b/cache-21e2b8bdcac7ddca.arrow
===================================================
Num of relations uesd in RASAT is :  45
===================================================
Use relation model.
./experiment/train_spider_rasat_small
Traceback (most recent call last):
  File "seq2seq/eval_run_seq2seq.py", line 320, in <module>
    main()
  File "seq2seq/eval_run_seq2seq.py", line 208, in main
    model = nn.DataParallel(model_cls_wrapper(T5ForConditionalGeneration).from_pretrained(
  File "/opt/conda/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1453, in from_pretrained
    model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_state_dict_into_model(
  File "/opt/conda/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1607, in _load_state_dict_into_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for T5ForConditionalGeneration:
        size mismatch for relation_k_emb.weight: copying a param with shape torch.Size([49, 64]) from checkpoint, the shape in current model is torch.Size([46, 64]).
        size mismatch for relation_v_emb.weight: copying a param with shape torch.Size([49, 64]) from checkpoint, the shape in current model is torch.Size([46, 64]).
        size mismatch for encoder.relation_k_emb.weight: copying a param with shape torch.Size([49, 64]) from checkpoint, the shape in current model is torch.Size([46, 64]).
        size mismatch for encoder.relation_v_emb.weight: copying a param with shape torch.Size([49, 64]) from checkpoint, the shape in current model is torch.Size([46, 64]).
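For what it's worth, the failure can be reproduced in isolation: the checkpoint's relation-embedding tables have 49 rows, while the evaluation run builds a model expecting 46 (the log above reports 45 relations, presumably plus one extra row, so the training run appears to have been configured with a different relation count). A minimal sketch, assuming nothing about RASAT beyond the shapes in the traceback:

```python
import torch.nn as nn

# Reproduces the failure mode in isolation: an embedding table saved with
# 49 rows cannot be loaded into a model built with 46 rows. The numbers
# mirror the traceback above; 64 is the per-relation embedding dimension.
trained = nn.Embedding(49, 64)     # shape written by the training run
state = trained.state_dict()

eval_model = nn.Embedding(46, 64)  # shape the evaluation run builds
try:
    eval_model.load_state_dict(state)
except RuntimeError as e:
    # Same "size mismatch" error class as in the issue.
    print("size mismatch" in str(e))
```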

wandb: Waiting for W&B process to finish, PID 310089... (failed 1). Press ctrl-c to abort syncing.

Could you please check and see where the error occurred? Thank you.
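In case it helps with debugging, here is a small diagnostic sketch for checking which relation count a checkpoint was actually trained with, before calling from_pretrained. The helper name and the synthetic state_dict are illustrative; in practice one would pass the result of torch.load on the checkpoint's weight file.

```python
import torch

def relation_emb_shapes(state_dict):
    """Collect the shapes of any RASAT relation-embedding tables."""
    return {
        name: tuple(t.shape)
        for name, t in state_dict.items()
        if "relation_k_emb" in name or "relation_v_emb" in name
    }

# Illustrative stand-in for the real checkpoint; in practice load it with
# e.g. torch.load(<path to the saved weights>, map_location="cpu").
state = {
    "relation_k_emb.weight": torch.zeros(49, 64),
    "relation_v_emb.weight": torch.zeros(49, 64),
}
print(relation_emb_shapes(state))
```

A 49-row table here, versus the 46 the evaluation run expects, would confirm that the relation counts diverged between training and evaluation configs.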