zhjgao/difformer

size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([10096, 128]) from checkpoint, the shape in current model is torch.Size([151000, 128]).

Closed · 4 comments

Hi, my training run just ended at epoch 608:

2023-07-02 01:57:38 | INFO | fairseq_cli.train | end of epoch 608 (average epoch stats below)
2023-07-02 01:57:38 | INFO | train | {"epoch": 608, "train_loss": "3.663", "train_nll_loss": "1.275", "train_diffusion": "0.424", "train_word_ins": "3.224", "train_length": "0.151", "train_ppl": "12.67", "train_bleu": "0", "train_wps": "23834.3", "train_ups": "3.3", "train_wpb": "7233.2", "train_bsz": "309.9", "train_num_updates": "300000", "train_lr": "9.12871e-05", "train_gnorm": "2.611", "train_clip": "97.3", "train_loss_scale": "16384", "train_train_wall": "25", "train_wall": "55974"}
2023-07-02 01:57:38 | INFO | fairseq_cli.train | done training in 55973.7 seconds
Also, is the checkpoint_last.pt file the same as difformer.pt?
I downloaded the difformer.pt and transformer.pt files you provided, but running them raises this error:
RuntimeError: Error(s) in loading state_dict for Difformer:
size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([10096, 128]) from checkpoint, the shape in current model is torch.Size([151000, 128]).
size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([10096, 128]) from checkpoint, the shape in current model is torch.Size([151000, 128]).
size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([10096, 128]) from checkpoint, the shape in current model is torch.Size([151000, 128]).
Finished evaluate_step20_beam7x3. BLEU:
2023-07-02 12:54:26 | INFO | fairseq_cli.generate | loading model(s) from models/iwslt14_de_en/difformer.pt:models/iwslt14_de_en/transformer.pt
Forgive my ignorance, but could you offer some advice?
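
For reference, a minimal sketch of how to compare the vocabulary size baked into a checkpoint against the dictionary the current run builds its model from. The paths below are assumptions based on the logs above, and the layout is the standard fairseq checkpoint format:

```python
import torch

# Load the fairseq checkpoint on CPU and inspect the embedding matrix.
# Its first dimension is the vocabulary size the model was trained with.
ckpt = torch.load("models/iwslt14_de_en/difformer.pt", map_location="cpu")
print(ckpt["model"]["encoder.embed_tokens.weight"].shape)  # e.g. [10096, 128]

# At generation time, fairseq sizes the embeddings from the data-bin
# dictionary: one line per type, plus 4 special symbols
# (<s>, <pad>, </s>, <unk>), possibly padded to a multiple of 8.
with open("data-bin/iwslt14_de_en/dict.en.txt") as f:  # assumed path
    n_lines = sum(1 for _ in f)
print("dictionary size (before padding):", n_lines + 4)
```

If the two numbers disagree (10096 vs. 151000 in the error above), generation was pointed at a different data-bin dictionary than the one the checkpoint was trained on.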

zhjgao commented

Hi, thanks for your interest.

The transformer.pt should share the same vocabulary as the difformer.pt. So if you train a model using your own vocabulary, a new Transformer trained with that same vocabulary is required.
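
As a quick sanity check, one can verify that the two checkpoints expect the same vocabulary size. This is a sketch assuming the model paths from the generation log above:

```python
import torch

# Both checkpoints must have been trained on the same dictionary;
# compare the vocabulary dimension of their decoder embeddings.
for name in ("difformer.pt", "transformer.pt"):
    ckpt = torch.load(f"models/iwslt14_de_en/{name}", map_location="cpu")
    vocab = ckpt["model"]["decoder.embed_tokens.weight"].shape[0]
    print(name, "vocab size:", vocab)

# Matching sizes here, but a mismatch at load time, means the data-bin
# dictionary passed to fairseq-generate differs from the one both
# models were trained on.
```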

X-fxx commented

May I ask, did you eventually figure out which file is difformer.pt?

zhjgao commented

> May I ask, did you eventually figure out which file is difformer.pt?

  1. If you download our released checkpoints here, difformer.pt can be found at difformer_release/models/<dataset>/difformer.pt.
  2. If you would like to train your own model, checkpoints are saved at models/<dataset>/<model name>/ckpt, among which checkpoint_last.pt is the checkpoint saved when training finished, and checkpoint_best.pt is the best checkpoint according to the evaluation BLEU score.
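
For example, to evaluate your own run with the paths the generation script loads from (see the log above), you can copy the best checkpoint into place. The source path here is hypothetical; substitute your own <dataset> and <model name>:

```python
import shutil

# Hypothetical paths: promote the best checkpoint from your own run to
# the file name the evaluation command loads (models/<dataset>/difformer.pt).
src = "models/iwslt14_de_en/my_difformer/ckpt/checkpoint_best.pt"
dst = "models/iwslt14_de_en/difformer.pt"
shutil.copy(src, dst)
```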