neulab/awesome-align

Trying to train "ixa-ehu/ixambert-base-cased" model

jmurua14 opened this issue · 1 comment

Hi!

You have done a great job!! I have been training two different models: the one mentioned in the title ("ixa-ehu/ixambert-base-cased") and multilingual BERT (cased). With mBERT I didn't have any problems during training; however, when I try to train the other model I get a shape mismatch related to the vocabulary size.

In the config file of the "ixa-ehu/ixambert-base-cased" model, the vocabulary size is as follows:
08/18/2022 09:41:28 - INFO - awesome_align.configuration_utils - Model config BertConfig {
  "architectures": null,
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": null,
  "do_sample": false,
  "eos_token_ids": null,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": null,
  "repetition_penalty": 1.0,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 119099
}

When I begin training, I get this error:
Iteration: 0%| | 0/40000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/datuak/virtualenvs/transformers/bin/awesome-train", line 8, in <module>
    sys.exit(main())
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/awesome_align/run_train.py", line 848, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/awesome_align/run_train.py", line 370, in train
    loss = model(inputs_src=inputs_src, labels_src=labels_src)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/awesome_align/modeling.py", line 660, in forward
    masked_lm_loss = loss_fct(prediction_scores_src.view(-1, self.config.vocab_size), labels_src.view(-1))
RuntimeError: shape '[-1, 119101]' is invalid for input of size 5716752

As you can see, the vocab_size has increased by 2, from 119099 to 119101. This is due to the CLS and SEP tokens; however, I don't know why I get this error. I have tried manually decreasing the vocab_size in the code, but that leads to other errors when I compute the alignments.
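(A side note on the numbers in the error, plus a rough diagnostic that is not part of awesome-align itself: 5716752 = 48 × 119099, so the logits produced by the model still cover the original 119099-entry vocabulary, while config.vocab_size has been bumped to 119101, and the reshape fails. The sketch below, assuming the plain Hugging Face AutoTokenizer/AutoModelForMaskedLM API rather than awesome-align's bundled modeling code, shows one way to check whether the tokenizer and the MLM head agree and to resize the embeddings if they do not. Only the model name is taken from this issue; everything else is illustrative.)

    # Hypothetical diagnostic sketch (plain transformers, not awesome-align's code).
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    name = "ixa-ehu/ixambert-base-cased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)

    print("tokenizer vocab:", len(tokenizer))
    print("config.vocab_size:", model.config.vocab_size)
    print("output embedding rows:", model.get_output_embeddings().weight.shape[0])

    # If extra special tokens were added to the tokenizer (the "+2" hinted at above),
    # the embedding matrices must be resized to match, otherwise the loss reshape
    # fails exactly as in the traceback.
    if len(tokenizer) != model.get_output_embeddings().weight.shape[0]:
        model.resize_token_embeddings(len(tokenizer))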

I leave you here the awesome-train command I have used for training:
CUDA_VISIBLE_DEVICES=1 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=ixa-ehu/ixambert-base-cased \
    --extraction 'softmax' \
    --do_train \
    --train_mlm \
    --train_tlm \
    --train_tlm_full \
    --train_so \
    --train_psi \
    --train_co \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 10000 \
    --max_steps 40000

Could you please help me solve this issue?

Thanks!

Hi, right now the repo only supports mBERT and XLM-R. You can check this commit to see how to incorporate a new model.
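(Not the contents of that commit, just an illustration of the general pattern such a change tends to follow: the training code needs to know which tokenizer/model classes and special tokens to use for each supported checkpoint type, and reject the rest. All names below are hypothetical and use the plain Hugging Face Auto classes rather than awesome-align's bundled modules; see the referenced commit for the real changes.)

    # Hypothetical sketch of registering a new model type in a supported-models map.
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    SUPPORTED_MODEL_TYPES = {"bert", "xlm-roberta"}  # a new entry would be added here

    def load_aligner_model(name_or_path):
        config = AutoConfig.from_pretrained(name_or_path)
        if config.model_type not in SUPPORTED_MODEL_TYPES:
            raise ValueError(f"Unsupported model type: {config.model_type!r}")
        tokenizer = AutoTokenizer.from_pretrained(name_or_path)
        model = AutoModel.from_pretrained(name_or_path)
        return tokenizer, model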