musixmatchresearch/umberto

Differences in UmBERTo loaded with AutoModel or AutoModelForTokenClassification

Elidor00 opened this issue · 4 comments

I ran two different experiments with UmBERTo. In the first, I used the run_ner.py script for a PoS-tagging experiment. In run_ner.py, I load UmBERTo with

from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer
)

and then

config = AutoConfig.from_pretrained(...)
tokenizer = AutoTokenizer.from_pretrained(...)
model = AutoModelForTokenClassification.from_pretrained(...)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
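
For reference, here is a minimal standalone sketch of those loading calls with the elided arguments filled in with assumed values (the label set below is hypothetical; run_ner.py actually derives it from the training data):

from transformers import AutoConfig, AutoModelForTokenClassification, AutoTokenizer

model_name = "Musixmatch/umberto-wikipedia-uncased-v1"
labels = ["ADJ", "ADP", "ADV", "NOUN", "VERB"]  # hypothetical PoS tag set

# num_labels sizes the token-classification head placed on top of the encoder
config = AutoConfig.from_pretrained(model_name, num_labels=len(labels))
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, config=config)

print(model)  # -> CamembertForTokenClassification(...)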

The only change I made in run_ner.sh was to replace this line

export BERT_MODEL=bert-base-multilingual-cased

with this line:

export BERT_MODEL=Musixmatch/umberto-wikipedia-uncased-v1

At this point, when I print the UmBERTo model, the output is like this:

CamembertForTokenClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(32005, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      ... layers (1) through (11) are BertLayer modules identical in structure to layer (0) ...

In the second experiment I used from transformers import AutoTokenizer, AutoModel to load UmBERTo, and then:

pretrained_model = "Musixmatch/umberto-wikipedia-uncased-v1"
self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model, do_lower_case=not multi_lingual)
self.model = AutoModel.from_pretrained(pretrained_model).to(self.device)
self.model.eval()
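
For readers outside the class context, an equivalent standalone version might look like the sketch below (the device selection and do_lower_case=True are my assumptions, since multi_lingual is presumably False for this uncased checkpoint):

import torch
from transformers import AutoModel, AutoTokenizer

pretrained_model = "Musixmatch/umberto-wikipedia-uncased-v1"
# assumed device selection; the original snippet uses self.device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# do_lower_case=True assumes multi_lingual is False for this uncased model
tokenizer = AutoTokenizer.from_pretrained(pretrained_model, do_lower_case=True)
model = AutoModel.from_pretrained(pretrained_model).to(device)
model.eval()  # inference mode: disables dropout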

At this point, when I print the UmBERTo model, the output is like this:

CamembertModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(32005, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
     ... layers (1) through (11) are BertLayer modules identical in structure to layer (0) ...

Now my question is: do these two ways of loading UmBERTo give me exactly the same model?
If so, why does (roberta): RobertaModel( appear in the first printout but not in the second?

I know that the first model, being built specifically for the token classification task, has an extra head for fine-tuning; my question concerns everything else in the structure apart from that last layer.

Thanks!

Thank you for your question. @simonefrancia can help you.

Hi @Elidor00,
we didn't implement UmBERTo's integration into the HuggingFace library ourselves, but I'll try to explain what I know.

UmBERTo has the same architecture and the same pre-processing as CamemBERT, so the HuggingFace developers chose to integrate UmBERTo using the same model class used for CamemBERT. They explicitly told us they didn't want to implement a new class specific to UmBERTo.

But both the CamemBERT and UmBERTo model architectures inherit from the RoBERTa LM architecture, so I think you can load the UmBERTo model through either the RoBERTa or the CamemBERT classes and you will get the same result in terms of weights and model.

So I think you should not worry about it: it just depends on which class the HF developers configured for UmBERTo in each task. If you want to be sure you always use the same class, you could use the CamembertModel class instead of AutoModel, and CamembertForTokenClassification for the downstream token-level task.
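
If you want to verify this empirically, a minimal sketch along these lines should work; it relies on the roberta. prefix visible in the first printout above, and only the randomly initialized classification head should differ:

import torch
from transformers import AutoModel, AutoModelForTokenClassification

name = "Musixmatch/umberto-wikipedia-uncased-v1"
base = AutoModel.from_pretrained(name)                            # CamembertModel
token_clf = AutoModelForTokenClassification.from_pretrained(name)

base_sd = base.state_dict()
# In the token-classification model the same encoder lives under the
# "roberta." prefix, which is why "(roberta): RobertaModel(" shows up
# in the first printout.
clf_sd = {k[len("roberta."):]: v
          for k, v in token_clf.state_dict().items()
          if k.startswith("roberta.")}

shared = set(base_sd) & set(clf_sd)
assert all(torch.equal(base_sd[k], clf_sd[k]) for k in shared)
print(len(shared), "shared tensors are identical")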

I hope this is helpful for you.

Hi @simonefrancia,
thank you very much for your comprehensive answer.

Just out of curiosity, I will also run a test using the CamemBERT classes for both tasks.

Closing the issue, but you can re-open it whenever you want.