Question about BERT/XLM-R embeddings
fym0503 opened this issue · 2 comments
Hi, I have read your interesting paper and code. My question is: since BERT and XLM-R have many layers, what kind of embeddings do you use? Just the word embeddings, or a mixture of intermediate layer representations? Did you find any difference between these options?
Thanks!
Hi,
There are two scenarios:
- If we do not fine-tune the transformer-based embeddings, we find that using the last four layers performs best in most cases (this is consistent with previous work).
- If we fine-tune the transformer-based embeddings, we find in most cases that using only the last layer as the embedding performs best on the downstream tasks. Therefore, in ACE we use the last layer of the transformer as the embedding for the fine-tuned BERT/XLM-R/... so that the embeddings are used in the same way as in their fine-tuning process (both options are sketched below).
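
For reference, here is a minimal sketch of the two layer choices using the Hugging Face `transformers` API. This is not the ACE code itself; the model name and example sentence are just placeholders.

```python
# Sketch (assumption, not ACE code): extracting the two kinds of representations.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

inputs = tokenizer("ACE concatenates embeddings", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding layer plus one entry per transformer layer.
hidden_states = outputs.hidden_states

# Scenario 1: frozen embeddings -> concatenate the last four layers.
frozen_repr = torch.cat(hidden_states[-4:], dim=-1)  # (1, seq_len, 4 * hidden_size)

# Scenario 2: fine-tuned embeddings -> use only the last layer,
# matching how the model is used during fine-tuning.
finetuned_repr = hidden_states[-1]                   # (1, seq_len, hidden_size)
```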
Alternatively, you can train a model for each embedding with different settings (e.g., the last layer vs. the last four layers, with or without fine-tuning) and compare model accuracy to decide the final setting for each embedding.
By the way, we use the first subtoken as the representation of each token.
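
A minimal sketch of first-subtoken pooling (again an assumption, not the ACE implementation) with a fast Hugging Face tokenizer: each word is represented by the hidden state of its first word piece.

```python
# Sketch: map each word to the hidden state of its first subtoken.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

words = ["Transformers", "use", "subword", "tokenization"]
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    last_hidden = model(**encoding).last_hidden_state[0]  # (seq_len, hidden_size)

# word_ids() maps each subtoken position to its word index (None for special tokens).
token_reprs, seen = [], set()
for pos, wid in enumerate(encoding.word_ids()):
    if wid is not None and wid not in seen:  # first subtoken of this word
        seen.add(wid)
        token_reprs.append(last_hidden[pos])

token_reprs = torch.stack(token_reprs)  # (num_words, hidden_size)
```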
Thanks, your comments are very clear. I have done some similar experiments, and my conclusion is nearly the same.