Are Mention Flag Embedding tables shared between different layers?
Leadlegend opened this issue · 3 comments
In t5.py of the ACL2021MF repository by GaryYufei, I noticed that the mention flag embeddings are declared in class T5Attention, which means that every transformer layer has its own independent embedding tables, and I didn't find any code that gives them a special initialization.
class T5Attention(nn.Module):
    def __init__(self, config: T5Config, has_relative_attention_bias=False, is_bidirectional=False, use_mention_flag=False, mention_flag_num=3, use_orginal_enc_pos_embs=True, use_mf_scalar=False):
        super().__init__()
        # ... (other initialization, e.g. self.d_kv and self.n_heads, omitted) ...
        self.use_mention_flag = use_mention_flag
        self.use_mf_scalar = use_mf_scalar
        if self.use_mention_flag and mention_flag_num > 0:
            if not self.use_mf_scalar:
                # one pair of per-head MF tables is created in every attention layer
                self.k_mention_flag = nn.Embedding(mention_flag_num, self.d_kv)
                self.v_mention_flag = nn.Embedding(mention_flag_num, self.d_kv)
            else:
                self.mention_flag_scalar = nn.Embedding(mention_flag_num, self.n_heads)
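As far as I can tell, these per-head tables are looked up with the integer MF matrix and the results are added to the keys and values inside attention, in the spirit of relative position embeddings. A rough, self-contained sketch of that reading (shapes and names are my own illustration, not the code from t5.py):

import torch
import torch.nn as nn

mention_flag_num, d_kv = 3, 64
k_mention_flag = nn.Embedding(mention_flag_num, d_kv)   # same shapes as in the snippet above
v_mention_flag = nn.Embedding(mention_flag_num, d_kv)

batch, n_heads, tgt_len, src_len = 2, 8, 5, 7
query = torch.randn(batch, n_heads, tgt_len, d_kv)
key = torch.randn(batch, n_heads, src_len, d_kv)
value = torch.randn(batch, n_heads, src_len, d_kv)
# integer MF matrix: one flag per (target position, source token) pair
mention_flag = torch.randint(0, mention_flag_num, (batch, tgt_len, src_len))

k_bias = k_mention_flag(mention_flag)                    # (batch, tgt, src, d_kv)
v_bias = v_mention_flag(mention_flag)                    # (batch, tgt, src, d_kv)

# attention scores with an MF-dependent term, broadcast over heads
scores = torch.matmul(query, key.transpose(-1, -2))      # (batch, heads, tgt, src)
scores = scores + torch.einsum('bhtd,btsd->bhts', query, k_bias)
attn = scores.softmax(dim=-1)
context = torch.matmul(attn, value) + torch.einsum('bhts,btsd->bhtd', attn, v_bias)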
My question is: why can't we embed the MF matrix uniformly in the T5 decoder (for example, in class T5Block), similar to the trainable absolute position embeddings of BERT or GPT-2? And do the separate, independent embeddings affect how the model understands the meaning of MF, or could we simply initialize all the MF embedding tables with the same frozen parameters?
These details are pretty sketchy in the paper, so I couldn't find any explanation there. A rough sketch of the shared setup I have in mind follows below.
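For concreteness, this is a minimal sketch of the "uniform" setup I mean: one pair of MF tables created at the decoder level and handed to every layer, so all layers read from the same parameters (class and argument names here are hypothetical, not from the repo):

import torch.nn as nn

class MFAttentionLayer(nn.Module):
    """Placeholder layer that only records references to the shared MF tables."""
    def __init__(self, k_mention_flag, v_mention_flag):
        super().__init__()
        # references only; no new parameters are created per layer
        self.k_mention_flag = k_mention_flag
        self.v_mention_flag = v_mention_flag

class SharedMFDecoder(nn.Module):
    def __init__(self, num_layers=6, d_kv=64, mention_flag_num=3):
        super().__init__()
        # single tables owned by the decoder, like BERT/GPT-2 absolute position embeddings
        self.k_mention_flag = nn.Embedding(mention_flag_num, d_kv)
        self.v_mention_flag = nn.Embedding(mention_flag_num, d_kv)
        self.layers = nn.ModuleList(
            MFAttentionLayer(self.k_mention_flag, self.v_mention_flag)
            for _ in range(num_layers)
        )

decoder = SharedMFDecoder()
# every layer points at the same parameter tensor
assert all(l.k_mention_flag.weight is decoder.k_mention_flag.weight for l in decoder.layers)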
Thanks for your interest in our research work. In Table 5 of the MF paper, we did run a shared MF embedding that is shared by all T5 layers. It performs slightly worse than having a separate MF embedding for each layer.
In general, both settings are fine for the PLM to understand MF.
Let me know if you would like to know more about this!
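If you want to try the shared setting yourself on top of the per-layer code, one simple way (illustrative only, not a helper from the repo) is to tie every layer's MF tables to the first layer's after the model is built; collecting the T5Attention modules that have use_mention_flag enabled is left to you:

def share_mention_flag_tables(attention_layers):
    """Tie the k/v mention-flag embeddings of all attention layers to the first one's."""
    first = attention_layers[0]
    for layer in attention_layers[1:]:
        # re-pointing the attributes makes all layers share the same nn.Embedding
        layer.k_mention_flag = first.k_mention_flag
        layer.v_mention_flag = first.v_mention_flag
    return attention_layers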
Sorry for my oversight, and thanks a lot.
It's good to know that people are interested in our work! I will close the issue now. Don't hesitate to contact me if you would like to know more!