Should positional embeddings be MuReadout parameters?
codedecde opened this issue · 2 comments
codedecde commented
Duplicate of question asked on the mutransformers repository (link)
Hi!
I was wondering if (learned) positional embeddings should be MuReadout layers, since they map to a finite-dimensional space. Specifically (from https://github.com/microsoft/mutransformers/blob/480287ce7b18a07a3432e8f2fbc0f0e5b71e2599/mutransformers/models/bert/modeling_bert.py#L174):
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
In addition to that, did you try using muP for sparse MoE models? I'm curious about any findings there. Specifically, I was wondering whether the routing gate (hdim, num_experts) would also be a MuReadout layer (if we don't scale the number of experts); a sketch of what I mean is below.
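For concreteness, a minimal sketch of the gate I have in mind (the sizes and names here are just placeholders):

```python
import torch.nn as nn

hidden_size = 1024   # hdim: the model width
num_experts = 8      # fixed number of experts (not scaled with width)

# Hypothetical MoE routing gate: projects the model width down to the expert logits.
routing_gate = nn.Linear(hidden_size, num_experts)
```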
Would be grateful for any advice :)
Thank you!
thegregyang commented
Position embedding maps to an infinite dimension (config.hidden_size). Why do you say it's finite?
Yes, the routing gate should be MuReadout.
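Concretely, a minimal sketch of that treatment using the `mup` package (sizes are placeholders, and base shapes would still need to be set with `mup.set_base_shapes` before training):

```python
import torch.nn as nn
from mup import MuReadout

hidden_size = 1024             # model width: the dimension that scales under muP
max_position_embeddings = 512  # finite, but it is the input (row) dimension of the table
num_experts = 8                # finite output dimension, assuming experts are not scaled

# Position embedding: each lookup returns a width-sized (hidden_size) vector,
# i.e. it maps into the infinite dimension, so it stays a regular embedding.
position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)

# Routing gate: maps the width down to a fixed number of experts
# (infinite -> finite), so it gets the readout treatment.
routing_gate = MuReadout(hidden_size, num_experts)
```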
codedecde commented
Thank you!
I meant that the sequence-length aspect is finite (similar to the vocab size)?