Why is the speaker embedding g used to condition the Posterior Encoder and the Decoder?
st-vincent1 opened this issue · 0 comments
st-vincent1 commented
I am confused why the speaker embedding `g` is used to condition multiple model components (Posterior Encoder, Decoder, Flow) as opposed to just the Flow.
From the model diagram in Fig. 1 (a) (Training procedure), the speaker embedding `g` is used to condition the normalising Flow. This makes sense: at inference time, this information is used in the reversed Flow to map the `z'` distribution back into a speaker-informed `z`, which was modelled after the real data `x_lin` with the Posterior Encoder.
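To make my understanding of the Flow concrete, here is a toy sketch of a speaker-conditioned invertible transform. It is not the model's actual Flow: `speaker_params`, `flow_forward`, and `flow_reverse` are hypothetical names, and the single affine map is an assumption chosen only to show how `g` could be removed in one direction and reinserted in the other.

```python
import math

def speaker_params(g):
    """Hypothetical mapping from a speaker embedding to (shift, log_scale)."""
    shift = sum(g) / len(g)
    log_scale = 0.1 * max(g)
    return shift, log_scale

def flow_forward(z, g):
    """Training direction: z -> z' (speaker-dependent shift and scale removed)."""
    shift, log_scale = speaker_params(g)
    return [(v - shift) * math.exp(-log_scale) for v in z]

def flow_reverse(z_prime, g):
    """Inference direction: z' -> z (speaker-dependent shift and scale restored)."""
    shift, log_scale = speaker_params(g)
    return [v * math.exp(log_scale) + shift for v in z_prime]

g_a = [0.5, 1.0, 1.5]    # toy embedding for speaker A
g_b = [-1.0, 0.0, 2.0]   # toy embedding for speaker B
z = [0.2, -0.7, 1.3]     # toy latent from the Posterior Encoder

z_prime = flow_forward(z, g_a)
z_back = flow_reverse(z_prime, g_a)    # same speaker: recovers z
z_other = flow_reverse(z_prime, g_b)   # different speaker: a different z
```

In this toy version, reversing with the same `g` recovers `z`, while reversing with another speaker's `g` yields a different latent, which is the behaviour I would expect from conditioning the Flow alone.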
To me this seems like enough supervision, and I am confused why `g` is used in other places too:
- in the Posterior Encoder, which uses `x_lin` as input, `g` is also supplied, but it shouldn't be needed as `x_lin` already contains the speaker information! (And `g` is not mentioned in section 2.2.2 of the paper, where this encoder is discussed.)
- in the Decoder, similarly, `z` is already informed with the speaker embedding, so why do we need to explicitly supply it here?
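To summarise the wiring I am asking about, here is a minimal sketch of the training-time data flow as I understand it. All function names and signatures are placeholders of mine, not the repository's actual API, and the bodies are stubs that only record whether `g` was passed.

```python
conditioned = []  # records which components received g

def posterior_encoder(x_lin, g=None):
    conditioned.append(("posterior_encoder", g is not None))
    return list(x_lin)  # placeholder for z produced from x_lin

def flow(z, g=None):
    conditioned.append(("flow", g is not None))
    return list(z)  # placeholder for z'

def decoder(z, g=None):
    conditioned.append(("decoder", g is not None))
    return list(z)  # placeholder for the generated waveform

def training_step(x_lin, g):
    z = posterior_encoder(x_lin, g=g)  # first use of g (the one I question)
    z_prime = flow(z, g=g)             # second use of g (the one in Fig. 1 (a))
    y_hat = decoder(z, g=g)            # third use of g (also questioned)
    return z_prime, y_hat

z_prime, y_hat = training_step([0.1, 0.2, 0.3], g=[1.0, 0.0])
```

After the call, `conditioned` lists all three components as receiving `g`, whereas from the diagram I would have expected only the Flow to need it.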