Why is the speaker embedding g used to condition the Posterior Encoder and the Decoder?

Question

st-vincent1 opened this issue a year ago · 0 comments

I am confused about why the speaker embedding g is used to condition multiple model components (Posterior Encoder, Decoder, Flow) rather than just the Flow.

From the model diagram in Fig. 1 (a) (Training procedure), the speaker embedding g is used to condition the normalising Flow. This makes sense: at inference time, this information is used in the reversed Flow to map the z' distribution back to a speaker-informed z, which was modelled on the real data x_lin by the Posterior Encoder.
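
To make the reasoning above concrete, here is a minimal sketch of a flow conditioned on g. It is only a toy element-wise affine transform, not the coupling layers used in the actual model, and every name in it (`shift_and_log_scale`, `flow_forward`, `flow_reverse`, `g_a`, `g_b`) is hypothetical. It illustrates that reversing the flow with the same g recovers the original z, while reversing with a different g yields a different, speaker-dependent z.

```python
import math

def shift_and_log_scale(g):
    # Hypothetical stand-in for a conditioning network: maps a speaker
    # embedding g (a list of floats) to a shift and a log-scale.
    shift = sum(g) / len(g)
    log_scale = 0.1 * math.tanh(g[0])
    return shift, log_scale

def flow_forward(z, g):
    """Training direction: speaker-informed z -> z'."""
    shift, log_scale = shift_and_log_scale(g)
    return [(v - shift) * math.exp(-log_scale) for v in z]

def flow_reverse(z_prime, g):
    """Inference direction: z' -> speaker-informed z."""
    shift, log_scale = shift_and_log_scale(g)
    return [v * math.exp(log_scale) + shift for v in z_prime]

# Toy values, chosen only for illustration.
g_a = [0.5, -1.0, 2.0]   # embedding of speaker A
g_b = [-0.3, 0.8, 0.1]   # embedding of speaker B
z = [0.2, -0.7, 1.5]     # latent as produced by the Posterior Encoder

z_prime = flow_forward(z, g_a)
z_same = flow_reverse(z_prime, g_a)    # round trip with the same speaker
z_other = flow_reverse(z_prime, g_b)   # reverse with a different speaker
```

In this toy setting the Flow alone is enough to carry speaker identity in and out of z; the sketch does not address whether the extra conditioning of the Posterior Encoder and Decoder helps in practice.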

To me this seems like enough supervision, and I am confused why g is used in other places too:

- In the Posterior Encoder, which takes x_lin as input, g is also supplied. It shouldn't be needed, as x_lin already contains the speaker information! (And g is not mentioned in section 2.2.2. of the paper, where this encoder is discussed.)
- In the Decoder, similarly, z is already informed by the speaker embedding, so why do we need to supply g explicitly here?