Question about training process
sphmel opened this issue · 3 comments
It seems the T5 embedding from FrozenT5 has shape (B, max_length, D).
Lines 73 to 78 in e21c1bf
Is the text_feature used for the semantic loss in the quantizer the mean-pooled T5 embedding from FrozenT5?
Line 113 in e21c1bf
- Neural codecs and vocoders are usually trained on random segments of audio. Is LLM-Codec also trained on random segments, or on whole audio?
- Yes, we use global pooling to extract the T5 embedding (see the sketch after this list).
- We randomly crop a segment for training; since the encoder is a convolutional network, it can encode audio of any length at inference.
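A minimal sketch of what "global pooling" over the FrozenT5 output could look like, assuming the padding positions are excluded via an attention mask; the actual semantic loss in the repo may differ (cosine distance here is an assumption, and `pool_t5_embedding` / `semantic_loss` are hypothetical names):

```python
import torch
import torch.nn.functional as F

def pool_t5_embedding(t5_emb: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool a (B, max_length, D) FrozenT5 output over non-padded tokens -> (B, D)."""
    mask = attention_mask.unsqueeze(-1).float()   # (B, max_length, 1)
    summed = (t5_emb * mask).sum(dim=1)           # (B, D)
    counts = mask.sum(dim=1).clamp(min=1.0)       # (B, 1), avoid divide-by-zero
    return summed / counts

def semantic_loss(sem_feat: torch.Tensor, text_feature: torch.Tensor) -> torch.Tensor:
    """Pull the codec's semantic feature toward the pooled text embedding (assumed cosine distance)."""
    return 1.0 - F.cosine_similarity(sem_feat, text_feature, dim=-1).mean()
```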
@yangdongchao Thanks for the fast reply. So the T5 embedding comes from the padded transcript or caption of the whole utterance, while the quantized latent comes from a random crop?
By the way, I really like this approach of injecting subword- and word-level information directly into the codec.
Yes, you are right. I am sorry for the late reply; I did not notice this message over the past few days.
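So the training setup pairs a randomly cropped audio segment with a text feature computed from the full transcript. A rough sketch of that asymmetry, assuming the module names (`codec`, `frozen_t5`, `recon_loss`) and the 3-second crop are illustrative rather than taken from the repo:

```python
import random
import torch

def random_crop(wav: torch.Tensor, segment_len: int) -> torch.Tensor:
    """Crop a fixed-length training segment; shorter clips are zero-padded."""
    if wav.size(-1) <= segment_len:
        return torch.nn.functional.pad(wav, (0, segment_len - wav.size(-1)))
    start = random.randint(0, wav.size(-1) - segment_len)
    return wav[..., start:start + segment_len]

# Training step (sketch): the audio crop feeds the convolutional encoder,
# while the text feature always comes from the whole transcript/caption.
# segment = random_crop(wav, segment_len=3 * sample_rate)
# codes, recon, sem_feat = codec(segment)
# text_feature = pool_t5_embedding(*frozen_t5(transcript))   # full transcript
# loss = recon_loss(recon, segment) + semantic_loss(sem_feat, text_feature)
```

At inference no cropping is needed, since the convolutional encoder handles arbitrary input lengths.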