yangdongchao/LLM-Codec

Question about training process

sphmel opened this issue · 3 comments

It seems the T5 embedding from FrozenT5 has shape (B, max_length, D):

def encode(self, text):
    t5_batch_encoding = self.t5_tokenizer(text, truncation=True, max_length=self.max_length,
                                          return_length=True, return_overflowing_tokens=False,
                                          padding="max_length", return_tensors="pt")
    struct_tokens = t5_batch_encoding["input_ids"].to(self.device)
    z = self.t5_transformer(input_ids=struct_tokens).last_hidden_state
    return z  # (B, max_length, D)

  1. Is the text_features used for the semantic loss in the quantizer a mean-pooled T5 embedding from FrozenT5? (See the shape sketch after this list.)

     tmp_g_se_loss = self.loss_fn(z_q_i.mean(2), text_features)

  2. Neural codecs and vocoders are usually trained on random segments of audio. Is LLM-Codec also trained on random segments, or on whole audio?
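
To make the shapes concrete, here is a minimal stand-in sketch of my understanding; the (B, D, T) layout of z_q_i and the L1 loss are assumptions on my part, not code from this repo:

import torch

# Stand-in tensors only, to check shapes; not the actual training code.
B, D, T, L = 4, 768, 150, 64                # batch, embed dim, audio frames, text tokens

z = torch.randn(B, L, D)                    # FrozenT5.encode output: (B, max_length, D)
text_features = z.mean(dim=1)               # global mean pooling over tokens -> (B, D)

z_q_i = torch.randn(B, D, T)                # quantized latent, assumed layout (B, D, T)
loss_fn = torch.nn.L1Loss()                 # placeholder; the actual loss_fn may differ
tmp_g_se_loss = loss_fn(z_q_i.mean(2), text_features)  # pool over time -> (B, D)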

  1. Yes, we use global pooling to extract the T5 embedding.
  2. We randomly crop a segment for training. Since the encoder is a convolutional network, it can encode audio of any length at inference time. (See the sketch below.)
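
For illustration, a minimal sketch of the random-crop idea; the helper name and segment length here are made up, not taken from the repo:

import torch
import torch.nn.functional as F

def random_crop(wav: torch.Tensor, segment_len: int) -> torch.Tensor:
    """Randomly crop a fixed-length training segment from a waveform of shape (T,)."""
    if wav.shape[-1] <= segment_len:
        # Zero-pad clips shorter than the target segment length.
        return F.pad(wav, (0, segment_len - wav.shape[-1]))
    start = torch.randint(0, wav.shape[-1] - segment_len + 1, (1,)).item()
    return wav[..., start:start + segment_len]

# Training: fixed-length random segments, e.g. one second at 16 kHz.
segment = random_crop(torch.randn(48000), segment_len=16000)   # -> (16000,)

# Inference: the convolutional encoder is length-agnostic, so whole
# utterances of any length can be encoded without cropping.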

@yangdongchao Thanks for the fast reply. So the T5 embedding comes from the padded transcript or caption of the whole utterance, while the quantized latent comes from a random crop?

By the way, I really like this approach of injecting subword- and word-level information directly into the codec.


Yes, you are right. I am sorry for the late reply; I did not notice this message in the past few days.