yangdongchao/LLM-Codec

Question about training process

sphmel opened this issue · 3 comments

It seems the T5 embedding from FrozenT5 has shape (B, max_length, D):

def encode(self, text):
    t5_batch_encoding = self.t5_tokenizer(text, truncation=True, max_length=self.max_length,
                                          return_length=True, return_overflowing_tokens=False,
                                          padding="max_length", return_tensors="pt")
    struct_tokens = t5_batch_encoding["input_ids"].to(self.device)
    z = self.t5_transformer(input_ids=struct_tokens).last_hidden_state
    return z  # (B, max_length, D)

  1. Is the text_features used for the semantic loss in the quantizer a mean-pooled T5 embedding from FrozenT5? (See the shape sketch after this list.)

     tmp_g_se_loss = self.loss_fn(z_q_i.mean(2), text_features)

  2. Neural codecs and vocoders are usually trained on random segments of audio. Is LLM-Codec also trained on random segments, or on whole audio?
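
To make the shapes concrete, here is a minimal stand-in sketch of my understanding; the (B, D, T) layout of z_q_i and the L1 loss are assumptions on my part, not code from this repo:

import torch

# Stand-in tensors only, to check shapes; not the actual training code.
B, D, T, L = 4, 768, 150, 64                # batch, embed dim, audio frames, text tokens

z = torch.randn(B, L, D)                    # FrozenT5.encode output: (B, max_length, D)
text_features = z.mean(dim=1)               # global mean pooling over tokens -> (B, D)

z_q_i = torch.randn(B, D, T)                # quantized latent, assumed layout (B, D, T)
loss_fn = torch.nn.L1Loss()                 # placeholder; the actual loss_fn may differ
tmp_g_se_loss = loss_fn(z_q_i.mean(2), text_features)  # pool over time -> (B, D)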

  1. Yes, we use global pooling to extract the T5 embedding.
  2. We randomly crop a segment for training. Since the encoder is a convolutional network, it can encode audio of any length at inference time. (See the sketch below.)
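
For illustration, a minimal sketch of the random-crop idea; the helper name and segment length here are made up, not taken from the repo:

import torch
import torch.nn.functional as F

def random_crop(wav: torch.Tensor, segment_len: int) -> torch.Tensor:
    """Randomly crop a fixed-length training segment from a waveform of shape (T,)."""
    if wav.shape[-1] <= segment_len:
        # Zero-pad clips shorter than the target segment length.
        return F.pad(wav, (0, segment_len - wav.shape[-1]))
    start = torch.randint(0, wav.shape[-1] - segment_len + 1, (1,)).item()
    return wav[..., start:start + segment_len]

# Training: fixed-length random segments, e.g. one second at 16 kHz.
segment = random_crop(torch.randn(48000), segment_len=16000)   # -> (16000,)

# Inference: the convolutional encoder is length-agnostic, so whole
# utterances of any length can be encoded without cropping.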

@yangdongchao Thanks for the fast reply. So the T5 embedding comes from the padded transcript or caption of the whole utterance, while the quantized latent comes from a random crop?

By the way, I really like this approach of injecting subword- and word-level information directly into the codec.


Yes, you are right. I am sorry for the late reply; I did not notice this message in the past few days.