OpenMOSS/AnyGPT

Question about how the music vocabulary size of 8192 is implemented


I was reading the code that generates the music codes:

    tokens = encode_music_by_path(music.strip(), self.music_sample_rate, self.music_tokenizer,
                                  self.music_processor, self.device,
                                  segment_duration=self.music_segment_duration,
                                  one_channel=True, start_from_begin=True)
    tokens = tokens[0][0]
    processed_inputs = modality_tokens_to_string(tokens=tokens, modality="music")

The paper says the music is "quantized using an RVQ with four quantizers, each with a codebook size of 2048, resulting in a combined music vocabulary size of 8192." Is that implemented in the following line?

    processed_inputs = modality_tokens_to_string(tokens=tokens, modality="music")

And are four layers used because four codebooks are needed?

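The discretization is implemented in encode_music_by_path. How many layers are used depends on the configuration of the codec you choose. You can use more layers or fewer; there is a tradeoff between sequence length and quality.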
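
As a rough illustration of the arithmetic in the paper's sentence (4 quantizers × 2048 entries = 8192 ids), the sketch below shows one common way to map per-layer RVQ codes into a single vocabulary by giving each layer its own id range. It is not taken from the AnyGPT repository: `flatten_rvq_codes`, the frame-by-frame interleaving order, and the example codes are all assumptions made for illustration.

```python
from typing import List

CODEBOOK_SIZE = 2048  # per-quantizer codebook size quoted from the paper
NUM_QUANTIZERS = 4    # number of RVQ layers quoted from the paper


def flatten_rvq_codes(codes: List[List[int]],
                      codebook_size: int = CODEBOOK_SIZE) -> List[int]:
    """Hypothetical helper: turn codes of shape (n_q, T) into one id sequence.

    Code c from layer k becomes k * codebook_size + c, so every layer owns a
    disjoint id range and the combined vocabulary has n_q * codebook_size ids.
    Frames are interleaved: all layers of frame 0, then all layers of frame 1.
    """
    n_frames = len(codes[0])
    if any(len(layer) != n_frames for layer in codes):
        raise ValueError("every quantizer layer must have the same number of frames")
    flat = []
    for t in range(n_frames):
        for k, layer in enumerate(codes):
            c = layer[t]
            if not 0 <= c < codebook_size:
                raise ValueError(f"code {c} is outside [0, {codebook_size})")
            flat.append(k * codebook_size + c)
    return flat


# Made-up codes for 4 layers and 3 frames (assumption, not real codec output).
codes = [
    [5, 17, 2047],
    [0, 1, 2],
    [100, 200, 300],
    [2047, 0, 9],
]
flat = flatten_rvq_codes(codes)
vocab_size = NUM_QUANTIZERS * CODEBOOK_SIZE

print(vocab_size)  # 8192
print(len(flat))   # 12, i.e. n_q * T tokens
```

Under this scheme the token sequence has `n_q * T` entries, which is the tradeoff mentioned in the reply: each extra layer adds detail but lengthens the sequence the language model has to handle.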