Question about how the music vocabulary size of 8192 is implemented

Question

I was looking at the code that generates the music codes:

    tokens = encode_music_by_path(music.strip(), self.music_sample_rate, self.music_tokenizer, self.music_processor, self.device, segment_duration=self.music_segment_duration, one_channel=True, start_from_begin=True)
    tokens = tokens[0][0]
    processed_inputs = modality_tokens_to_string(tokens=tokens, modality="music")

The paper says the music is "quantized using an RVQ with four quantizers, each with a codebook size of 2048, resulting in a combined music vocabulary size of 8192." Is that implemented in the following line?

    processed_inputs = modality_tokens_to_string(tokens=tokens, modality="music")

Are four layers used because four codebooks are needed?

Answer 1 · 2024-05-22T01:50:57.000Z

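The discretization is implemented in encode_music_by_path. How many layers are used depends on the configuration of the codec you choose. You can use more layers or fewer; there is a tradeoff between sequence length and quality.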
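
To make the arithmetic in the paper quote concrete, here is a minimal sketch of one common way that four codebooks of 2048 entries each can be mapped into a single vocabulary of 8192 ids: each layer's code is shifted by a per-layer offset. This is an illustration only, not the repository's code. The helper names flatten_rvq_codes and unflatten_rvq_codes are hypothetical, and the frame-major interleaving order is an assumption.

```python
# Illustrative sketch only: NOT the repository's implementation.
# Assumption: codes from layer q are offset by q * CODEBOOK_SIZE so that
# all layers share one vocabulary of NUM_QUANTIZERS * CODEBOOK_SIZE ids.

NUM_QUANTIZERS = 4      # RVQ layers, per the paper quote
CODEBOOK_SIZE = 2048    # entries per codebook, per the paper quote
VOCAB_SIZE = NUM_QUANTIZERS * CODEBOOK_SIZE  # 4 * 2048 = 8192


def flatten_rvq_codes(codes):
    """Map per-layer codes to one token stream (hypothetical helper).

    codes: list of NUM_QUANTIZERS lists, each holding one code per frame.
    Returns a frame-major list: all layers of frame 0, then frame 1, ...
    """
    if len(codes) != NUM_QUANTIZERS:
        raise ValueError("expected one code sequence per quantizer")
    num_frames = len(codes[0])
    if any(len(layer) != num_frames for layer in codes):
        raise ValueError("all layers must have the same number of frames")

    flat = []
    for t in range(num_frames):
        for q in range(NUM_QUANTIZERS):
            code = codes[q][t]
            if not 0 <= code < CODEBOOK_SIZE:
                raise ValueError("code out of codebook range")
            flat.append(q * CODEBOOK_SIZE + code)  # shift into layer q's id range
    return flat


def unflatten_rvq_codes(flat):
    """Inverse of flatten_rvq_codes (hypothetical helper)."""
    if len(flat) % NUM_QUANTIZERS != 0:
        raise ValueError("length must be a multiple of NUM_QUANTIZERS")
    codes = [[] for _ in range(NUM_QUANTIZERS)]
    for token_id in flat:
        q, code = divmod(token_id, CODEBOOK_SIZE)
        codes[q].append(code)
    return codes


example_codes = [
    [0, 1],        # layer 0
    [0, 1],        # layer 1
    [2047, 5],     # layer 2
    [2047, 2047],  # layer 3
]
example_flat = flatten_rvq_codes(example_codes)
```

Under this scheme the flattened sequence has NUM_QUANTIZERS tokens per frame, which is where the tradeoff mentioned in the answer shows up: using more layers lengthens the sequence, and using fewer shortens it at the cost of quality.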