Pretrained models
Nian-Chen opened this issue · 6 comments
According to the content of offset_pretrained_checkpoints.json, there should be four models, but only one model is available on Baiduyun. Running infer.sh then fails when loading the missing models. Hope this can be resolved. Thanks!
Hi, you can download them under the link provided in readme.md.
As illustrated, they should be:
flan-t5-large
clap_music
roberta-base
others
Hi! An error occurred while running infer.sh:
Non-fatal Warning [dataset.py]: The wav path " " is not find in the metadata. Use empty waveform instead. This is normal in the inference process.
Error encounter during audio feature extraction: mel() takes 0 positional arguments but 5 were given
Theoretically, the wav path should not be used at all. Where do I need to modify the code?
I ran into this problem before; it is caused by the version of the librosa library.
My solution was to modify the call directly in the library package:
use: mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
instead of: mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
That should help.
Thank you for your reply. Now that it works, there are two things I would like to confirm with you:
- GPU memory usage is about 25 GB.
- It takes about 5 minutes to generate 10 s of audio.
The cost is quite high at present. Is this normal? Also, what advantages do you think this project has over MusicGen? The sound quality seems better.
Well, that is not normal. For me, inference finishes within 25 s on an NVIDIA V100 24GB.
The memory it uses is probably close to 24 GB. If you are running into OOM, you might consider running flan-t5 or hifi-gan on the CPU, leaving only the MDT model part on the GPU.
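A minimal PyTorch sketch of that partial offload, assuming the pipeline exposes `text_encoder` (flan-t5), `vocoder` (hifi-gan), and `mdt` submodules; these attribute names are hypothetical, not the project's actual API:

```python
# Hypothetical sketch of partial CPU offload. The attribute names
# (mdt, text_encoder, vocoder) are illustrative, NOT the project's real API.
import torch
import torch.nn as nn

def offload_encoders(pipeline, gpu="cuda"):
    """Keep only the diffusion backbone (MDT) on the GPU; run the
    text encoder and vocoder on the CPU to cut peak GPU memory,
    at the cost of some extra latency."""
    device = gpu if torch.cuda.is_available() else "cpu"
    pipeline.mdt.to(device)
    pipeline.text_encoder.to("cpu")
    pipeline.vocoder.to("cpu")
    return pipeline
```

Inputs then need to be moved to the matching device before each submodule is called.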
Compared with MusicGen, this project should be superior in both the content quality of the music and its aesthetic musicality. Our approach innovatively introduces a quality-aware training strategy, with a much smaller parameter count than MusicGen (675M vs. 3.3B) and an open-source training set.
However, we have to admit that our music length is limited to 10 s (it can be extended, but we have not done so yet). Additionally, you can replace the DDIM inference with a more advanced solver (e.g. DPM-Solver, a consistency model) to improve speed.