Some questions regarding training
Respaired opened this issue · 0 comments
1- Should I, or can I, train a model at a higher sample rate (say 48 kHz) instead of 16 kHz, or does this architecture simply not support that? I was wondering what would happen if I simply started training at a higher sample rate with the provided model.
2- How about style transfer from one audio clip or piece of music to another, alongside text prompts? I mean conditioning the generated audio on a reference audio sample, for example on its melody or its sounds. I'm asking because models trained with this code appear to have their own inference procedure compared to the original repo.
3- Is there going to be an implementation of stereo generation?
4- The paper mentions that a single RTX 3090 is enough to train the model (you also used 8x A100 for AudioLDM2). How long did training take? I don't think it was mentioned in the paper, or perhaps I missed it.
5- Last but not least, how much should I increase unet_in_channels, the number of res blocks, the embed dim, etc. in the config file to get a model as large as the largest AudioLDM?
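
To make question 5 concrete, here is a rough back-of-the-envelope sketch of how I understand width scaling. It assumes a plain residual block made of two 3x3 convolutions, which may not match the actual UNet blocks in this repo. The function names and the channel widths 128 and 256 are hypothetical and not taken from the config file.

```python
def conv2d_params(c_in, c_out, kernel=3, bias=True):
    """Parameter count of one 2D convolution: kernel*kernel*c_in*c_out weights plus optional bias."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def resblock_params(channels, kernel=3):
    """Hypothetical residual block: two same-width convolutions (norm layers ignored)."""
    return 2 * conv2d_params(channels, channels, kernel)


# Illustrative widths only; these are assumptions, not values from the repo.
base = resblock_params(128)
wide = resblock_params(256)
ratio = wide / base  # close to 4: doubling the width roughly quadruples the conv weights

print(f"base={base}, wide={wide}, ratio={ratio:.3f}")
```

If this reasoning holds, doubling the channel width would roughly quadruple the convolution parameters, while adding res blocks would only grow the model linearly, so I would guess width is the main knob. Please correct me if the real blocks scale differently.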
Thanks in advance. Looking forward to AudioLDM2's code as well, if it is going to be released.