NZqian opened this issue a year ago · 0 comments
It seems that the model is conditioned on text embeddings in the config, while the paper concludes that it is better to use audio embeddings. Which one is better?