Confusion about the residual modeling
Nice work!
In the paper, the residual clean speech x_0 and the residual noisy speech y_0 are adopted as the input of the stochastic model S_θ.
However, in the CVPR 2022 paper 'Deblurring via Stochastic Refinement', I find that the stochastic model takes the blurry image y and the clean residual x_0 - gθ(x_0) as input, where x_0 is the clean image and gθ(·) is the deterministic model.
Here is my confusion. You use the residual noisy speech y_0 as the condition of the diffusion model, while the CVPR paper directly uses the blurry image y as the condition. Since the diffusion is performed on the residual, I think your solution is more straightforward.
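To make the comparison concrete, here is a minimal NumPy sketch of how I read the two setups. It is only an illustration under my own assumptions: `det` is a hypothetical placeholder for the output of the deterministic model, the function names are mine, and defining the noisy residual as `y - det` is my guess rather than something taken from either paper.

```python
import numpy as np


def residual_condition(x, y, det):
    """Setup as I understand this repo: both inputs are residuals.

    x   : clean signal
    y   : degraded (noisy) signal
    det : output of the deterministic model (hypothetical placeholder)
    """
    target = x - det  # residual clean signal, the quantity being diffused
    cond = y - det    # residual noisy signal (assumed definition of y_0)
    return target, cond


def direct_condition(x, y, det):
    """Setup as I understand the deblurring paper: condition on y itself."""
    target = x - det  # same clean residual as above
    cond = y          # degraded input used directly as the condition
    return target, cond


# Toy data, purely illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
y = x + 0.1 * rng.standard_normal(8)
det = 0.9 * y  # stand-in for a deterministic estimate, not a real model

t_res, c_res = residual_condition(x, y, det)
t_dir, c_dir = direct_condition(x, y, det)
```

In this reading the diffusion target is identical in both setups, and the two conditions differ only by the deterministic estimate `det`.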
I'm not sure if my understanding is correct, and I would like to hear your insights.
Hi @zzwei1
Your understanding is correct. One thing to add: adding the noisy speech to the diffusion process itself and using the noisy speech only as a condition for the diffusion model give different results. I don't know whether you are interested in speech enhancement, but I will post an audio comparison of the two approaches in a while. I am preparing a follow-up article. Thanks for your attention.