train a control net and use it with SD?
loveofcsdt opened this issue · 1 comment
Hi @loveofcsdt,
Our architecture differs from ControlNet. We modify the Stable Diffusion denoising network by extending the kernels of its first convolution so that it can handle multiple spatial conditions (i.e., pose map, sketch, masked model, inpainting mask).
We call spatial inputs all the inputs that are concatenated to the latent variables and fed into the denoising network.
To the best of our knowledge, the ControlNet paper does not show cases in which the network is conditioned on multiple spatial inputs at the same time. Our method natively handles pose, sketch, and text simultaneously, and it still works when some of the input conditions are missing.
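To make the idea concrete, here is a hypothetical sketch (not the authors' code) of extending the first convolution of a pretrained UNet to accept extra condition channels concatenated to the latents. The channel counts and the zero-initialization of the new kernel slices are assumptions, chosen so the extended layer initially reproduces the pretrained behavior:

```python
import torch
import torch.nn as nn

def extend_first_conv(conv: nn.Conv2d, extra_in_channels: int) -> nn.Conv2d:
    """Return a copy of `conv` whose kernels accept extra input channels.

    The pretrained weights are kept; the new kernel slices are zero-initialized
    so the extended layer initially behaves exactly like the original one.
    """
    new_conv = nn.Conv2d(
        conv.in_channels + extra_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Hypothetical usage: 4 latent channels plus a 3-channel pose map and a
# 1-channel sketch, concatenated along the channel dimension.
first_conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)
extended = extend_first_conv(first_conv, extra_in_channels=3 + 1)

latents = torch.randn(1, 4, 64, 64)
pose = torch.randn(1, 3, 64, 64)
sketch = torch.zeros(1, 1, 64, 64)  # a missing condition can simply be zeroed
out = extended(torch.cat([latents, pose, sketch], dim=1))
```

Because the extra kernels start at zero, the extended network matches the pretrained one at initialization, and a missing condition can be fed as an all-zero map without changing the output.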
We didn't have time to compare our version with ControlNet, since ControlNet was released around the time we finished our research.
Since we noticed that the evaluation in other papers on generative tasks is usually qualitative, in our paper we also propose two novel metrics that measure the adherence of the generated images to the corresponding input conditions (i.e., a pose metric and a sketch metric).
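The precise definitions of these metrics are in the paper; purely as an illustration of the idea, an adherence score for pose might compare the input keypoints with keypoints re-detected on the generated image. Everything below (the function name, the normalized-coordinate convention, the use of a mean Euclidean distance) is a hypothetical sketch, not the paper's metric:

```python
import numpy as np

def pose_adherence(cond_keypoints: np.ndarray, detected_keypoints: np.ndarray) -> float:
    """Mean Euclidean distance between corresponding keypoints (lower is better).

    Both arrays have shape (num_keypoints, 2) in normalized image coordinates.
    In practice `detected_keypoints` would come from a pose estimator run on
    the generated image; here it is passed in directly for illustration.
    """
    return float(np.linalg.norm(cond_keypoints - detected_keypoints, axis=1).mean())

cond = np.array([[0.5, 0.2], [0.4, 0.5], [0.6, 0.5]])
det = np.array([[0.5, 0.2], [0.4, 0.5], [0.6, 0.5]])
print(pose_adherence(cond, det))  # 0.0 for a perfect match
```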