cross-attention maps are not robust
jinxixiang opened this issue · 4 comments
jinxixiang commented
jinxixiang commented
songweige commented
Hi @jinxixiang, thank you for your interest and for trying out our demo!
You are right that the token maps are sometimes unstable and inaccurate. We have been working on improving this and have made some progress. Here is an example that changes the hair color.
jinxixiang commented
@songweige Wow, this improved result is great! What did you change in the cross-attention map? It looks refined.
I think one workaround for this problem is to use Grounded SAM and split the denoising process into two stages: the first stage obtains the region mask, and the second stage performs region-based diffusion. We are working on it.
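
To make the two-stage idea concrete, here is a rough toy sketch in plain Python. Everything in it is an assumption for illustration: `binarize`, `blend_by_mask`, `two_stage_denoise`, and the `get_mask` / `predict_base` / `predict_region` callables are hypothetical names, not the API of this repo or of Grounded SAM, and the `latent - eps` update is a placeholder for a real scheduler step.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]  # toy 2D "latent" or map, one float per pixel


def binarize(score_map: Grid, threshold: float = 0.5) -> Grid:
    """Turn a soft per-pixel score map into a hard 0/1 region mask.

    Stand-in for stage 1: in practice the mask would come from a
    segmentation model rather than from thresholding.
    """
    return [[1.0 if v >= threshold else 0.0 for v in row] for row in score_map]


def blend_by_mask(region_pred: Grid, base_pred: Grid, mask: Grid) -> Grid:
    """Use the region prediction inside the mask and the base prediction outside."""
    return [
        [m * r + (1.0 - m) * b for r, b, m in zip(r_row, b_row, m_row)]
        for r_row, b_row, m_row in zip(region_pred, base_pred, mask)
    ]


def two_stage_denoise(
    latent: Grid,
    get_mask: Callable[[Grid], Grid],
    predict_base: Callable[[Grid, int], Grid],
    predict_region: Callable[[Grid, int], Grid],
    steps: int,
) -> Tuple[Grid, Grid]:
    # Stage 1: obtain the region mask once, up front (hypothetical callable).
    mask = get_mask(latent)

    # Stage 2: region-based denoising with a fixed mask at every step.
    for t in reversed(range(steps)):
        eps = blend_by_mask(predict_region(latent, t), predict_base(latent, t), mask)
        # Placeholder update; a real pipeline would call its scheduler here.
        latent = [[x - e for x, e in zip(x_row, e_row)]
                  for x_row, e_row in zip(latent, eps)]
    return latent, mask
```

The point of the design is that the mask is computed once and stays fixed during stage 2, so the edited region no longer depends on per-step cross-attention maps that may drift.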
songweige commented