hkchengrex/Cutie

'a mask prediction' in Sec. 3.2.2 of Paper


Is the mask prediction single channel, i.e., H×W×1?

[image attached]

Yes.

I have a question about the detail of Object Memory

  1. The object memory is computed with $N$ pooling masks $W$. However, unlike the mask $M_l$, which is projected from the pixel features and supervised by the GT mask, these pooling masks have no label constraining them. I don't understand what information these pooling masks contain, or why one half can be foreground predictions and the other half background predictions. Have you visualized these masks directly?

Isn't $W$ generated from the memory feature $F$ through an MLP?
[screenshot attached]

What do you mean by "constraint label"? $W$ is directly constructed from $M_l$ in the screenshot that you provided. There are no additional transformations. Those masks are just the masks in Figure 4 (and their inverse).

Figure 4 shows $M_l$ rather than the pooling masks $W$.

Oh, right. Sorry -- it slipped my mind. We have visualized them before at some point. IIRC those masks are rather diffuse and don't have very recognizable patterns. They are learned end-to-end.
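
For readers following the thread, here is a rough sketch of how pooling masks of this kind could summarize a feature map into $N$ object-memory vectors. It is not taken from the Cutie codebase. The function name `pool_object_memory`, the `logits` input (standing in for the per-pixel output of a learned projection of $F$), the sigmoid, and the 0.5 threshold that gates the first half of the masks to the foreground and the second half to the background are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pool_object_memory(features, logits, mask, eps=1e-8):
    """Masked average pooling with N pooling masks (illustrative only).

    features: P pixels, each a list of C floats (flattened feature map F).
    logits:   N rows of P floats; hypothetical stand-in for a learned
              per-pixel projection of F (one row per pooling mask).
    mask:     P floats in [0, 1], the object mask.
    Returns N summary vectors of C floats each.
    """
    n = len(logits)
    channels = len(features[0])
    summaries = []
    for q, row in enumerate(logits):
        # Assumption: the first half of the pooling masks only sees
        # foreground pixels, the second half only background pixels.
        if q < n // 2:
            gate = [1.0 if m >= 0.5 else 0.0 for m in mask]
        else:
            gate = [0.0 if m >= 0.5 else 1.0 for m in mask]
        # Pooling weight per pixel: learned soft weight times the hard gate.
        weights = [sigmoid(z) * g for z, g in zip(row, gate)]
        total = sum(weights) + eps
        # Weighted average of the features under this pooling mask.
        summaries.append([
            sum(w * f[c] for w, f in zip(weights, features)) / total
            for c in range(channels)
        ])
    return summaries


# Tiny example: 4 pixels, 2 channels, N = 2 pooling masks.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
mask = [1.0, 1.0, 0.0, 0.0]
logits = [[0.0] * 4, [0.0] * 4]
memory = pool_object_memory(features, logits, mask)
```

In this sketch only the foreground/background gate comes from the object mask. The soft weights inside each region have no direct supervision and would be shaped solely by the downstream loss, which is in line with the masks being learned end-to-end and looking diffuse.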