Question about the algorithm and training procedure

Question

Closed this issue a year ago · 2 comments

Hi Cheng,

I'm new to the VOS area and after reading the paper I've still got two questions about the algorithm.

Does the readout in XMem (and in Cutie) turn the VOS task from learning an $\text{img}\rightarrow\text{mask}$ map to learning a $\text{similar img}\rightarrow\text{similar mask}\rightarrow\text{mask}$ map through the retrieval process? (local feature level)
Is the long-term memory module not involved in the training process? Does it only occur at test time? As you state in the paper the training sequences are of length eight. Which seems smaller than the $T_{max}=10$.

Thank you for taking the time to read this issue. I greatly appreciate any advice you can provide.

Answer 1 · 2024-01-19T20:57:14.000Z

Answer 2 · 2024-01-21T05:29:23.000Z

Thank you for your reply!