hkchengrex/XMem

Why do we have different/decreasing `max_skip_values` as we progress through stage 03 training?


I wanted to understand the idea behind having different `max_skip_values`. The value starts at 10, increases to 15, and then drops back to 5, 5.

Is there any intuition or reason for doing this?

Another question I had: since the number of frames used for training is 8, as mentioned in the paper, how is the model able to do well on long videos, where the frame count can be in the thousands, when it was never trained for such long-form memory usage?
Even if the max jump was 15, the highest difference in frames for a single video could be 15*8 = 120 frames during training.

It is for curriculum learning: first we go from easy to hard cases, then we anneal back to 5, which is closer to what is used during inference.

Note that max jump sets the maximum. We don't always sample at the maximum.
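Loosely, the sampling plus curriculum looks like the sketch below. This is not XMem's actual dataloader code; the schedule fractions and helper names are made up for illustration, and only the 10/15/5/5 values come from the question above.

```python
import random

# Hypothetical curriculum: (fraction of training completed, max skip) pairs.
# Only the 10 -> 15 -> 5 -> 5 values come from the question; the fractions
# and function names here are illustrative, not XMem's real config.
SKIP_SCHEDULE = [(0.0, 10), (0.1, 15), (0.3, 5), (0.8, 5)]

def current_max_skip(progress):
    """Max allowed frame gap at a given training progress in [0, 1]."""
    max_skip = SKIP_SCHEDULE[0][1]
    for start_frac, skip in SKIP_SCHEDULE:
        if progress >= start_frac:
            max_skip = skip
    return max_skip

def sample_clip(num_video_frames, clip_len, progress):
    """Sample `clip_len` frame indices; gaps are drawn uniformly in [1, max_skip]."""
    max_skip = current_max_skip(progress)
    indices = [random.randrange(num_video_frames)]
    for _ in range(clip_len - 1):
        gap = random.randint(1, max_skip)  # max_skip is an upper bound, not a fixed gap
        indices.append(min(indices[-1] + gap, num_video_frames - 1))
    return indices

# e.g. halfway through training, sample an 8-frame clip from a 100-frame video
print(sample_clip(num_video_frames=100, clip_len=8, progress=0.5))
```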

See
https://arxiv.org/pdf/2103.07941
https://davischallenge.org/challenge2020/papers/DAVIS-Semisupervised-Challenge-1st-Team.pdf

Thanks.

Can you also give some intuition on the second part: how does training on 8-frame videos, with at most 3 frames in memory, lead to good long-video segmentation capability during inference?

> Another question I had: since the number of frames used for training is 8, as mentioned in the paper, how is the model able to do well on long videos, where the frame count can be in the thousands, when it was never trained for such long-form memory usage?
> Even if the max jump was 15, the highest difference in frames for a single video could be 15*8 = 120 frames during training.

It generalizes. It is not unlike how CNNs generalize to different resolutions, or how LLMs generalize to different sequence lengths with relative position embeddings. Learning a robust appearance representation (as queries/keys) is enough to go a long way. It might not be optimal -- but we don't really have sufficiently long video datasets at this time.
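One way to see why the length doesn't matter: the attention-style memory readout is defined for any memory size, so the same learned query/key features apply whether the memory holds 3 frames or hundreds. A minimal PyTorch sketch, using plain dot-product affinity rather than XMem's actual anisotropic L2 affinity, with illustrative shapes and names:

```python
import torch

def memory_readout(query, memory_keys, memory_values):
    """
    Attention-style memory readout with plain dot-product affinity.
    query:         (B, C_k, HW)  -- key features of the current frame
    memory_keys:   (B, C_k, N)   -- N memory elements (any N)
    memory_values: (B, C_v, N)
    returns:       (B, C_v, HW)
    """
    affinity = torch.einsum('bcn,bcq->bnq', memory_keys, query)  # (B, N, HW)
    weights = torch.softmax(affinity, dim=1)                     # normalize over the memory axis
    return torch.einsum('bcn,bnq->bcq', memory_values, weights)  # (B, C_v, HW)

hw = 16 * 16  # spatial positions per frame (illustrative)
q = torch.randn(1, 64, hw)

# Short, training-like memory (3 frames) ...
out_short = memory_readout(q, torch.randn(1, 64, 3 * hw), torch.randn(1, 512, 3 * hw))
# ... and a much longer inference-time memory (hundreds of frames), same weights.
out_long = memory_readout(q, torch.randn(1, 64, 300 * hw), torch.randn(1, 512, 300 * hw))
print(out_short.shape, out_long.shape)  # both torch.Size([1, 512, 256])
```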