microsoft/MaskFlownet

Some Questions About the Ablation Experiments

lidongyv opened this issue · 2 comments

Thanks for your great work!
I understand you stress the unsupervised learning of the mask, and I read your code to confirm that you do learn the mask in an unsupervised manner. However, since we can all obtain occlusion maps to supervise the training of MaskFlownet-S, simply adding an EPE or a cross-entropy loss might guide MaskFlownet-S to learn a better attention mask. I understand it would take a long time to generate all of the mask maps for these datasets, which is indeed a problem.
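To make the suggested supervision concrete, here is a minimal NumPy sketch of a binary cross-entropy loss between a predicted soft mask and a ground-truth occlusion map. The function name and shapes are hypothetical illustrations, not taken from the MaskFlownet code:

```python
import numpy as np

def bce_mask_loss(pred_mask, gt_occlusion, eps=1e-7):
    """Binary cross-entropy between a predicted soft mask in [0, 1]
    and a binary ground-truth occlusion map of the same shape.
    (Hypothetical helper, not part of the MaskFlownet repository.)"""
    p = np.clip(pred_mask, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(gt_occlusion * np.log(p)
                           + (1.0 - gt_occlusion) * np.log(1.0 - p))))

# Toy example: a 2x2 predicted mask vs. its ground truth.
pred = np.array([[0.9, 0.1], [0.8, 0.2]])
gt = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = bce_mask_loss(pred, gt)  # small positive value
```

In practice this loss would be computed per pyramid level and added (with some weight) to the EPE flow loss.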
Here are some questions about the demonstration of intermediate results:

  1. Did you compare supervised and unsupervised learning of the mask, and their final influence on the results?
  2. The mask seems correct in the foreground-background case, but in practice we don't really care about background flow that is far from the foreground. Do you have more visualizations of the mask on objects that are moving close to each other, such as the third and seventh rows in Figure 12?

Thanks for your attention.

Hi Lidong, thanks for your interest and your insightful questions!

For question 1, we haven't done any experiment on that comparison. One issue is that the mask used to filter out useless information (the shaded area) after warping might not be the same as the ground-truth occlusion map at each level. Please feel free to draw comparisons between our method and others', and we look forward to hearing what you discover!
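For readers following along, the filtering step mentioned here can be sketched roughly as follows: the warped feature map is multiplied elementwise by a soft mask, and a learnable per-channel bias fills in the suppressed (shaded) regions. This is a simplified NumPy illustration of the idea, not the repository's actual implementation; all names are hypothetical:

```python
import numpy as np

def masked_warp_modulate(warped_feat, mask, mu):
    """Sketch of occlusion-aware feature filtering after warping.
    warped_feat: (C, H, W) features warped from the second image
    mask:        (H, W) soft mask in [0, 1]; ~0 marks shaded regions
    mu:          (C,) per-channel bias filling the masked-out regions
    (Hypothetical helper for illustration only.)"""
    return warped_feat * mask[None] + mu[:, None, None]

# Toy example: one channel, 2x2 features, right column fully masked.
feat = np.ones((1, 2, 2))
mask = np.array([[1.0, 0.0], [1.0, 0.0]])
mu = np.array([0.5])
out = masked_warp_modulate(feat, mask, mu)
```

Because the mask is learned end-to-end to help matching rather than to reproduce occlusion labels, it need not coincide with a ground-truth occlusion map at any given level.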

For question 2, we have already released the weights used to produce the visualizations, so you can use them to generate more! Our view is that the object-object case involves smaller regions and larger relative motion, so it might be harder than the foreground-background case, but the mechanism behind both is the same: the background can be seen simply as an object with relatively small motion.

@simon1727 Thanks for your answer.

For question 1, the reason I raised it is that I conducted some experiments on occlusion years ago. When I used the ground-truth occlusion map to guide the refinement, it gave a great improvement. But the improvement became much less obvious when I switched from the ground-truth occlusion map to a supervised learned one. Although my results are lost now, I still remember how hard it was to learn the occlusion map.
It is a great finding of your work that unsupervised learning of attention maps may be more suitable for guiding the refinement. An unsupervised activation map derived from the learned feature maps in the last few layers indeed reflects the attention behind the final activation. Thanks again for your discovery.

For question 2, what I am actually interested in is large-motion cases. Object-object cases always involve larger relative motion. I may be able to generate some cases from crowded scenes if I can access the data later. The problem might change if the motion is too large, since, as we all know, both FlowNet and PWC-Net struggle with low-frame-rate video. I might need to conduct more experiments to see the result.

Thanks again for your great work and patience.