antonilo/unsupervised_detection

Why does the network learn the moving segment instead of its complement?

Closed this issue · 4 comments

lliuz commented

Hi @antonilo
Thank you for sharing your impressive work. I ran your code and got results similar to yours.
However, I have a doubt: how do you ensure that the network learns the moving segment rather than its complement? It seems that you train the moving mask and its complement in a completely symmetric way.

lliuz commented

Hi @antonilo

I read your code carefully and I think that your work is not very convincing.
I would like to raise some questions here:

  1. You train on the trainval split and test on val, so the test samples have already been seen during training. This is unusual in the CVPR field, yet you do not point it out in the paper, which I find very imprecise.
  2. I found that your model cannot actually distinguish the moving foreground from the background. When I ran your code, I noticed something strange: some masked_flow outputs are masked by the foreground and others by the background, yet the test results look fine.
    [screenshot of masked_flow outputs]
    I double-checked your code and found that in the function compute_IoU you choose between the generated mask and its complement by comparing both against the ground truth. I think this is somewhat cheating.

I hope you can give me an explanation.
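To make the objection concrete: the selection the commenter describes amounts to an "oracle" step that picks whichever of the predicted mask or its complement better matches the ground truth. A minimal sketch of that logic, with hypothetical names (`select_mask_oracle` is illustrative, not the repo's actual `compute_IoU`):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 1.0

def select_mask_oracle(pred, gt):
    """Return the predicted mask or its complement, whichever has
    higher IoU with the ground truth -- the step the commenter
    objects to when it happens at evaluation time."""
    comp = np.logical_not(pred)
    return pred if iou(pred, gt) >= iou(comp, gt) else comp
```

If the network has inverted foreground and background, this step silently flips the prediction back, which is why an inverted `masked_flow` can still score well in testing.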

> Hi @antonilo
> Thank you for sharing your impressive work. I ran your code and got results similar to yours.
> However, I have a doubt: how do you ensure that the network learns the moving segment rather than its complement? It seems that you train the moving mask and its complement in a completely symmetric way.

Thanks for your interest!

The main goal is to check whether the proposed model can separate the image domain into two regions, each independent of the other, where independence is defined through the mutual information between the two motion fields. Since no benchmark exists for this task, we chose to evaluate on video object segmentation. Because the definition of foreground in video object segmentation is inherently ambiguous (e.g. a car moving in front of a bus), some prior is needed to decide which region is the target. Many unsupervised methods rely on such a prior: "Unsupervised Video Object Segmentation with Motion-based Bilateral Networks" uses semantic segmentation masks to disambiguate foreground from background; "Unsupervised Online Video Object Segmentation with Motion Property Understanding" uses salient object detection and object proposals; "Primary Object Segmentation in Videos Based on Region Augmentation and Reduction" uses a saliency map and a motion edge map. This does not even count the methods that pre-train on MS-COCO with thousands of ground-truth masks for disambiguation. We could have used any of these as a disambiguation step during training, but for training efficiency we moved it to post-processing instead of running it online. No mask is ever used as a regression target. The score you see during training is only a rough indication of training progress, not an accurate score.

Hope this would resolve the misunderstanding and address your question.

@lliuz,
Thanks for your feedback. As my co-author said, the ground-truth masks are used only to give an indication of the final score. They are never used to compute the final predicted masks, since we use temporal consistency in post-processing to resolve the inverting problem.
Also, because our model needs no annotations at training time, we fine-tune it on the validation data. This is common practice in unsupervised learning papers (for optical flow, depth prediction, or any other task).

That said, we would like to thank you for raising these valid concerns. The code is indeed ambiguous, and we will update it to avoid confusing future readers!
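The temporal-consistency idea mentioned above can be sketched as follows: once the first frame's orientation is fixed, each subsequent frame keeps whichever of the predicted mask or its complement overlaps more with the previous resolved mask, so no ground truth is involved. This is a minimal illustration under stated assumptions; `resolve_inversions` is a hypothetical name, not the repo's actual post-processing code.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 1.0

def resolve_inversions(masks, first_is_foreground=True):
    """Propagate one foreground/background assignment through a video:
    per frame, keep the mask or its complement, whichever agrees more
    with the previous frame's resolved mask."""
    prev = masks[0] if first_is_foreground else np.logical_not(masks[0])
    resolved = [prev]
    for m in masks[1:]:
        comp = np.logical_not(m)
        prev = m if iou(m, prev) >= iou(comp, prev) else comp
        resolved.append(prev)
    return resolved
```

With this scheme, a single per-video disambiguation of the first frame suffices; occasional inverted predictions in later frames are flipped back by the overlap check.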

@lliuz, in #5 we have moved the object mask detection from post-processing into training to avoid confusion. The documentation has also been updated to answer this question. Please let us know if anything else is unclear. Thanks again for your feedback!