bethgelab/siamese-mask-rcnn

Results


@michaelisc Hello, when testing your pretrained weights, I am getting some very weird results:

[image: screenshot of detection results]

I also tested with arbitrary reference images, and the detected results were nearly the same. It seems the network doesn't care much about what the reference image is.

I think this is a big problem.

Hi @trungpham2606 and sorry for the late reply. This actually looks pretty bad, but it was somewhat expected given the results we saw on COCO. As we discuss in the paper, false positives are a big problem. The objects in your image match pretty well with what a typical COCO object looks like, so the model detects them easily, but the true/false decision is not working properly. It is sad to see it fail that obviously. I guess we will need more research on how to do the classification part right. In the end, identifying an object from a single reference image is not at all a trivial problem (even though it may seem trivial to us).

Sorry I can't provide a better answer. I guess you will have to wait until we or someone else comes up with a better solution. You may have a look at the other few-shot detection methods with available code, which to my knowledge are only LSTD and Few-Example Object Detection. However, both require you to train their model on your images and don't work directly from just a reference image. The other method that works from a reference alone has no available code, just a dataset.

@michaelisc I think the reference image currently doesn't have much effect; the model isn't really using the differences between the features of the target image and the reference image.

Yes, this is exactly the problem. While the reference has some relevance, the decision is also based on dataset statistics. The reason I mentioned false positives is that, when the model is unsure whether to detect an object, it has a strong bias toward detecting it even if the correspondence with the reference is small.
Another point that plays a role here is that the model has to decide whether an object is detected based on the similarity to the reference and an internally learned decision criterion. This criterion is learned from COCO instances, which are on average pretty far apart semantically and visually (a zebra looks very different from a potted plant). The buttons and objects you show, however, are visually quite close (and semantically too, at least closer than a zebra is to a button). A toy illustration of this decision is sketched below.
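
To make the failure mode concrete, here is a minimal sketch of the kind of decision described above. None of these names come from this repo, and the fixed threshold just stands in for the internally learned criterion:

```python
import numpy as np

def match_score(ref_embedding, proposal_embedding):
    # cosine similarity between L2-normalized feature vectors
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    prop = proposal_embedding / np.linalg.norm(proposal_embedding)
    return float(ref @ prop)

# Stand-in for the criterion learned on COCO, where classes are far apart
# and scores separate cleanly. For visually similar objects (e.g. several
# kinds of buttons) all proposals score near this value, so the decision
# degrades even though the detections themselves look reasonable.
LEARNED_THRESHOLD = 0.5

def is_detection(ref_embedding, proposal_embedding):
    return match_score(ref_embedding, proposal_embedding) > LEARNED_THRESHOLD
```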

A suggestion I'd have is to use the predictions to quickly annotate a new dataset, which you can then use to train a standard Faster or Mask R-CNN. Training on Cityscapes works with fewer than 3000 images. As your problem is simpler, I guess tens or a few hundred images might be enough; a rough sketch of how to bootstrap annotations this way follows below.
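
Something along these lines (a hedged sketch, not code from this repo; `model.detect` follows the matterport Mask R-CNN convention this project builds on, so you would have to adapt it to the Siamese variant's signature):

```python
def predictions_to_annotations(model, images, score_threshold=0.9):
    """Keep only high-confidence detections as candidate annotations
    for manual review before training a standard Mask R-CNN."""
    annotations = []
    for image_id, image in enumerate(images):
        results = model.detect([image], verbose=0)[0]
        for roi, score in zip(results['rois'], results['scores']):
            if score < score_threshold:
                continue  # drop uncertain detections to limit false positives
            y1, x1, y2, x2 = (int(v) for v in roi)
            annotations.append({
                'image_id': image_id,
                'bbox': [x1, y1, x2 - x1, y2 - y1],  # COCO xywh convention
                'score': float(score),
            })
    return annotations
```

After a manual correction pass you could export these in COCO format and train a vanilla detector on them.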
Really sorry that Siamese Mask R-CNN does not work well enough for the task yet. Detecting arbitrary objects from a single reference is still a really hard research problem. Looks like we will need some more iterations before this is ready for your application.

@michaelisc Yeah, I see. Thank you for your support ^^. Hope to hear about new updates from you.
What do you think about the following: after getting those predictions (let's call them raw), apply an extra comparison of each detected region with the reference image (maybe a feature distance or something) to decide the final detections? The problem then becomes how to detect all objects in an image (since I don't want to miss any objects). Do you know of any models that can detect arbitrary objects in images? A rough sketch of the re-scoring step I have in mind is below.
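
Something like this is what I mean (just a sketch; the ResNet50 backbone and the distance threshold are arbitrary choices that would need tuning, not anything from this repo):

```python
import numpy as np
from skimage.transform import resize
from keras.applications.resnet50 import ResNet50, preprocess_input

# generic ImageNet-pretrained encoder for embedding crops and the reference
encoder = ResNet50(weights='imagenet', include_top=False, pooling='avg')

def embed(image):
    # resize to the backbone's input size and extract a pooled feature vector
    x = resize(image, (224, 224), preserve_range=True).astype('float32')
    features = encoder.predict(preprocess_input(x[np.newaxis]))[0]
    return features / (np.linalg.norm(features) + 1e-8)

def rescore_detections(image, rois, reference, max_distance=0.5):
    """Keep only raw detections whose crop is close to the reference
    image in feature space (cosine distance)."""
    ref_feat = embed(reference)
    kept = []
    for y1, x1, y2, x2 in np.asarray(rois, dtype=int):
        crop_feat = embed(image[y1:y2, x1:x2])
        distance = 1.0 - float(ref_feat @ crop_feat)
        if distance <= max_distance:
            kept.append(((y1, x1, y2, x2), distance))
    return kept
```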