hanoonaR/object-centric-ovd

open-vocabulary object detection with MViT

Thanks for the great work!

In the paper, the authors use MViT to extract high-quality class-specific proposals from image-level labels, relying on its strong generalization ability. A straightforward idea is therefore to do open-vocabulary detection with MViT directly. Similar to OV-DETR, we can use prompts like 'every {category}', forward the model once per class, and keep the top-scoring predictions for each class. However, my experiments show extremely poor results: 5.4 AP50 on novel classes and 3.8 AP50 on base classes. I'm confused about these results. Could you give me some advice?
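
For reference, the per-class prompting loop described above can be sketched as follows. The `mvit(image, prompt)` callable and its return format are placeholders for the actual MAVL/MDETR inference interface, not the repository's API:

```python
def ov_detect(mvit, image, categories, top_k=100):
    """Run one 'every {category}' query per class and pool the results.

    `mvit(image, prompt)` is assumed to return (boxes, scores) as torch
    tensors of shape [N, 4] and [N]; adapt this to the real inference call.
    """
    detections = []
    for cls_id, name in enumerate(categories):
        boxes, scores = mvit(image, f"every {name}")  # one forward pass per class
        top = scores.topk(min(top_k, scores.numel()))
        for score, idx in zip(top.values, top.indices):
            detections.append((boxes[idx], float(score), cls_id))
    # Rank all per-class predictions by confidence before AP evaluation.
    detections.sort(key=lambda d: d[1], reverse=True)
    return detections
```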

Hi @fushh,

Thank you for sharing your insights on using MViT for open-vocabulary detection. From your description, you are seeing very low performance, particularly on AP50 for both novel and base classes.

It is crucial to understand that MViTs such as MAVL and MDETR primarily focus on grounding text to the corresponding objects in the image. When presented with a category that exists in the image, they excel at identifying and grounding it. However, when the category is not present, the model may still assign high confidence scores to irrelevant regions. This mismatch is likely the main cause of the poor results you observed.

Some suggestions to improve the results:

  1. One approach is to limit the model queries to only the classes known to be present in the image. For instance, if you are certain the image contains dogs and umbrellas, you would use only the prompts "Every dog" and "Every umbrella". This reduces false positives, as shown in the attached example. By focusing only on relevant categories, you improve the model's precision, which is crucial in object detection, where accurate detections and avoiding false positives are equally important. For example, if you have a caption describing the image, you can infer the likely objects from that caption and restrict the prompts accordingly (see the first sketch below the list).

  2. Another strategy is to prioritize recall over precision when evaluating. This focuses on the model's ability to detect all relevant objects without being overly penalized for incorrect class predictions, and can give a more balanced view of the model's behavior in open-vocabulary settings (see the recall sketch below the list).

  3. Lastly, you can visualize the class-specific proposals from MViT. Understanding why certain incorrect predictions occur can guide further refinements in model training or prompt design (see the visualization sketch below the list).
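
A minimal sketch of suggestion 1, assuming a paired caption is available. The `mvit` callable is the same placeholder as above, and naive substring matching stands in for a proper noun-phrase parser:

```python
def detect_with_caption_prior(mvit, image, caption, vocabulary):
    """Query only the vocabulary entries the paired caption mentions."""
    caption = caption.lower()
    present = [name for name in vocabulary if name.lower() in caption]
    return {name: mvit(image, f"every {name}") for name in present}
```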
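
For suggestion 2, a minimal class-agnostic recall computation (plain NumPy, boxes in (x1, y1, x2, y2) format) could look like this:

```python
import numpy as np

def iou_matrix(pred, gt):
    """Pairwise IoU between [N, 4] predicted and [M, 4] ground-truth boxes."""
    x1 = np.maximum(pred[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(pred[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(pred[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(pred[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_p[:, None] + area_g[None, :] - inter)

def recall_at_k(pred_boxes, gt_boxes, k=100, iou_thr=0.5):
    """Fraction of GT boxes matched by any top-k proposal, ignoring class."""
    if len(gt_boxes) == 0:
        return 1.0
    ious = iou_matrix(pred_boxes[:k], gt_boxes)
    return float((ious.max(axis=0) >= iou_thr).mean())
```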
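
And for suggestion 3, a small matplotlib helper to inspect the proposals returned for a single prompt:

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def show_proposals(image, boxes, scores, prompt, score_thr=0.5):
    """Overlay the proposals for one prompt; boxes are (x1, y1, x2, y2)."""
    fig, ax = plt.subplots()
    ax.imshow(image)
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        if s < score_thr:
            continue
        ax.add_patch(patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                       fill=False, edgecolor="lime", linewidth=2))
        ax.text(x1, y1, f"{s:.2f}", color="lime")
    ax.set_title(prompt)
    ax.axis("off")
    plt.show()
```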

Hope this helps. Thank you.