wl-zhao/VPD

Is using test file name for inference a fair practice?

Aradhye2002 opened this issue · 2 comments

Isn't it wrong to use the test image's file name during inference? For example, if I renamed the files to img1.png, img2.png, ..., the code would not work. You also can't run inference on images whose class_id you don't know, or that don't fall into any of the class_ids.
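To make the concern concrete, here is a minimal sketch of file-name-based class lookup. It is not the repository's code: `SCENE_CLASSES`, `class_id_from_path`, and the example paths are all hypothetical, and the real class list and path layout may differ.

```python
from pathlib import PurePosixPath

# Hypothetical scene categories; the real list used by the repository may differ.
SCENE_CLASSES = ["bathroom", "bedroom", "kitchen", "living_room"]


def class_id_from_path(path: str) -> int:
    """Return the index of the first scene class named in the path, or -1 if none matches."""
    for part in PurePosixPath(path).parts:
        for idx, name in enumerate(SCENE_CLASSES):
            if part.startswith(name):
                return idx
    return -1


# A path that encodes the scene category resolves to a class index.
known = class_id_from_path("test/bathroom_0013/rgb_00045.jpg")

# A renamed file carries no category information, so the lookup fails.
unknown = class_id_from_path("test/img1.png")
```

With a lookup like this, any image whose path does not contain a known category name ends up without a usable class_id, which is the situation described above.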

For the depth estimation task, we do not think that introducing the category name of the scene is unfair. The task focuses on low-level details, while the provided category name is a high-level concept. In VPD, we only use the category name to better exploit the pre-trained knowledge of the text-to-image diffusion model.

We can also run our model without the category name by using another simple network to predict the category of the scene (which is not difficult to learn), and the results should stay the same.

Thanks for the reply!

Another question I had is about taking the mean of the text embeddings over all the ImageNet templates for a given class. What is the motivation for this? The original Stable Diffusion model takes a single text sentence as input, right?
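For anyone following along, this is a small sketch of the operation I am asking about: fill each template with the class name, encode each sentence, and average the results into one embedding per class. `toy_encode` is only a hash-based stand-in so the snippet runs without any model; it is not the text encoder used by Stable Diffusion, and `TEMPLATES` and `class_embedding` are names made up for illustration.

```python
import hashlib
import math
from typing import List

# Two illustrative templates; the ImageNet template list contains many more.
TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}."]


def toy_encode(text: str, dim: int = 8) -> List[float]:
    """Stand-in for a real text encoder: a deterministic unit vector derived from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def class_embedding(class_name: str, dim: int = 8) -> List[float]:
    """Average the embeddings of every filled-in template for one class."""
    embeddings = [toy_encode(t.format(class_name), dim) for t in TEMPLATES]
    # zip(*embeddings) groups the values per dimension so each can be averaged.
    return [sum(col) / len(embeddings) for col in zip(*embeddings)]


bedroom = class_embedding("bedroom")
```

The output has the same dimensionality as a single sentence embedding, so it can be used wherever one sentence embedding would be; what I would like to understand is why the average is preferred over a single prompt.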