mahmoudnafifi/C5

Many detailed questions

shuwei666 opened this issue · 1 comment

Thanks for your great work! It has indeed sparked a lot of inspiration for me. However, there are several aspects that I would like to discuss further:

The paper mentioned: "To allow the network to reason about the set of additional input images in a way that is insensitive to their ordering, we adopt the permutation invariant pooling approach of Aittala et al."

1. Could you elaborate on why insensitivity to ordering is crucial? Specifically, I'm curious whether a sufficiently large training dataset would inherently cover all potential orderings.

Regarding the number of additional unlabeled images (m), it appears that they were used in both the training and testing stages. From the ablation study, it seems that various values of m were only tested on the test camera, as illustrated in Table 4. I have a question about this:

2. During the training process, did you experiment with varying quantities for 'm', or was there a consistent fixed number applied throughout, for example, 8?

When m equals 1, I understand that this means only the query image is used during testing. If so, my question is:

3. Could you clarify whether m=1 only signifies the zero-shot condition, i.e., just inferring, or does it mean that the single query image is used for self-calibration, followed by parameter fixation, and then inference?

4. From the results shown in Table 4, it doesn't seem that the results improve as m increases (i.e., error(m=13) > error(m=7)). Could you provide some insights into this?

5. Have you considered using additional labeled images for fine-tuning? If so, would this lead to better results than the current method?

Thank you for taking the time to answer these questions. Your responses will be greatly beneficial to my understanding.

Hi, thanks for your questions. Here are my responses.

Could you elaborate on why insensitivity to ordering is crucial? Specifically, I'm curious whether a sufficiently large training dataset would inherently cover all potential orderings.

It may happen that the network ignores one or more of the additional inputs and relies on the others during training. To prevent that, we use permutation-invariant pooling.
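For intuition, here is a minimal sketch (not the actual C5 code) of what permutation-invariant pooling over the additional inputs looks like: the per-image features are reduced with an element-wise max over the set dimension, so the result cannot depend on the order of the inputs. The tensor shapes and the function name below are illustrative assumptions.

```python
import torch

def permutation_invariant_pool(features: torch.Tensor) -> torch.Tensor:
    """features: (m, C, H, W) feature maps, one per additional input image.

    Returns a (C, H, W) tensor that is identical for any permutation
    of the first (set) dimension.
    """
    pooled, _ = features.max(dim=0)  # element-wise max over the set dimension
    return pooled

# Quick check: shuffling the order of the inputs does not change the output.
feats = torch.randn(7, 64, 32, 32)
perm = torch.randperm(7)
assert torch.equal(permutation_invariant_pool(feats),
                   permutation_invariant_pool(feats[perm]))
```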

During the training process, did you experiment with varying quantities for 'm', or was there a consistent fixed number applied throughout, for example, 8?

The value of 'm' affects our network architecture, since we have 'm' encoders. So when we say, for example, m=7, that means we have 7 encoders and, of course, that 6 additional images are used during both training and testing.
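To illustrate how m ties into the architecture, below is a toy sketch, again not the released C5 code, of a model with one encoder branch per input image (the query plus m-1 additional images) whose branch outputs are aggregated with the same order-insensitive max pooling as above. All layer sizes and names are made up for illustration.

```python
import torch
import torch.nn as nn

class ToySetEncoder(nn.Module):
    def __init__(self, m: int, in_ch: int = 2, feat_ch: int = 64):
        super().__init__()
        # One encoder branch per input image: the query plus m - 1 additional images.
        self.encoders = nn.ModuleList(
            [nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1) for _ in range(m)]
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (m, in_ch, H, W) -- one tensor per input image.
        feats = torch.stack([
            enc(img.unsqueeze(0)).squeeze(0)
            for enc, img in zip(self.encoders, images)
        ])
        pooled, _ = feats.max(dim=0)  # order-insensitive aggregation of the branches
        return pooled  # would feed the decoder(s) that predict the final output

model = ToySetEncoder(m=7)
out = model(torch.randn(7, 2, 64, 64))
print(out.shape)  # torch.Size([64, 64, 64])
```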

Could you clarify whether m=1 only signifies the zero-shot condition, i.e., just inferring, or does it mean that the single query image is used for self-calibration, followed by parameter fixation, and then inference?

m = 1 means that only the query image is used as an input, with no additional images.

From the results shown in Table 4, it doesn't seem that the results improve as m increases (i.e., error(m=13) > error(m=7)). Could you provide some insights into this?

This needs more investigation, but probably one of the reasons is that m=13, for example, requires 13 encoders, which increases our model's capacity and thus leads to some overfitting.

Have you considered using additional labeled images for fine-tuning? If so, would this lead to better results than the current method?

Fine-tuning on the testing set would definitely help and may lead to better results. However, the goal of our paper is to avoid any further training/tuning (kind of taking on a challenge). In practice, though, I would argue that fine-tuning on a small set is still feasible and may lead to better results.