floodsung/LearningToCompare_FSL

About the testing problem

xenuts opened this issue · 1 comment

Nice work, but I found a problem that really confuses me.

As shown in omniglot_train_few_shot.py, in both the training and testing phases the support set (i.e. sample_images) and the evaluation set (i.e. test_images) are drawn from the same 5 classes (one such draw is called a task). With accuracy calculated this way, it is easy to reach ~99% during training.
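For concreteness, here is a minimal sketch of that episodic sampling, assuming all_classes maps each class name to its list of image paths (the names here are hypothetical, not the repo's actual task_generator API):

```python
import random

def sample_task(all_classes, class_num=5, sample_num=5, query_num=5):
    """Draw one episode: support and query images share the SAME classes."""
    task_classes = random.sample(sorted(all_classes), class_num)
    support, query = [], []
    for label, cls in enumerate(task_classes):
        images = random.sample(all_classes[cls], sample_num + query_num)
        support += [(img, label) for img in images[:sample_num]]
        query += [(img, label) for img in images[sample_num:]]
    # labels 0..class_num-1 are only meaningful inside this task
    return support, query
```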

Here is where I found a problem I cannot explain: when I draw the support set from one task and the evaluation set from another task, the two tasks obviously contain different sets of 5 classes.

So I presumed that feeding them into the network would give low confidences, but the results show otherwise.
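In code, the cross-task experiment looks roughly like this (reusing the hypothetical sample_task sketch above):

```python
# Support images come from task A, evaluation images from an unrelated task B,
# so (with high probability) no evaluation class appears in the support set.
support_a, _ = sample_task(all_classes)  # 5 classes for the support set
_, query_b = sample_task(all_classes)    # 5 different classes for evaluation
```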
Here is the test case I used:

** [Testing set] CLASS_NUM=5, SAMPLE_NUM_PER_CLASS=5
the character classes are:
['Angelic/character11',
'Angelic/character11',
'Angelic/character11',
'Angelic/character11',
'Angelic/character11',
'Syriac_(Serto)/character08',
'Syriac_(Serto)/character08',
'Syriac_(Serto)/character08',
'Syriac_(Serto)/character08',
'Syriac_(Serto)/character08',
'Japanese_(hiragana)/character42',
'Japanese_(hiragana)/character42',
'Japanese_(hiragana)/character42',
'Japanese_(hiragana)/character42',
'Japanese_(hiragana)/character42',
'Gujarati/character27',
'Gujarati/character27',
'Gujarati/character27',
'Gujarati/character27',
'Gujarati/character27',
'Glagolitic/character09',
'Glagolitic/character09',
'Glagolitic/character09',
'Glagolitic/character09',
'Glagolitic/character09']
** [Support set] CLASS_NUM=5, SAMPLE_NUM_PER_CLASS=5
the character classes are:
'N_Ko/character27',
'N_Ko/character27',
'N_Ko/character27',
'N_Ko/character27',
'N_Ko/character27',
'Japanese_(katakana)/character18',
'Japanese_(katakana)/character18',
'Japanese_(katakana)/character18',
'Japanese_(katakana)/character18',
'Japanese_(katakana)/character18',
'Oriya/character33',
'Oriya/character33',
'Oriya/character33',
'Oriya/character33',
'Oriya/character33',
'Tibetan/character14',
'Tibetan/character14',
'Tibetan/character14',
'Tibetan/character14',
'Tibetan/character14',
'Tifinagh/character45',
'Tifinagh/character45',
'Tifinagh/character45',
'Tifinagh/character45',
'Tifinagh/character45'

** And I got the output confidences via `probs, predict_labels = torch.max(relations.data, 1)`, as below:
0.9999995, 0.9999894, 0.00067013013, 1.0, 0.9999995,
0.0013619913, 0.45683807, 0.003507328, 0.99994755, 0.20433362,
0.9999981, 0.76437086, 0.4761213, 0.99345946, 0.25436476,
0.0002244339, 0.00026010931, 0.87288016, 1.8067769e-05, 0.00053879694,
1.0, 1.0, 1.0, 1.0, 1.0
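For reference, a minimal sketch of how these values are read off the network output, assuming relations is the (num_queries × CLASS_NUM) matrix of sigmoid relation scores produced by the relation module:

```python
import torch

relations = torch.rand(25, 5)  # placeholder for the relation module's output
probs, predict_labels = torch.max(relations.data, 1)
# Each row holds 5 independent sigmoid scores (the network is trained with MSE
# against 0/1 targets), so `probs` is a per-query maximum, not a probability
# from a distribution that sums to 1 across classes.
```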

It's really weird: the support set and the testing set contain completely different classes, yet the output confidences are this high.

If I randomly pick an image from the entire Omniglot dataset and assume I don't know its class, how could I recognize that class by comparing the image against all possible support sets? The output confidences barely have any discriminability.
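The scenario I mean, sketched as code (relation_score here is a hypothetical helper standing for a forward pass through the trained encoder and relation module):

```python
def classify_against_all(image, support_sets, relation_score):
    """Brute-force: compare one unlabeled image against every support set."""
    best_cls, best_score = None, -1.0
    for support in support_sets:
        # relation_score(image, support) -> {class_name: score in [0, 1]}
        for cls, score in relation_score(image, support).items():
            if score > best_score:
                best_cls, best_score = cls, score
    return best_cls, best_score  # unreliable when scores saturate near 1.0
```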

Am I missing anything important, or have I misunderstood something?

Hello, I ran into the same problem as you during testing. Did you ever figure it out? Thanks for any reply.