Question about the batch part
SirRob1997 opened this issue · 1 comment
The implementation of the batching part seems quite unintuitive to me; maybe you can clear up my understanding:
We calculate the `before_vector` and the `after_vector`, which represent the class probabilities for the correct class before and after applying the masking for certain samples inside each batch. Next, we subtract the `after_vector` from the `before_vector`, which means the entries in `change_vector` represent whether the masking makes our classifier more or less confident about the correct class for that specific sample: negative values mean more confident, positive values mean less confident.
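To make sure we are talking about the same steps, here is a minimal PyTorch sketch of how I read this part (toy tensors; everything except `before_vector`, `after_vector`, and `change_vector` is my own naming, not necessarily what the repo uses):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, num_classes = 32, 5

# Stand-ins for the real network outputs.
logits_before = torch.randn(batch, num_classes)  # logits without masking
logits_after = torch.randn(batch, num_classes)   # logits with features masked
labels = torch.randint(0, num_classes, (batch,))

probs_before = F.softmax(logits_before, dim=1)
probs_after = F.softmax(logits_after, dim=1)

# Per-sample probability assigned to the correct class.
before_vector = probs_before.gather(1, labels.unsqueeze(1)).squeeze(1)
after_vector = probs_after.gather(1, labels.unsqueeze(1)).squeeze(1)

# Positive entries: masking lowered confidence in the correct class.
change_vector = before_vector - after_vector
```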
We are only interested in the positive values, i.e. the cases where masking decreases confidence, hence we calculate the Top-p threshold over only the positive values, as done in L.134 and L.135.
Next, in L.136 we check which entries are greater than our threshold, which yields a binary mask.
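Continuing the sketch above, this is how I read L.134–L.136 (the `top_p` value here is hypothetical):

```python
# Keep only the positive changes, i.e. actual confidence drops (L.134).
positive_changes = torch.where(change_vector > 0, change_vector,
                               torch.zeros_like(change_vector))

# Threshold such that roughly the top-p fraction of samples exceed it (L.135).
top_p = 0.1  # hypothetical percentage
threshold = torch.sort(positive_changes, descending=True)[0][int(batch * top_p)]

# Binary mask: True where the confidence drop exceeds the threshold (L.136).
drop_mask = positive_changes.gt(threshold)
```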
This is where my question comes in:
L.137 basically inverts the mask. So instead of reverting the masking for the Top-p percentage of samples where it decreases confidence, we are now reverting it for all samples besides the Top-p?
Am I correct about this? Why was it done this way? For self-challenging, keeping the masking for the Top-p percentage of samples with positive values seems more intuitive.
Also, while you're at it:
What is the purpose of subtracting 1e-5 in L.133? To me, this looks like a threshold (epsilon), i.e. the minimum confidence change required to keep the masking. How did the performance change without it? In theory, this would be another hyperparameter.
Hi, thanks for your question. The nonzero function in L.138 inverts the mask again, so the masking is applied to the Top-p percentage of samples. Does that explain it clearly?
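Roughly, the effect is the following (a sketch continuing the toy tensors above; `mask_all` is a stand-in for the real feature mask, and `feature_dim` is hypothetical):

```python
# L.137 inverts the drop mask: 1 marks samples whose masking will be undone.
ignore_mask = 1 - drop_mask.long()

# L.138: nonzero() selects the indices of those samples ...
ignore_indices = ignore_mask.nonzero()[:, 0]

# ... and the feature mask is reset to all-ones there, so the masking is
# reverted for every sample *except* the Top-p ones, which stay masked.
feature_dim = 16  # hypothetical feature dimension
mask_all = (torch.rand(batch, feature_dim) > 0.5).float()  # toy feature mask
mask_all[ignore_indices, :] = 1
```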
About the purpose of subtracting 1e-5: it is just to avoid some corner cases, for example when all of `change_vector`'s elements are zero.
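Concretely (continuing the sketch above), the epsilon pushes "no change at all" strictly below zero, so those entries are filtered out with the other non-positive values instead of sitting exactly on the threshold boundary:

```python
# Without the epsilon, before == after gives exactly 0; with it, those
# entries become slightly negative and are zeroed out by the positive-value
# filter rather than landing exactly at the sort/gt threshold.
change_vector = before_vector - after_vector - 1e-5
```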