DeLightCMU/RSC

Question about the batch part

SirRob1997 opened this issue · 1 comment

The implementation of the batching part seems quite unintuitive to me; maybe you can clear up some of my understanding:

We calculate before_vector and after_vector, which hold the predicted probability of the correct class for each sample in the batch, before and after the masking is applied.
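For concreteness, here is a minimal sketch of how I understand those two vectors; all names here (logits_before, logits_masked, labels) are my own stand-ins, not necessarily the repo's:

```python
import torch
import torch.nn.functional as F

batch_size, num_classes = 8, 7
logits_before = torch.randn(batch_size, num_classes)  # logits without masking
logits_masked = torch.randn(batch_size, num_classes)  # logits after masking
labels = torch.randint(0, num_classes, (batch_size,))

one_hot = F.one_hot(labels, num_classes).float()
# Probability assigned to the correct class, one entry per sample
before_vector = (one_hot * F.softmax(logits_before, dim=1)).sum(dim=1)
after_vector = (one_hot * F.softmax(logits_masked, dim=1)).sum(dim=1)
```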

Next, we subtract the after_vector from the before_vector, so the entries in change_vector indicate whether the masking makes our classifier more or less confident about the correct class for that specific sample: negative values mean more confident, positive values mean less confident.
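Continuing the sketch, with the sign convention spelled out:

```python
change_vector = before_vector - after_vector
# change_vector[i] > 0: masking made the classifier LESS confident for sample i
# change_vector[i] < 0: masking made the classifier MORE confident for sample i
```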

We are only interested in the positive values, i.e. the cases where masking decreases confidence, hence the Top-p threshold is calculated over only the positive values, as done in L.134 and L.135.

Next, in L.136 we check which entries are greater than our threshold, which yields a binary mask.
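A rough rendering of L.134-L.136, continuing the sketch above (the variable names and the value of p are assumptions on my part):

```python
p = 1.0 / 3.0  # assumed value of the batch-percentage hyperparameter
positive = torch.where(change_vector > 0, change_vector,
                       torch.zeros_like(change_vector))
# Threshold sits at the Top-p boundary of the descending-sorted changes (L.134-L.135)
threshold = torch.sort(positive, descending=True)[0][int(round(batch_size * p))]
drop_mask = positive.gt(threshold)  # L.136: binary mask, True for the Top-p samples
```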

This is where my question comes in:

L.137 basically inverts that mask. So instead of reverting the masking for the Top-p percentage of samples where it decreases confidence, we are now reverting it for all samples besides the Top-p?

Am I correct on this? Why was it done this way? For self-challenging, applying the masking to the Top-p percentage of samples with positive values seems more intuitive.

Also, while you're at it:

What is the purpose of subtracting 1e-5 in L.133? To me, this looks like a threshold (epsilon), i.e. the minimum confidence change required to keep the masking. How did the performance change without it? In theory, this is another hyperparameter.

Hi, thanks for your question. The nonzero function in L.138 effectively inverts the mask again: it extracts the indices of the samples outside the Top-p set, and only those samples have their masking reverted, so the masking is applied to the Top-p percentage of samples after all. Do I explain it clearly?
About the purpose of subtracting 1e-5: I just try to avoid some corner cases, for example when all of change_vector's elements are zero.
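To make the double inversion concrete, here is how I read L.137-L.138, continuing the sketch from above (mask_all is a stand-in for the actual feature mask, not the repo's exact shape):

```python
ignore_mask = (~drop_mask).long()           # L.137: invert the Top-p mask
keep_indices = ignore_mask.nonzero()[:, 0]  # L.138: indices of samples OUTSIDE Top-p

feature_dim = 16                            # stand-in for the real feature shape
mask_all = torch.ones(batch_size, feature_dim)
mask_all[:, :4] = 0                         # pretend the masking zeroed some features
mask_all[keep_indices, :] = 1               # revert the masking for non-Top-p samples
# Net effect: the masking stays in force only for the Top-p samples.
```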