Question about Probability Delta Attack
Hi,
Thanks for the great work.
I have a general question about Prob-Delta-Attack in the paper.
While the respective paragraph describes it as targeting tokens whose probability RISES and then FALLS across consecutive layers (which is intuitive), I am not sure why you are using top_k(D_{l+1} - D_{l}). Doesn't this represent the set of tokens whose probabilities FALL and then RISE within the progression of the logit lens vocab distribution?
I might be missing something and I appreciate your comment on this.
Thanks
Hi,
top_k(D_{l+1} - D_{l}) gives the set of tokens whose probability rises the most when going from layer l to layer (l+1), since the difference is largest where the layer-(l+1) probability most exceeds the layer-l probability. Similarly, bottom_k(D_{l+1} - D_{l}) gives the set of tokens whose probability falls the most when going from layer l to layer (l+1). We observe that in the unedited model, the probability of the target token rises and remains high until the final layer. In the model edited using the deletion objective, the probability rises and then falls to a low value in the final layers, which causes the model not to generate the token. So the probability of the target tokens first rises rapidly across a few consecutive layers and then falls across a few later consecutive layers before the final layer.
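To make the sign convention concrete, here is a minimal NumPy sketch on toy numbers. It is only an illustration, not the code from the paper: the helper names `top_k_rise` and `bottom_k_fall`, the four-token vocabulary, and the probability values are all made up for this example.

```python
import numpy as np

def top_k_rise(d_next, d_cur, k):
    """Indices of the k tokens whose probability rises most from layer l to l+1."""
    delta = d_next - d_cur            # positive entry = probability went up
    return np.argsort(-delta)[:k]     # largest positive differences first

def bottom_k_fall(d_next, d_cur, k):
    """Indices of the k tokens whose probability falls most from layer l to l+1."""
    delta = d_next - d_cur            # negative entry = probability went down
    return np.argsort(delta)[:k]      # most negative differences first

# Hypothetical logit-lens distributions: rows are layers, columns are tokens.
# Token 2 plays the role of the target token in an edited model:
# it rises between layers 0 and 1, then falls between layers 2 and 3.
D = np.array([
    [0.40, 0.30, 0.10, 0.20],
    [0.25, 0.20, 0.45, 0.10],
    [0.20, 0.15, 0.55, 0.10],
    [0.45, 0.30, 0.05, 0.20],
])

rising = top_k_rise(D[1], D[0], k=1)      # layer 0 -> 1
falling = bottom_k_fall(D[3], D[2], k=1)  # layer 2 -> 3

print(rising.tolist())   # [2]
print(falling.tolist())  # [2]
```

In this toy case the same token shows up in the top-k set at an early layer transition and in the bottom-k set at a later one, which is the rise-then-fall pattern described above.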
Hope this makes it clear!