[Summary] Add internals-based feature attribution methods
gsarti opened this issue · 15 comments
🚀 Feature Request
The following is a non-exhaustive list of attention-based feature attribution methods that could be added to the library:
Notes:
- Add the possibility to scale attention weights by the norm of value vectors, shown to be effective for alignment and encoder models (Ferrando and Costa-jussà '21, Treviso et al. '21)
- The ALTI+ technique is an extension of the ALTI method by Ferrando et al. '22 (paper, code) to Encoder-Decoder architectures. It was recently used by the Facebook team to detect hallucinated toxicity by highlighting toxic keywords paying attention to the source (NLLB paper, Figure 31).
- Attention Flow is very expensive to compute, but has proven SHAP guarantees for same-layer attribution, which is not the case for Rollout or other methods. Flow and rollout should be implemented as propagation methods rather than stand-alone approaches, since they are used by most attention-based attribution methods (a minimal rollout sketch follows these notes).
- GlobEnc corresponds roughly to Attention x Transformer Block Norm, but ignores the FFN part, which in the latter is incorporated via a localized application of Integrated Gradients with 0-valued baselines (the authors' default)
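A minimal sketch of the rollout-style propagation mentioned in the notes, assuming per-layer self-attention tensors as returned by 🤗 Transformers; the function name and interface are illustrative, not the library's API:

```python
import torch

def attention_rollout(attentions, add_residual=True):
    """Propagate per-layer attention into token-to-token relevance
    (Abnar & Zuidema, 2020).

    attentions: iterable of per-layer tensors of shape
        (batch, num_heads, seq_len, seq_len).
    Returns a tensor of shape (batch, seq_len, seq_len).
    """
    rollout = None
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=1)  # average over heads
        if add_residual:
            # Mix with the identity to account for residual connections,
            # then re-normalize rows so they sum to 1.
            eye = torch.eye(attn.size(-1), device=attn.device).unsqueeze(0)
            attn = 0.5 * attn + 0.5 * eye
            attn = attn / attn.sum(dim=-1, keepdim=True)
        # Chain with the relevance accumulated from the previous layers.
        rollout = attn if rollout is None else torch.bmm(attn, rollout)
    return rollout
```

Attention flow replaces the chained matrix products with a max-flow computation over the same layer-wise graph, which is where its much higher computational cost comes from.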
@gsarti I was reading through the Jain and Wallace paper earlier and looking at their code, but I am not entirely sure what you mean by Last-Layer and Aggregated attention in this context. It would be great if you could maybe give a short explanation of the methods you want to implement from this paper.
@lsickert The attention-based feature attribution methods you mention involve the simplest case of taking the average attention weight for every token across all model layers, or the attention weight for every token in the last model layer. They correspond to the baseline ("Raw") used in the Attention Flow & Rollout paper, which is in turn taken from Vig 2019.
🤗 Transformers currently supports extracting attention scores at every layer during inference, so these can easily be obtained without further steps.
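For reference, a rough sketch of how the "Raw" last-layer and aggregated variants could be computed from a 🤗 Transformers forward pass (the checkpoint is only an example, and target-side tokenization is simplified):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

src = tok("Attention is not explanation.", return_tensors="pt")
# In practice the target would be tokenized with the target tokenizer and
# shifted right; kept simple here for illustration.
tgt = tok("Aufmerksamkeit ist keine Erklärung.", return_tensors="pt")

out = model(**src, decoder_input_ids=tgt.input_ids, output_attentions=True)
# Tuple with one tensor per layer, each (batch, heads, tgt_len, src_len)
cross = out.cross_attentions

# "Aggregated" attention: average over all layers and heads.
aggregated = torch.stack(cross).mean(dim=(0, 2))  # (batch, tgt_len, src_len)
# "Last-layer" attention: average over heads of the final layer only.
last_layer = cross[-1].mean(dim=1)                # (batch, tgt_len, src_len)
```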
@gsarti Thanks a lot, that explains it. I was not sure if I was missing something from the Jain and Wallace paper, since they introduce their methods in the context of adversarial attention, so I got a bit confused about what should actually get implemented here.
@gsarti I think this concerns all attention methods, so I wanted to get your opinion on this before further implementing it:
To run the attention-based methods, we need the `output_attentions=True` parameter to be set when the model is initialized through Huggingface, but at the point where the model is initialized we do not yet know whether the method we are using is attention-based or not (at least if I understood the order in the code correctly).

As the current workaround, I can always set this manually in the `model_kwargs` parameter, but I feel like this is a bit unintuitive and prone to errors if you forget it. So we could either set this parameter to true by default, even if it might not be needed by all methods, or change the ordering of the `HuggingfaceModel` initialization a bit, so that we first check what kind of method we are using before initializing the model and then add any needed arguments to the model config through some kind of generic method on the respective classes. A third option would be to just check this later and throw an error if it is not set for the attention methods.
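For illustration, a sketch of the first option (the checkpoint name is only an example). Keyword arguments passed to `from_pretrained` that match config attributes are used to update the model config, and the flag can also be flipped on the config afterwards:

```python
from transformers import AutoModelForSeq2SeqLM

# Always request attention outputs at load time, so attention-based methods
# work regardless of which attribution method is selected later.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "Helsinki-NLP/opus-mt-en-de",  # example checkpoint
    output_attentions=True,
)

# Equivalent for an already loaded model:
model.config.output_attentions = True
```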
Good point, I'd say returning the attention scores by default shouldn't be a problem, and it's probably the easiest way to ensure compatibility with other methods without dramatic changes to the library.
Ok then I will implement it like that for now
@gsarti How would you want to deal with the information from multiple attention heads? Across the different papers I have seen several approaches: using the information from a single head (usually the last one), averaging the values, or concatenating/merging them as is done inside the attention layer. From what I understand so far, this does not influence the actual methods, only which information is used inside of them. Making this configurable would also be an option, I think.
Additionally, since we are always dealing with encoder-decoder models at the moment, I am currently using the encoder and decoder self-attention outputs. Should I include the cross-attentions as well? I am not sure how to best include them in the outputs, though.
Wherever possible we want to make methods customizable, but with sensible defaults for those not interested in fiddling with them. For attention, the ideal setting would probably be to use the average across heads as the default, and let users choose between the average, the max, or a specific head index (`model.config` contains information about the number of available heads for HF models).
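A sketch of what such a configurable head aggregation could look like; the helper name and signature are hypothetical, not part of the library:

```python
import torch

def aggregate_heads(attn, mode="average", head_idx=None):
    """Reduce a (batch, num_heads, query_len, key_len) attention tensor
    to (batch, query_len, key_len)."""
    if mode == "average":
        return attn.mean(dim=1)
    if mode == "max":
        # Element-wise max across heads: a position is kept as relevant if it
        # was relevant for at least one head.
        return attn.max(dim=1).values
    if mode == "single":
        if head_idx is None:
            raise ValueError("head_idx must be set when mode='single'")
        return attn[:, head_idx]
    raise ValueError(f"Unknown aggregation mode: {mode}")
```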
Regarding which scores to use: to match the outputs obtained from other methods, we would need cross-attentions for the default source-to-target matrix and decoder attentions for the matrix generated when `attribute_target=True`. At the moment we are not interested in encoder self-attention, since it only reflects the mixing of source information needed to generate the contextual representations attended by the decoder (by means of cross-attention) during generation. As such, cross-attention conveys the totality of the information coming from the encoder, although the source embeddings it weights are likely to reflect more than simple token identity, thanks to the mixing mentioned above. To address this issue, more advanced attention-based methods like attention rollout, attention flow and ALTI+ use both cross-attentions and encoder self-attentions to trace back the entire saliency flow of source inputs.
In short, for simple attention attribution you will want to use only cross-attentions and decoder attentions, knowing that the former do not faithfully reflect what goes on in the encoder. For more advanced methods, encoder self-attentions will probably be needed in the overall computation of source-to-target saliency.
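In code, the mapping described above would roughly look as follows (last-layer, head-averaged scores shown for brevity; `model`, `src` and `tgt` as in the earlier snippet):

```python
out = model(**src, decoder_input_ids=tgt.input_ids, output_attentions=True)

# Default source-to-target matrix: cross-attentions
# (decoder positions attending to encoder states).
source_scores = out.cross_attentions[-1].mean(dim=1)    # (batch, tgt_len, src_len)

# Target-to-target matrix returned when attribute_target=True:
# decoder self-attentions.
target_scores = out.decoder_attentions[-1].mean(dim=1)  # (batch, tgt_len, tgt_len)

# Encoder self-attentions (out.encoder_attentions) would only enter the
# computation for propagation methods like rollout, flow and ALTI+.
```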
Ok, thanks for the explanation. One follow-up question: How would you specify "max" in this context? Taking the head with the overall maximal attention values or using the max values for each token across all heads and therefore mixing the information from multiple heads?
The latter would probably make more sense, deeming relevant what was relevant at least for one head. I don't have an intuition for what to expect from the results though, so let's see what outputs look like and decide whether it's worth including.
@gsarti Sorry for all the questions, but there is another issue that came up:
Since we are using the `generate()` method, most models I have tested have a defined number of beams for beam search in their config, so their attention tensors currently have the shape `(batch_size*num_beams, heads, seq, seq)`. Do you know of any good method to extract only the beam path that the model ended up using from this tensor? I am unfortunately not too familiar with the beam search algorithm, so the rest here is mostly from my limited understanding.
I have looked into some of the other implementations, but unfortunately could not really find anything on this, as they either seemed to set the number of beams to 1 to use greedy search or did not use generation in their examples. Just using greedy search would be an easy solution here, but I feel it might reduce the overall quality of the results and might also lead to limitations if we accept pre-generated texts and attention scores for the attribution.
If there is no good way to extract this information, I can probably write an algorithm for it, but it would mean setting `output_scores=True` and manually traversing all possible paths to find out which beam was used at each token.
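To make the shape issue concrete, a small sketch of what `generate()` returns when beam search is active (`model` and `src` as in the earlier snippet; the exact per-step shapes are my understanding and may vary with caching and model type):

```python
gen = model.generate(
    **src,
    num_beams=4,
    output_attentions=True,
    return_dict_in_generate=True,
)
# gen.cross_attentions is a tuple over generation steps; each step holds a
# tuple over layers of tensors whose first dimension folds batch and beams.
step0_layer0 = gen.cross_attentions[0][0]
batch_size = src["input_ids"].size(0)
# Unfold the beam dimension so that individual beams can be indexed, but note
# that knowing which beam survived at each step still requires extra tracking.
per_beam = step0_layer0.view(batch_size, 4, *step0_layer0.shape[1:])
```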
This is an important point. In the current gradient attribution approaches, if the user does not provide a target output, we first generate the output sentence using whatever strategy is supported by the model we're using (e.g. a custom sampling K and beam size). Then, the generated outputs are taken one after the other as targets for the attribution process. By doing this, we don't need to worry about the strategy that was used to obtain the output: we simply attribute it as if it was produced by greedy decoding.
Concretely, the same can be done with attention. Provided the output that we want to explain, either by forcing it as input of `attribute` or by letting the model pre-generate it, we do not need a decoding strategy; we just consider the model's attention scores for step `n` given the `n-1` preceding ones, corresponding de facto to a greedy decoding in which the output is already known.
We don't lose any generality in the process, since we are only interested in explaining a specific generation, not the whole range of potential outputs. The latter can anyway be achieved very simply by considering the top K beam outputs and using each of them for forced decoding separately, effectively obtaining rationales for each individual sentence.
Hope it helps and it's clear enough!
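A self-contained sketch of that idea: once the target is known (forced or pre-generated), a single teacher-forced forward pass yields attention scores for every step at once, with no decoding strategy involved. Checkpoint and sentences are just examples, and target preparation is simplified:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

src = tok("Hello world!", return_tensors="pt")
# In practice these would be the (shifted-right) target ids, either forced by
# the user or pre-generated with model.generate().
tgt_ids = tok("Hallo Welt!", return_tensors="pt").input_ids

out = model(**src, decoder_input_ids=tgt_ids, output_attentions=True)
# Row n of the cross-attention matrix explains target step n given the n-1
# preceding tokens, exactly as in greedy decoding of a known output.
step_scores = out.cross_attentions[-1].mean(dim=1)  # (batch, tgt_len, src_len)
```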
Hmm, I am not sure I follow entirely. The main issue is that transformers gives me the attention scores of every beam at every step. If I understand it correctly now, to see which beam was chosen at each step I would need to use the `output_scores` parameter to also get the scores for each token at each step, and compare them with the token that was chosen to see which attention scores to use for attribution at that step. This is definitely better than my first idea, but it would mean that when the target value is provided as a parameter, we would need to tell the user to either use greedy decoding or provide us with the beam scores as well, so that we can figure out programmatically which beam corresponds to which token at each step.
I can try to explain the issue better at our next meeting, but for now you can simply proceed assuming that greedy decoding is used!
Further possible additions to Basic Attention methods:
- rename LastLayerAttention to single-layer attention and make the layer configurable (last layer by default)
- allow users to choose a range of layers to average over instead of always averaging over all layers (see the sketch at the end of this list)
- allow users to choose the maximal layer in single-layer attention
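A minimal sketch of the configurable layer selection suggested above; the helper name and interface are hypothetical:

```python
import torch

def aggregate_layers(attentions, layers=None):
    """Select attention scores from a configurable set of layers.

    attentions: tuple of per-layer tensors (batch, heads, query_len, key_len).
    layers: None to average over all layers, an int to pick a single layer
        (e.g. -1 for the last one, the current default), or a (start, end)
        tuple to average over a range of layers.
    """
    if isinstance(layers, int):
        return attentions[layers]
    selected = attentions if layers is None else attentions[layers[0]:layers[1]]
    return torch.stack(selected).mean(dim=0)
```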