lilt/alignment-scripts

*help* Something I want to confirm: is my understanding of this article right or not?

Closed this issue · 30 comments

Background
I need to ask for your help with some details, since I am having trouble reproducing the results of the method (Add+SGD) from the article "Adding Interpretable Attention to Neural Translation Models Improves Word Alignment".
My analysis
I have reproduced the method (Rand+SGD) mentioned in this article, so I think my error is caused by the forward pass initialization.
Article description
The article mentions: "Therefore, we run a forward pass of the complete Transformer network, extract the attention weights of the alignment layer and start the optimization process with these weights."
My understanding
My understanding is that the forward pass model includes the alignment layer and the Transformer structure, and I restore all parameters of the trained forward pass model. But the alignment result I get is 50%, which is lower than in your article.

I have reproduced the method (Rand+SGD) mentioned in this article, so I think my error is caused by the forward pass initialization.

Were you able to reproduce the results of the alignment head without optimization (Add)?

My understanding is that the forward pass model includes the alignment layer and the Transformer structure, and I restore all parameters of the trained forward pass model. But the alignment result I get is 50%, which is lower than in your article.

For debugging I would do the following:

  • Run the forward pass and extract the attention activations from the alignment layer.
  • Verify: Check what AER you get from those attention activations.
  • Initialize the Variable of the Attention Optimization network (which you want to optimize later) with the attention activations and run a forward pass without optimization. Output the variable containing the unchanged activations. (Unchanged because you did not run a backward pass.)
  • Verify: Verify that the AER does not change when you extract the attention activations from the variable above.
  • Run a single optimization step (that is, a backward pass where you only update the single attention Variable).
  • Verify: Extract the values from the optimized variable again and check if the AER improved.
  • Now you should be able to run the full process with multiple optimization steps (see the sketch below).
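
A minimal, self-contained sketch of these steps (TF 1.x style, matching the Variable/Tensor discussion later in the thread; the loss is a toy stand-in for the real translation loss, and all names are hypothetical):

import numpy as np
import tensorflow as tf  # assumes TF 1.x

# Hypothetical stand-in for the attention activations extracted from the
# forward pass of the trained network (batch 1, tgt_len 5, src_len 7).
extracted_attention = np.random.dirichlet(np.ones(7), size=(1, 5)).astype(np.float32)

# Store the activations in a Variable, because only Variables can be optimized.
attention_variable = tf.Variable(extracted_attention, name="attention_variable")

# Toy loss; in the paper the loss comes from the rest of the alignment
# layer and the translation objective.
loss = tf.reduce_sum(tf.square(attention_variable))

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# Only the attention Variable is updated; everything else stays fixed.
train_op = opt.minimize(loss, var_list=[attention_variable])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    unchanged = sess.run(attention_variable)   # no backward pass yet: values identical
    assert np.allclose(unchanged, extracted_attention)
    sess.run(train_op)                         # one optimization step
    optimized = sess.run(attention_variable)   # values changed: re-check the AER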

Thank you for your reply. I still have some problems understanding the terminology.

  • Verify: Check what AER you get from those attention activations.
  • the Attention Optimization network
  • Initialize the Variable of the Attention Optimization network (which you want to optimize later) with the attention activations and run a forward pass without optimization.

Do "those attention activations" refer to the alignment layer initialization method?
Does "the Attention Optimization network" mean the alignment layer plus the standard Transformer structure?
What does "running a forward pass without optimization" mean? Without optimization, the variables are usually not updated.

At present, my reproduction of Add+SGD gets nearly the same AER as Rand+SGD.

It's important to look at the forward pass and attention optimization as different steps. (The following uses the tensorflow concepts of Variables and Tensors.) In the forward pass your attention activations will be a Tensor; during attention optimization you have to store the attention activations in a Variable, because you can only optimize Variables. Therefore your implementation (in tensorflow) of the attention optimization network is slightly different from the Alignment Layer of the forward pass, as you have to store the attention activations in a Variable first and then update this Variable. In the forward pass the attention is the result of a calculation and thus a Tensor, and you cannot directly optimize Tensors.

When you look at Figure 1 in the arXiv paper: during the forward pass you run the whole network and extract A. When doing attention optimization you only evaluate the "Attention Optimization" subnetwork. A will be a Variable now, and for efficiency you would also cache V' and feed it to the Attention Optimization subnetwork.
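
A sketch of how that subnetwork could look (graph construction only, TF 1.x; d_model, the lengths, and the uniform initialization are hypothetical placeholders, and the final matmul stands in for whatever follows A in the alignment layer):

import numpy as np
import tensorflow as tf  # assumes TF 1.x

d_model, tgt_len, src_len = 64, 4, 6  # hypothetical dimensions

# V' is computed once in the forward pass, cached, and fed in at run time.
v_prime = tf.placeholder(tf.float32, [1, None, d_model], name="cached_v_prime")

# A is a Variable in this subnetwork; here initialized uniformly as a
# placeholder, in practice with the attentions extracted from the forward pass.
A = tf.Variable(np.full((1, tgt_len, src_len), 1.0 / src_len, dtype=np.float32), name="A")

# Only this small subnetwork is re-evaluated during attention optimization.
context = tf.matmul(A, v_prime)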

Do "those attention activations" refer to the alignment layer initialization method?

No, you get those attention activations by fetching the values of the attention vector of the alignment layer from the forward pass.

Does "the Attention Optimization network" mean the alignment layer plus the standard Transformer structure?

The Attention Optimization subnetwork is (a part of) the Alignment Layer. In contrast to the forward pass, the Alignment Layer does not calculate the attentions but stores them in a Variable, and only this Variable is optimized during backpropagation.

What does "running a forward pass without optimization" mean? Without optimization, the variables are usually not updated.

Yes, you run a forward pass through the whole Transformer. You will initialize the Variable A of the subnetwork Attention Optimization with the value extracted from the forward pass.

Thank you for your kind reply. I have read it many times and tried to understand both your reply and the article. I am a student, and I still do not understand some of the descriptions in your reply.

In your description:

during the forward pass you run the whole network and extract A.

My understanding is that the Transformer network performs the forward pass and the alignment layer contains trainable parameters that can be updated. I have to run the forward pass before attention optimization, then extract the weights in A. Finally, I start a new optimization that restores those weights into A while the Transformer parameters stay fixed?

In your article:

during inference we optimize the attention weights A of the sub-network attention optimization.

Usually, the inference step is a process to obtain predictions, with no optimization. So I am a little confused.

In your description, does "during the forward pass and during inference" mean fixing all other parameters besides the attention weights?

Usually, the inference step is a process to obtain predictions, with no optimization. So I am a little confused.

Given a trained network (which includes the trained Transformer and the trained alignment head), during attention optimization (which is done during the inference step in the paper) you only optimize the attention activations. In tensorflow you can do this by using the var_list parameter:

opt.minimize(self.loss, var_list=[self.attention_variable])

To optimize the attention activations you have to use a variable instead of a tensor (at least in tensorflow), which changes the implementation of your network slightly.

To optimize the attention activations you have to use a variable instead of a tensor (at least in tensorflow), which changes the implementation of your network slightly.

I thought the attention activations A refer to a trainable matrix, so I set it to trainable=True and all other parameters to trainable=False; would this method be OK?
Or are you trying to explain that the attention activation is the result of this equation?
[image of the attention equation]

the attention activation is the result of this equation?

Yes, also see equation (1) in the arXiv paper. During attention optimization, treat A in Figure 1 as a Variable and initialize this Variable with the result of equation (1).
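
For illustration, a sketch of that initialization, assuming here that equation (1) is the usual scaled dot-product attention softmax (all names and dimensions are hypothetical):

import numpy as np
import tensorflow as tf  # assumes TF 1.x

# Hypothetical query/key states of the alignment layer.
Q = tf.constant(np.random.randn(1, 4, 64).astype(np.float32))  # (batch, tgt_len, d_k)
K = tf.constant(np.random.randn(1, 6, 64).astype(np.float32))  # (batch, src_len, d_k)

# Assumed form of equation (1): scaled dot-product attention softmax.
A_tensor = tf.nn.softmax(tf.matmul(Q, K, transpose_b=True) / np.sqrt(64.0))

with tf.Session() as sess:
    A_init = sess.run(A_tensor)  # the value of the Tensor from the forward pass

# Only in the attention optimization subnetwork does A become a Variable,
# initialized with the value computed above.
A = tf.Variable(A_init, name="A")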

A is not a Variable in the model, and it is not a parameter. How can I initialize it? I am confused.

For Attention Optimization you have to define a slightly different network using the Variable $A$, also see: #3 (comment)

OK, now I am sure I understand what you want to express.

I have another thing to confirm: is A the only trainable parameter in attention optimization? Or is it all the parameters of the attention optimization part in Figure 1 of the arXiv paper?

I have another thing to confirm: is A the only trainable parameter in attention optimization?

Correct. See: #3 (comment)

A is the result of the attention, and the trainable matrix A depends on the batch size, source length and target length. So how do I choose which batch to extract A from and initialize it?

In other words, the result of the attention changes with each batch, while a Variable is a parameter with a fixed shape.

You can use a batch size of 1 for attention optimization and a Variable of shape (1, max_length, max_length): get the source and target length of the current example dynamically using tf.shape, and use A[:, :target_length, :source_length] to get a submatrix of A with the right dimensions.
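
A sketch of that setup (TF 1.x; the placeholders and max_length are hypothetical):

import tensorflow as tf  # assumes TF 1.x

max_length = 100  # hypothetical maximum sentence length over the data

# One fixed-shape Variable, reused for every example (batch size 1).
A = tf.get_variable("A", shape=(1, max_length, max_length))

# Hypothetical placeholders for the current sentence pair.
source_ids = tf.placeholder(tf.int32, [1, None])
target_ids = tf.placeholder(tf.int32, [1, None])

# Get the current lengths dynamically and slice out the submatrix
# with the right dimensions for this example.
source_length = tf.shape(source_ids)[1]
target_length = tf.shape(target_ids)[1]
A_current = A[:, :target_length, :source_length]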

I have tried and failed to get the results in your article: Rand+SGD and Add+SGD get the same result, 62.2%. You say to use max_length for the attention activation dimensions, but the A extracted from the forward pass has dynamically varying shapes. With this in mind, I store the max_src_length and max_tgt_length over all the training data and, during inference, initialize A[:, :target_length, :source_length] = extracted_A (where A.shape = [1, max_length, max_length]).

Please verify first if you get a reasonable AER when using the alignment layer without SGD (Add). This probably already fails in your case.

I have verified this and get a better training loss, so it is useful for optimizing the attention activations. But I have a problem of understanding: if I use A as a Variable and A holds the attention weights, then when I use the attention weights to get alignments after training, they will always be the same values for any sentence pair with equal source and target lengths. To my knowledge this is abnormal, since sentences of the same length have different structures and alignments.

When I use the attention weights to get alignments after training, they will always be the same values for any sentence pair with equal source and target lengths.

During training you do not use a Variable for the attentions. You can just extract the attention of the alignment head of the trained model, convert it to alignments and score these alignments.
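
For illustration, a minimal way to convert an attention matrix to hard alignments (an assumption, not necessarily the repository's exact method): link each target position to its highest-scoring source position.

import numpy as np

def attention_to_alignments(attention):
    # attention has shape (tgt_len, src_len); each row is one target word's
    # distribution over source words. Link each target word to its argmax.
    return {(int(np.argmax(row)), t) for t, row in enumerate(attention)}

# Example: 2 target words, 3 source words.
attn = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8]])
print(attention_to_alignments(attn))  # {(0, 0), (2, 1)} as (source, target) pairs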

You only use a variable when performing attention optimization. I will close this issue for now as this discussion did not progress anymore recently.

Sorry, my expression was inaccurate. This reproduction is very important to me; I have spent a lot of time on it and made no progress.

I have done the alignment layer training and the attention optimization. Let me restate it as follows:

The question is: after the attention optimization, the attention weights are the attention activations A, so A will be unrelated to the target sentence content and only related to the target length.

That is, if I did not misunderstand your earlier explanation. @thomasZen

@thomasZen Thank you for your kind help. I have reproduced your work, with the following results:

                                 AER (de-en)   AER (en-de)   Bidir
SGD, lr 0.001 (best, 6 steps)    25.6          30.2          23.5

@PlayDeep That's awesome, congratulations!

@thomasZen Help: is there any possible reason why my bidirectional AER is worse than yours? I have considered overfitting, and I have checked every checkpoint combination (I combine the alignment results of every en-de and de-en checkpoint pair).

When using scripts/combine_bidirectional_alignments.py, the method grow-diagonal worked better for this scenario than grow-diagonal-final; that could be a possible reason.

I have tested all the methods mentioned in this script, and every en-de and de-en checkpoint combination, and did not get better results.

Interesting, a common mistake after doing multiple SGD steps was that punctuation marks would always get aligned to other punctuation marks. For sentences with a different number of commas, that was not a useful behavior. I could imagine that by doing more (6 instead of 3) optimization steps these errors could appear more often.
If you use --source and --target in scripts/aer.py it will output the most common errors, looking at these is interesting and could be helpful.
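For example (a hypothetical invocation; the script's exact positional arguments may differ):

python scripts/aer.py reference.talp hypothesis.talp --source source.txt --target target.txt  # hypothetical file names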

@thomasZen Thank you for your insightful comments. I use SGD with learning rate 0.001, which differs from yours; that might be the main reason I need 6 steps instead of 3. I think my learning rate is lower than yours, and this might be a mismatch with the Transformer training parameters, because I use the default parameters from "Attention Is All You Need". I want to verify this guess.

There are two learning rates that can differ:

  • The learning rate for training the alignment layer
  • The learning rate when performing attention optimization

We tuned the learning rate and the number of optimization steps for attention optimization on the development set.