philipperemy/keras-attention

Questions on implementation details

felixhao28 opened this issue · 53 comments

Update on 2019/2/14, nearly one year later:

The implementation in this repo is definitely bugged. Please refer to my implementation in a reply below for a correction. My version has been working in our product since this thread started, and it outperforms both a vanilla LSTM without attention and the incorrect version in this repo by a significant margin. I am not the only one raising this question.

Both this repo and my version of attention are intended for sequence-to-one networks (although they can easily be tweaked for seq2seq by replacing h_t with the current state of the decoder step). If you are looking for a ready-to-use attention for sequence-to-sequence networks, check this out: https://github.com/farizrahman4u/seq2seq.

============Original answer==============

I am currently working on a text generation task and learned about attention from the TensorFlow tutorials. The implementation details seem quite different from your code.

This is how TensorFlow tutorial describes the process:

[images from the TensorFlow NMT tutorial showing the attention equations: attention weights (1), context vector (2), attention vector (3), and Luong's multiplicative score (4)]

If I understand it correctly, all learnable parameters of the attention mechanism are stored in W, which has a shape of (rnn_size, rnn_size) (rnn_size is the size of the hidden state). So first you need to use W to calculate the score of each hidden state from h_t and h_s, but I am not seeing h_t anywhere in your code. Instead, you apply a dense layer to all h_s, which means pre_act (Edit: h_t should be h_s in this equation) becomes the score in the paper. This seems wrong.
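
To make the shapes concrete, here is a minimal NumPy sketch of my reading of the tutorial's score/weights/context computation (illustrative only, not code from this repo):

import numpy as np

rnn_size, time_steps = 4, 7
W = np.random.randn(rnn_size, rnn_size)      # the learnable attention matrix
h_s = np.random.randn(time_steps, rnn_size)  # source hidden states
h_t = np.random.randn(rnn_size)              # current target/decoder state

score = h_s @ W @ h_t                        # score(h_t, h_s) = h_t^T W h_s, one value per timestep
alpha = np.exp(score) / np.exp(score).sum()  # attention weights (softmax over timesteps)
context = alpha @ h_s                        # context vector, shape (rnn_size,)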

In the next step you element-wise multiply the attention weights with the hidden states, as in equation (2). Equation (3) then somehow goes missing.

I noticed the tutorial is about a Seq2Seq (encoder-decoder) model while your code is a plain RNN. Maybe that is why your code is different. Do you have any source on how attention is applied to a non-Seq2Seq network?

Here is your code:

def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = Permute((2, 1))(inputs)
    a = Reshape((input_dim, TIME_STEPS))(a) # this line is not useful. It's just to know which dimension is what.
    a = Dense(TIME_STEPS, activation='softmax')(a)
    if SINGLE_ATTENTION_VECTOR:
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
        a = RepeatVector(input_dim)(a)
    a_probs = Permute((2, 1), name='attention_vec')(a)
    output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')
    return output_attention_mul


def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    attention_mul = Flatten()(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

I implemented my own version of attention + LSTM. Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine.

INPUT_DIM = 100
TIME_STEPS = 20
# if True, the attention vector is shared across the input_dimensions where the attention is applied.
SINGLE_ATTENTION_VECTOR = True
APPLY_ATTENTION_BEFORE_LSTM = False

ATTENTION_SIZE = 128

def attention_3d_block(hidden_states):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # _t stands for transpose
    hidden_states_t = Permute((2, 1), name='attention_input_t')(hidden_states)
    # hidden_states_t.shape = (batch_size, hidden_size, time_steps)
    # this line is not useful. It's just to know which dimension is what.
    hidden_states_t = Reshape((hidden_size, TIME_STEPS), name='attention_input_reshape')(hidden_states_t)
    # Inside dense layer
    # a (batch_size, hidden_size, time_steps) dot W (time_steps, time_steps) => (batch_size, hidden_size, time_steps)
    # W is the trainable weight matrix of attention
    # Luong's multiplicative style score
    score_first_part = Dense(TIME_STEPS, use_bias=False, name='attention_score_vec')(hidden_states_t)
    score_first_part_t = Permute((2, 1), name='attention_score_vec_t')(score_first_part)
    #            score_first_part_t         dot        last_hidden_state     => attention_weights
    # (batch_size, time_steps, hidden_size) dot (batch_size, hidden_size, 1) => (batch_size, time_steps, 1)
    h_t = Lambda(lambda x: x[:, :, -1], output_shape=(hidden_size, 1), name='last_hidden_state')(hidden_states_t)
    score = dot([score_first_part_t, h_t], [2, 1], name='attention_score')
    attention_weights = Activation('softmax', name='attention_weight')(score)
    # if SINGLE_ATTENTION_VECTOR:
    #     a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
    #     a = RepeatVector(hidden_size)(a)
    # (batch_size, hidden_size, time_steps) dot (batch_size, time_steps, 1) => (batch_size, hidden_size, 1)
    context_vector = dot([hidden_states_t, attention_weights], [2, 1], name='context_vector')
    context_vector = Reshape((hidden_size,))(context_vector)
    h_t = Reshape((hidden_size,))(h_t)
    pre_activation = concatenate([context_vector, h_t], name='attention_output')
    attention_vector = Dense(ATTENTION_SIZE, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)
    return attention_vector

The interface remains the same, except that you don't need the Flatten layer anymore:

def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    # attention_mul = Flatten()(attention_mul)
    output = Dense(INPUT_DIM, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

The results seem even better than your original implementation:

[image: training results of the modified model]

The process of building attention myself has brought me more questions than answers:

  1. What is SINGLE_ATTENTION_VECTOR? And how can you use K.mean as dimension reduction when all the parameters of a are defined in a Dense layer? Doesn't that just mean all the weight parameters get the same gradient for each batch, so they behave like a single parameter vector while wasting GPU memory on storing the full matrix? (See the sketch after this list for what I think that branch computes.)
  2. I understand your intuition behind APPLY_ATTENTION_BEFORE_LSTM, but that is not what attention is for, and you can achieve pretty much the same result by feeding the fixed-length input into a fully-connected layer and using its output as the input of an LSTM layer. "The data at index 10 is important" is not a good feature to learn through attention; the exact timestep index should be transparent to the attention mechanism.
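
For reference, here is a NumPy sketch of what I believe the SINGLE_ATTENTION_VECTOR branch computes (this is my reading of the repo code; shapes are illustrative):

import numpy as np

batch_size, input_dim, time_steps = 2, 3, 5

# 'a' after Dense(TIME_STEPS, activation='softmax'): one attention
# distribution over timesteps per input dimension.
a = np.random.rand(batch_size, input_dim, time_steps)
a = a / a.sum(axis=-1, keepdims=True)

# K.mean(x, axis=1): collapse the per-dimension distributions into one shared one.
shared = a.mean(axis=1)                                      # (batch_size, time_steps)

# RepeatVector(input_dim): copy the shared distribution back to every dimension.
a_single = np.repeat(shared[:, None, :], input_dim, axis=1)  # (batch_size, input_dim, time_steps)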

P.S. I have modified the get_data_recurrent function a little to produce one-hot data, as that is closer to my actual needs.

def get_data_recurrent(n, time_steps, input_dim, attention_column=10):
    """
    Data generation. x is purely random except that its attention_column-th timestep equals the target y.
    In practice, the network should learn that the target = x[attention_column].
    Therefore, most of its attention should be focused on the value addressed by attention_column.
    :param n: the number of samples to retrieve.
    :param time_steps: the number of time steps of your series.
    :param input_dim: the number of dimensions of each element in the series.
    :param attention_column: the column linked to the target. Everything else is purely random.
    :return: x: model inputs, y: model targets
    """
    x = np.random.randint(input_dim, size=(n, time_steps))
    x = np.eye(input_dim)[x]
    y = x[:, attention_column, :]
    return x, y

Being confused about why attention can learn information about a specific index in the input sequence, I went on and read the official TensorFlow implementation. I was wrong about the attention_score_vec dense layer, which is a.k.a. the "memory layer" in the TF implementation. The weight matrix W is not of size (time_steps, time_steps) but rather (hidden_size, hidden_size), as shown here. The correct implementation should be:

def attention_3d_block(hidden_states):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # Inside dense layer
    #              hidden_states            dot               W            =>           score_first_part
    # (batch_size, time_steps, hidden_size) dot (hidden_size, hidden_size) => (batch_size, time_steps, hidden_size)
    # W is the trainable weight matrix of attention
    # Luong's multiplicative style score
    score_first_part = Dense(hidden_size, use_bias=False, name='attention_score_vec')(hidden_states)
    #            score_first_part           dot        last_hidden_state     => attention_weights
    # (batch_size, time_steps, hidden_size) dot   (batch_size, hidden_size)  => (batch_size, time_steps)
    h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
    score = dot([score_first_part, h_t], [2, 1], name='attention_score')
    attention_weights = Activation('softmax', name='attention_weight')(score)
    # (batch_size, time_steps, hidden_size) dot (batch_size, time_steps) => (batch_size, hidden_size)
    context_vector = dot([hidden_states, attention_weights], [1, 1], name='context_vector')
    pre_activation = concatenate([context_vector, h_t], name='attention_output')
    attention_vector = Dense(128, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)
    return attention_vector

score_first_part is named that way because it is only the first part of the score computation (h_s multiplied by W); the dot product with h_t completes the score.

Surprisingly, even without any explicit information about the position in the sequence, the attention model still managed to learn the importance of the 10th element. Now I am super confused.

[image: attention weights concentrated on the 10th timestep]

My guess is that the LSTM somehow learned to "count" to 10 in its hidden state, and that "count" is captured by the attention. I will need to visualize the inner state of the LSTM to be sure.
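
In case anyone wants to check the same hypothesis, here is a minimal sketch for inspecting the per-timestep hidden states of the trained model m from the full code further below (the layer lookup and the correlation threshold are my own choices):

import numpy as np
from keras.layers.recurrent import LSTM
from keras.models import Model

# Expose the per-timestep hidden states of the LSTM inside the trained model `m`.
lstm_layer = next(l for l in m.layers if isinstance(l, LSTM))
state_model = Model(inputs=m.input, outputs=lstm_layer.output)

x, _ = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
states = state_model.predict(x)[0]  # (TIME_STEPS, lstm_units)

# Crude check of the "counting" hypothesis: look for units whose activation
# is almost perfectly correlated with the timestep index.
for unit in range(states.shape[1]):
    corr = np.corrcoef(np.arange(TIME_STEPS), states[:, unit])[0, 1]
    if abs(corr) > 0.95:
        print('unit %d looks like a counter (corr=%.2f)' % (unit, corr))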

An interesting finding is how the attention is learned over the course of training:

[animation: attention weights evolving over training epochs]

Full code (except attention_3d_block), shown here just for reference:

import numpy as np

from keras.layers import Input, concatenate, dot
from keras.layers.core import *
from keras.layers.recurrent import LSTM
from keras.models import *

from attention_utils import get_activations, get_data_recurrent

INPUT_DIM = 100
TIME_STEPS = 20
# if True, the attention vector is shared across the input_dimensions where the attention is applied.
SINGLE_ATTENTION_VECTOR = True
APPLY_ATTENTION_BEFORE_LSTM = False


def attention_3d_block(hidden_states):
    # same as above


def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    # attention_mul = Flatten()(attention_mul)
    output = Dense(INPUT_DIM, activation='sigmoid', name='output')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

if __name__ == '__main__':

    N = 300000
    # N = 300 -> too few = no training
    inputs_1, outputs = get_data_recurrent(N, TIME_STEPS, INPUT_DIM)

    if APPLY_ATTENTION_BEFORE_LSTM:
        m = model_attention_applied_before_lstm()
    else:
        m = model_attention_applied_after_lstm()

    m.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    print(m.summary())

    m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)

    attention_vectors = []
    for i in range(10):
        testing_inputs_1, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
        activations = get_activations(m, testing_inputs_1, print_shape_only=True, layer_name='attention_weight')
        attention_vec = np.mean(activations[0], axis=0).squeeze()
        print('attention =', attention_vec)
        assert abs(np.sum(attention_vec) - 1.0) < 1e-5
        attention_vectors.append(attention_vec)

    attention_vector_final = np.mean(np.array(attention_vectors), axis=0)
    # plot part.
    import matplotlib.pyplot as plt
    import pandas as pd

    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.show()

I actually think you were trying to implement self-attention, which is used in text classification. But even then, the weight matrix should be of size (hidden_size, hidden_size) instead of (time_steps, time_steps).

@felixhao28 why do you use the layer named "last_hidden_state"?

@Wangzihaooooo Because attention was first introduced in a sequence-to-sequence model, where the attention score is computed based on both h_t and all h_s. In a language/classification model (sequence-to-one), we don't have an h_t representing the information of the current output Y. Therefore I just used the last hidden state as h_t.

To be fair, you can totally remove h_t from the score computation, which then just becomes score = W * h_s. That is essentially self-attention. It differs from traditional attention in that self-attention only scores how globally important a hidden state is, without any information about the current state of the LSTM.
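
To illustrate, here is one possible Keras sketch of that h_t-free variant (a learned scoring of each hidden state; the names are mine and this is not the code of this repo):

from keras import backend as K
from keras.layers import Dense, Activation, Lambda, dot

def self_attention_block(hidden_states):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    # Score each timestep from the hidden state alone; no h_t is involved.
    score = Dense(1, use_bias=False, name='self_attention_score')(hidden_states)      # (batch, time_steps, 1)
    score = Lambda(lambda x: K.squeeze(x, -1), name='self_attention_squeeze')(score)  # (batch, time_steps)
    attention_weights = Activation('softmax', name='self_attention_weight')(score)
    # Weighted sum of the hidden states => context vector.
    context_vector = dot([hidden_states, attention_weights], axes=[1, 1], name='self_context_vector')
    return context_vector  # (batch, hidden_size)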

@felixhao28 thank you, I learned a lot from your code.

@felixhao28 thank you very much. This is very well explained and removes the complexity around the attention layer. I implemented the code inline for a Seq2Seq model and was able to grab the attention matrix directly. Thanks once again for your help.

Regards
Rajeev

@felixhao28 I'm a bit confused in this part of the code:

attention_vectors = []
    for i in range(10):
        testing_inputs_1, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
        activations = get_activations(m, testing_inputs_1, print_shape_only=True, layer_name='attention_weight')

get_activations() effectively passes testing_inputs_1 through the layer 'attention_weight' and outputs the softmax probabilities. However, you are passing the raw input without making them pass through the LSTM first; is that on purpose? If so, can you explain why? Since in the model the inputs to the attention layers are the output of the LSTM layer(s), I would expect to have to do the same here.

Thanks!

you are passing the raw input without making them pass through the LSTM first

The input does pass through the LSTM first. A Layer is an abstract description of how a tensor should be computed, not the actual tensor being computed. The relationship is more like "class" vs. "instance", if you are familiar with OOP.

The output is the actual tensor (instance) of the attention_weight layer, which has already been connected to the previous tensors (the computational graph) by attention_weights = Activation('softmax', name='attention_weight')(score). It is not this specific tensor that takes testing_inputs_1; it is the computational graph, which begins at inputs = Input(shape=(TIME_STEPS, INPUT_DIM,)).
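
For readers without attention_utils at hand, a hypothetical minimal helper along the same lines (not necessarily identical to the get_activations used here) could look like this:

from keras import backend as K

def get_layer_output(model, model_inputs, layer_name):
    # Build a function over the full computational graph (starting at the model's
    # Input) and evaluate only the named layer's output tensor.
    layer = model.get_layer(layer_name)
    f = K.function([model.input, K.learning_phase()], [layer.output])
    return f([model_inputs, 0])[0]  # 0 = test phase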

@felixhao28 I see, thanks for the explanation!

I implemented my own version of attention + LSTM. Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine.


Hi, can you clarify what you mean by "Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine."

@farahshamout Here is a rather complete explanation of attention in a sequence-to-sequence model. The original idea of attention uses the output of the decoder as h_t, representing the "current decoding state". If you think of the "many-to-one" problem as a special case of the "many-to-many" problem, h_t becomes the last hidden state of the encoder.

@felixhao28 I see, thanks!

Hi, I was trying to use your implementation, but I would like to save an attention heat map during training (once per epoch). I tried adding return attention_vector, attention_weights but it is not what I wanted.
Do you have any suggestions?

@Bertorob I assume you added attention_weights to the outputs of the model. Sadly, there is a limitation in Keras that every output needs to be paired with a "ground-truth y" and evaluated by a loss function. So if you intend to collect attention_weights for every batch, you need to provide an empty but same-sized numpy array as the second "ground-truth y" in model.fit, and a custom loss function for attention_weights that always returns 0.

If you only need the attention heat map once per epoch instead of once per batch, replacing the single model.fit call with your own loop (model.train_on_batch, or per-epoch model.fit calls as shown below) is all you need.
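
Here is a minimal sketch of the per-batch variant described above, assuming attention_3d_block is modified to also return its attention_weights tensor and reusing the names from the earlier snippets (the dummy loss and target are mine):

import numpy as np
from keras import backend as K
from keras.models import Model

# attention_vector, attention_weights = attention_3d_block(lstm_out)  # modified block
m = Model(inputs=[inputs], outputs=[output, attention_weights])

def zero_loss(y_true, y_pred):
    # Dummy loss so Keras accepts the extra output without affecting training.
    return K.mean(0.0 * y_pred, axis=-1)

m.compile(optimizer='adam',
          loss=['categorical_crossentropy', zero_loss],
          loss_weights=[1.0, 0.0])

# Same-sized placeholder "ground-truth y" for the attention_weights output.
dummy_y = np.zeros((len(inputs_1), TIME_STEPS))
m.fit([inputs_1], [outputs, dummy_y], epochs=1, batch_size=64)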

@felixhao28 Thank you for the answer. However, if I want to plot the attention after training, I suppose I don't need to add the second "ground-truth y", but I don't get how you are able to do it. Could you please explain how you do that?

@Bertorob

This part of the code calculates the attention heat map:

    attention_vectors = []
    for i in range(10):
        ... # lines omitted
    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.show()

The attention_weights are not fetched during training; this code isn't run until after model.fit.

m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)

You can see that the line above runs just one epoch. If you create a loop around it and change plt.show to plt.savefig, you get a series of images of the attention weights. Ultimately the code looks like this:

for epoch_i in range(n_epochs):
    m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)
    attention_vectors = []
    for i in range(10):
        ... # lines omitted
    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.savefig(f'attention-weights-{epoch_i}.png')

Edit: here I am still using model.fit instead of model.train_on_batch because the data is really small and constant across epochs. In reality, though, you might want to use model.train_on_batch for better flexibility.
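
For completeness, a sketch of the model.train_on_batch variant (you handle the batching and shuffling yourself):

n_epochs = 10  # or however many epochs you want
batch_size = 64
n_batches = len(inputs_1) // batch_size
for epoch_i in range(n_epochs):
    for b in range(n_batches):
        x_batch = inputs_1[b * batch_size:(b + 1) * batch_size]
        y_batch = outputs[b * batch_size:(b + 1) * batch_size]
        m.train_on_batch(x_batch, y_batch)
        # per-batch inspection of the attention weights can go here
    # per-epoch plotting (as in the loop above) can go here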

OK, I'm figuring something out. One last question: I tried something like this:

att_weights = []
for i in range(10):
   activations = get_activations(mymodel,np.reshape(x,(1,100,30)),print_shape_only=True,layer_name='attention_weight')
   attention_vec = np.mean(activations[0], axis=0).squeeze() 
   print('attention =', attention_vec)
   assert (np.sum(attention_vec) - 1.0) < 1e-5
   att_weights.append(attention_vec) 
attention_vector_final = np.mean(np.array(att_weights),axis=0)

where x is my input. I do get my attention vector, but it is filled with ones, so maybe I'm still doing something wrong. Why is there a 10 in the for loop?

EDIT: sorry, I had underestimated the relevance of return_sequences=True in the LSTM. Now I'm able to plot the attention map. @felixhao28 thank you!!!!

@felixhao28 Both this repo and my version of attention are intended for sequence-to-one networks (although they can easily be tweaked for seq2seq by replacing h_t with the current state of the decoder step).
Could you please show the details of implementing a seq2seq network? I would so appreciate that. Is it just a matter of setting return_sequences=True?

@LZQthePlane No, it is more complicated than that. The basic idea is to replace h_t with the current state of the decoder step. You might want to find a ready-to-use seq2seq attention implementation instead.
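
Just to sketch the idea (this is not the code of this repo or of farizrahman4u/seq2seq, and it only covers a single decoder step):

from keras.layers import Dense, Activation, concatenate, dot

def luong_attention(encoder_states, decoder_state):
    # encoder_states: (batch, time_steps, hidden_size)
    # decoder_state:  (batch, hidden_size), the decoder output at the current step,
    #                 used in place of the "last hidden state" h_t above.
    hidden_size = int(encoder_states.shape[2])
    score_first_part = Dense(hidden_size, use_bias=False)(encoder_states)   # (batch, T, H)
    score = dot([score_first_part, decoder_state], axes=[2, 1])             # (batch, T)
    attention_weights = Activation('softmax')(score)
    context_vector = dot([encoder_states, attention_weights], axes=[1, 1])  # (batch, H)
    pre_activation = concatenate([context_vector, decoder_state])
    return Dense(hidden_size, use_bias=False, activation='tanh')(pre_activation)

In a full seq2seq model you would call something like this at every decoder step (or use an attention wrapper that does it for you).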

Hi @felixhao28, Thank you so much for your code and explanations above.

I am new to attention and I want to use it after an LSTM for a classification problem. I understood the concepts of attention from this presentation [1] by Sujit Pal:
[1] https://www.slideshare.net/PyData/sujit-pal-applying-the-fourstep-embed-encode-attend-predict-framework-to-predict-document-similarity

I got confused after reading your code about the type of attention (the theory behind it and what it is called in papers). Does it compute an attention vector over an incoming matrix using a learned context vector?

hope you could help!

@felixhao28
Thank you so much for your code and explanation. I think it is quite right except for a slight problem. In my opinion, score_first_part shouldn't involve h_t, which means the input of the attention_score_vec layer shouldn't include h_t. What do you think?

@Goofy321 How do you calculate the attention score then?

@OmniaZayed My implementation is similar to AttentionMV in Sujit Pal's code except that ctx is the last hidden state.

@Goofy321 How do you calculate the attention score then?

I mean that the input of the attention_score_vec layer changes to hidden_states[:,:-1,:]. The calculation of the attention score is the same as yours.

@Goofy321 I think that works too.

@felixhao28: When I try to run your code I get the following error when calculating the score:

score = dot([score_first_part, h_t], axes=[2, 1], name='attention_score')

ValueError: Shape must be rank 2 but is rank 3 for 'attention_score/MatMul' (op: 'MatMul') with input shapes: [?,20,32], [?,32]

Currently I can't figure out why the dimensions don't match, any idea? Did anyone else experience the same issues?

@patebel the shape of h_t should be (batch_size, hidden_size, 1); you are missing the final "1" dimension. Keras used to reshape the output of a Lambda layer to your declared output shape; maybe adding h_t = Reshape((hidden_size, 1))(h_t) will fix it.

@felixhao28 Oh yes, I didn't recognize, thank you!

Hi @felixhao28, thanks for your insights and helpfulness in this issue! Reading the original paper by Bahdanau et al. and comparing the operations to this repository, I was really confused until I saw this.
I have a question for you and other people in this thread. I have a language model that gets fed a sequence of length 50 in batches of 32 and tries to predict the next token, where the vocabulary size is 35. Hence it is a many-to-one application for text generation. Below is the version that generates sensible output.

[image: model summary of the version without attention]

However, when I apply the attention layer as you have suggested, before the final dense layer for prediction and with an attention size of 256, I get extremely gibberish output, with certain letters repeated back to back in a nonsensical way. Below is that version.

[image: model summary of the version with the attention layer]

Any ideas why this approach fails? I have also tried without stacking LSTM layers, and it still fails. The only thing I can think of is that the token level for this language model is characters, whereas I have mostly seen attention applied to word-level language models. Any help will be appreciated!

UPDATE: Solved it, turns out I didn't set one of the Dense layers to be trainable.

@felixhao28 and others: when I run the example, the activation weights are either all 1 (which I don't understand because it shouldn't be possible by definition) or NaN (which I don't understand either :D). Did anyone else experience this behavior?

Hi @patebel, try to use the squeeze layer after the score vector:
from keras import backend as K
from keras.layers import Lambda
score = dot([score_first_part, h_t], [2, 1], name='attention_score')
score = Lambda(lambda x: K.squeeze(x, -1))(score)

It could be that the dimension of your score before applying the softmax function is (None, time_steps, 1) when it should be (None, time_steps).

@Labaien96 Thanks for the super fast reply, you were right!
If anyone else is experiencing the same issue: after squeezing the score and feeding it to "attention_weights" you need to Reshape "attention_weights" like the following to be able to compute the context_vector:
attention_weights = Reshape((attention_weights.shape[1], 1))(attention_weights)

@felixhao28 Thanks for your well-documented code and clarifications on the theory and implementation of the attention mechanism.
I'm using this code for a similar problem, and obviously the model requires training for more than 1 epoch. However, after training passes ~25 epochs, the loss becomes NaN. Since there is no problem with the data, I think it might be the model architecture; I followed the commonly recommended online solutions for this kind of issue but couldn't solve it. Did anyone else experience this behavior?

@fmehralian Try gradient clipping. You can use clipnorm and clipvalue. I experienced exploding gradients and was able to solve them with clipping.
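
For example (the clipping values here are just a starting point, not tuned for your problem):

from keras.optimizers import Adam

# clipnorm rescales any gradient whose L2 norm exceeds 1.0;
# clipvalue caps every gradient element at +/-0.5. Either one alone is often enough.
opt = Adam(clipnorm=1.0, clipvalue=0.5)
m.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])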

Hi I have two questions,

  1. from attention_3d_block()

attention_vector = Dense(128, use_bias=False, activation='tanh',name='attention_vector')(pre_activation)

for this line, the output size is 128. Is this based on something, or is it arbitrary / based on intuition?

  2. from model_attention_applied_after_lstm()

output = Dense(INPUT_DIM, activation='sigmoid', name='output')(attention_mul)

for this line, must the number of output units be the same as INPUT_DIM? Doesn't this defeat the purpose of activation='sigmoid'?

If @felixhao28 or anyone else could help, it would be much appreciated.

@junhuang-ifast

  1. Yes, 128 is just a hyper-parameter you can fine-tune later.
  2. attention_mul has the size BATCH_SIZE * ATTENTION_SIZE, so the output has the size BATCH_SIZE * INPUT_DIM. It means that for each sample, the output gives a probability for each of the INPUT_DIM categories, with the activation applied along the last dimension. This is not part of the attention mechanism but rather a typical multi-class classification output.

@felixhao28 thanks for the quick response. I have one other question regarding

I implemented my own version of attention + LSTM. Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine.

which many have already asked you about.

If we take only the last hidden state, would that be, in a way, saying that we are focusing on one specific part (the last part in this case) of the LSTM output to solve the many-to-one problem? What if, however, the intuition were that the whole input sequence is important for predicting the one output; would it be more suitable to use the mean along the time axis instead?

so something like

h_t = Lambda(lambda x: tf.reduce_mean(x, axis=1), output_shape=(unit,), name='mean_hidden_state')

PS: using the mean is just an example; it could be any other function depending on the problem.

@felixhao28 thanks a ton for your useful comments! I haven't had time to work on this repo since then. I was pretty new to deep learning when I wrote it. I'm going to invest some time to integrate some of your suggestions and fix the things that need to be fixed :)

@junhuang-ifast In my application I was using attention in a sequence prediction model, which only focuses on the very next token in the sequence. Taking only the last hidden state worked fine due to the local nature of sequences.

I am not an expert on applications other than sequence prediction. But if I had to guess, you can omit h_t altogether (for example h_t = I, the identity matrix). This will produce a self-attention vector.

Averaging all hidden states feels strange, because by using attention you are assuming that not all elements in the sequence are equal. It is attention's job to figure out which ones are more important and by how much; using the mean of all states erases that difference. Unless there is global information that differs per sequence, hidden in each element, which you want to sum up, I don't feel averaging is the way to go. I might be wrong, though.

@philipperemy No problem. We are all learning it as we discuss it.

@felixhao28 just to be clear, when you say

h_t = I, identity matrix

would that be equivalent to not calculating h_t or the first dot product, i.e.

h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
score = dot([score_first_part, h_t], [2, 1], name='attention_score')

and just letting score = score_first_part ?

@felixhao28 Do you have a link to the paper for the attention described in the TensorFlow tutorial?

@philipperemy the original link is gone but I think they are:
https://arxiv.org/abs/1409.0473
and
https://arxiv.org/abs/1508.04025

Actually, there are three different versions of attention. felixhao28's version is called global attention and philipperemy's version is called self-attention. The remaining one is called local attention, which is a little different from global attention.

I updated the repo with all the comments of this thread. Thank you all!

Actually, there are three different versions of attention. felixhao28's version is called global attention and philipperemy's version is called self-attention. The remaining one is called local attention, which is a little different from global attention.

Do you know a good implementation for local attention?

@philipperemy @felixhao28

Do you know how I can apply the attention module to a 2D-shaped input? I would like to apply attention after the LSTM layer:

Layer (type)                    Output Shape         Param #     Connected to                     
features (InputLayer)           (None, 16, 1816)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 2048)         31662080    features[0][0]                   
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1024)         2098176     lstm_1[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU)       (None, 1024)         0           dense_2[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 120)          123000      leaky_re_lu_2[0][0]              
__________________________________________________________________________________________________
feature_weights (InputLayer)    (None, 120)          0                                            
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 120)          0           dense_3[0][0]                    
                                                                 feature_weights[0][0]            

Total params: 33,883,256
Trainable params: 33,883,256
Non-trainable params: 0
__________________________________________________________________________________________________

Would really appreciate your suggestions on how to modify attention_3d_block to make it work for a 2D input as well. Thanks.

@raghavgurbaxani I answered you in your thread.

Hi @philipperemy and @felixhao28. I am trying to apply the attention model on top of an LSTM, where my input training data is an nd array. How should I fit my model in this case? I get the following error because my data is an nd array:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).

What changes should I make? Would appreciate your help! Thank you

Hi, thanks for all of the users' comments; I have learned a lot from them. But can I ask a question? If we use an RNN (or some variant of it), we get the hidden states at each time step, which can then be used to compute the score. But what if I don't use an LSTM as the encoder and instead use a 1D CNN as the encoder; what should I do to apply attention? For example, I would like to handle some textual messages, so I first used an embedding layer and then a Conv1D layer. Is there a method I can use to apply the attention mechanism to my model? Thanks so much.