tsterbak/keras_attention

One-to-one Keras model with attention #25


Hello,
I have a Keras model that takes a sequence of inputs and produces a sequence of outputs, where each input has an associated output (label), for example part-of-speech (POS) tagging.

Seq_in[0][0:3]
array([[15],
       [28],
       [23]])

Seq_out[0][0:3]
array([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]],
      dtype=float32)
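In case it helps to reproduce, here is dummy data with the same shapes. Only TIME_STEPS=500 and the 15 tag classes come from my setup; the number of sentences and the vocabulary size are made up:

import numpy as np

# Dummy data with the shapes I am using:
# Seq_in : (num_sentences, 500, 1)  -> integer word indices, padded to 500 time steps
# Seq_out: (num_sentences, 500, 15) -> one-hot POS tags, 15 classes
num_sentences = 100
Seq_in = np.random.randint(0, 30, size=(num_sentences, 500, 1))          # vocabulary size 30 is made up
Seq_out = np.eye(15, dtype='float32')[np.random.randint(0, 15, size=(num_sentences, 500))]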

I want to build attention on top of the LSTM layers. I am following "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification" (Zhou et al., 2016).
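For reference, this is the attention formulation from that paper as I understand it (H is the matrix of BiLSTM output states, w is a learned parameter vector); these are the equations I refer to in the code comments below:

$$M = \tanh(H)$$
$$\alpha = \mathrm{softmax}\left(w^{T} M\right)$$
$$r = H \alpha^{T}$$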

from keras.models import Model
from keras.layers import (Input, Dense, Dropout, LSTM, Bidirectional, Activation,
                          Flatten, RepeatVector, Permute, Lambda, multiply, concatenate)
from keras import backend as K
from keras import optimizers
from sklearn.model_selection import train_test_split

X_train, X_val, Y_train, Y_val = train_test_split(Seq_in, Seq_out, test_size=0.20)

TIME_STEPS = 500
INPUT_DIM = 1
lstm_units = 256

inputs = Input(shape=(TIME_STEPS, INPUT_DIM))
activations = Bidirectional(LSTM(lstm_units, return_sequences=True))(inputs)       # first bidirectional layer
activations = Dropout(0.2)(activations)
activations = Bidirectional(LSTM(lstm_units, return_sequences=True))(activations)  # second bidirectional layer
activations = Dropout(0.2)(activations)
attention = Dense(1, activation='tanh')(activations)  # equation (9) in the paper: squash each output state vector to a scalar
attention = Flatten()(attention)
attention = Activation('softmax')(attention)           # equation (10) in the paper
attention = RepeatVector(512)(attention)               # repeat the softmax vector to match the output state dimension (512)
attention = Permute([2, 1])(attention)                 # permute back to (time steps, 512)
sent_representation = multiply([activations, attention])                            # multiply the attention weights with the output states element-wise
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-1))(sent_representation)  # summation of all output state vectors
sent_representation = RepeatVector(TIME_STEPS)(sent_representation)                 # repeat to match the number of time steps
sent_representation = concatenate([activations, sent_representation])               # concatenate the sentence representation to the output states
output = Dense(15, activation='softmax')(sent_representation)                       # softmax over the 15 labels for each time step

model = Model(inputs=inputs, outputs=output)
sgd = optimizers.SGD(lr=.1, momentum=0.9, decay=1e-3, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=2, validation_data=(X_val, Y_val), verbose=1)

Here is the model summary:

Layer (type) Output Shape Param # Connected to
input_1 (InputLayer) (None, 500, 1) 0

bidirectional_1 (Bidirectional) (None, 500, 512) 528384 input_1[0][0]

dropout_1 (Dropout) (None, 500, 512) 0 bidirectional_1[0][0]

bidirectional_2 (Bidirectional) (None, 500, 512) 1574912 dropout_1[0][0]

dropout_2 (Dropout) (None, 500, 512) 0 bidirectional_2[0][0]

dense_1 (Dense) (None, 500, 1) 513 dropout_2[0][0]

flatten_1 (Flatten) (None, 500) 0 dense_1[0][0]

activation_1 (Activation) (None, 500) 0 flatten_1[0][0]

repeat_vector_1 (RepeatVector) (None, 512, 500) 0 activation_1[0][0]

permute_1 (Permute) (None, 500, 512) 0 repeat_vector_1[0][0]

multiply_1 (Multiply) (None, 500, 512) 0 dropout_2[0][0]
permute_1[0][0]

lambda_1 (Lambda) (None, 500) 0 multiply_1[0][0]

repeat_vector_2 (RepeatVector) (None, 500, 500) 0 lambda_1[0][0]

concatenate_1 (Concatenate) (None, 500, 1012) 0 dropout_2[0][0]
repeat_vector_2[0][0]

dense_2 (Dense) (None, 500, 15) 15195 concatenate_1[0][0]
Total params: 2,119,004
Trainable params: 2,119,004
Non-trainable params: 0

I think this code does what the paper describes, except that the concatenate step merges the same attention-weighted sentence representation into every output state vector, so the attention weights do not change for each time step, i.e. for each output label.
So I think that, for each time-step output, I have to do something so that the attention weights differ (see the sketch below). Am I right?
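To make the question concrete, something like the following is what I have in mind: per-time-step dot-product attention over the BiLSTM states, so every position gets its own weight distribution. This is only an illustrative sketch of mine, not from the paper or this repo; the Dot layer usage and the names scores/context/merged are my own, and it reuses the activations tensor from the model above:

from keras.layers import Dot

# pairwise scores between all output states: (batch, 500, 500)
scores = Dot(axes=-1)([activations, activations])
# softmax over the last axis -> one weight distribution per time step
weights = Activation('softmax')(scores)
# weighted sum of the states: a different context vector for every time step, (batch, 500, 512)
context = Dot(axes=(2, 1))([weights, activations])
# concatenate the per-step context with the output states and predict the label
merged = concatenate([activations, context])
output = Dense(15, activation='softmax')(merged)

Here every position attends over all positions instead of reusing one global sentence vector, which is the difference I was trying to describe.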
Any help is appreciated.
Thanks in advance!