PART F: Encoder-Decoder with Bahdanau & Luong Attention Mechanism

Welcome to Part F of the Seq2Seq Learning Tutorial Series. In this tutorial, we will design an Encoder-Decoder model to handle longer input and output sequences by using two global attention mechanisms: Bahdanau & Luong.

  • First, we will observe that the Basic Encoder-Decoder model will fail to handle long input sequences.
  • Then, we will discuss how to relate each output with all the inputs using the global attention mechanism.
  • We will implement the Bahdanau attention mechanism as a custom layer in Keras by using subclassing.
  • Then, we will integrate the attention layer into the Encoder-Decoder model to efficiently process longer sequences.
  • After observing the effect of the attention layer on performance, we will depict the attention between inputs and outputs.
  • Lastly, we will code the Luong attention.

During the tutorial, we will be using the Encoder-Decoder model developed in Part C.

If you would like to follow up on all Seq2Seq tutorials, please subscribe to my YouTube Channel or follow my blog on Medium. Do not forget to turn on notifications so that you will be notified when new parts are uploaded.

You can access the whole code on Colab.

If you are ready, let’s get started!

References:

“Neural Machine Translation by Jointly Learning to Align and Translate” by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

“Effective Approaches to Attention-based Neural Machine Translation” by Minh-Thang Luong, Hieu Pham, Christopher D. Manning

SEQ2SEQ LEARNING SERIES:

  • This is Part F of the Seq2Seq Learning series.
  • Please check out the previous parts, especially Part C, to refresh the necessary background knowledge and follow this part with ease.

You can access all my SEQ2SEQ Learning videos on the Murat Karakaya Akademi YouTube channel in ENGLISH or TURKISH.

You can access my blog on Medium

Part A: AN INTRODUCTION TO SEQ2SEQ LEARNING AND A SAMPLE SOLUTION WITH MLP NETWORK

Part B: SEQ2SEQ LEARNING WITH RECURRENT NEURAL NETWORKS (LSTM)

Part C: SEQ2SEQ LEARNING WITH A BASIC ENCODER DECODER MODEL

Part D: SEQ2SEQ LEARNING WITH AN ENCODER DECODER MODEL WITH TEACHER FORCING

Part E: SEQ2SEQ LEARNING WITH AN ENCODER DECODER MODEL WITH TEACHER FORCING FOR VARIABLE INPUT AND OUTPUT SIZE: MASKING & PADDING

Part F: SEQ2SEQ LEARNING WITH AN ENCODER DECODER MODEL + BAHDANAU & LUONG ATTENTION

You can watch the video version of this notebook on the Murat Karakaya Akademi YouTube channel for details.

WHY DO WE HAVE SO MANY PARTS?

  • We aim to code an Encoder-Decoder Model with Attention.
  • However, I would like to develop the solution by showing the shortcomings of other possible approaches.
  • Therefore, in the first 2 parts, we will observe that initial models have their own weaknesses.
  • We will also understand why the Encoder-Decoder paradigm is so successful.

So, please patiently follow the parts as we develop a better solution :)

Sample Problem:

In a real-life scenario:

  • We are given 2 parallel data sets X and y such that X[i] and y[i] have some relationship
  • For instance: we are given the same book’s text in English (X) and Turkish (y).
  • Thus, the statement X[i] in English is translated into Turkish as the statement y[i]
  • We use the parallel data set to train a seq2seq model which learns how to convert/transform X[i] into y[i]

Below, we will develop such an encoder-decoder model for fixed-size input and output sequences

The sample problem is to find the reverse of a given sequence

Given sequence X

X=[3, 2, 9, 1]

Output sequence (y) is the reversed input sequence (X)

y=[1, 9, 2, 3]

Configure the sample problem

#@title Configure problem

n_timesteps_in = 4   # each input sample has 4 values
n_features = 10      # each value is one-hot encoded into a 10-dimensional 0/1 vector
train_size = 2000
test_size = 200

For each input sequence (X), we select 4 random numbers between 1 and 9 (0 is reserved for the start symbol).

A sample X
X=[2, 9, 9, 2]

The output sequence (y) is the reversed input sequence (X). Note that this sample happens to be a palindrome, so y equals X.
y=[2, 9, 9, 2]

Each input and output sequence is converted to one-hot-encoded format in 10 dimensions:
X=[[0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 1]
[0 0 1 0 0 0 0 0 0 0]]
y=[[0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 1]
[0 0 1 0 0 0 0 0 0 0]]

The generated sequence datasets have the following shapes (batch_size, time_steps, features):
X_train.shape: (2000, 4, 10) y_train.shape: (2000, 4, 10)
X_test.shape: (200, 4, 10) y_test.shape: (200, 4, 10)
time: 79.7 ms
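
The data-generation helpers are not shown in this section (they live in the Colab notebook). Below is a minimal sketch that would produce datasets with the shapes above; one_hot_encode and one_hot_decode are the helper names used later in this part, while generate_dataset is a hypothetical name used here for illustration only.

import numpy as np

def one_hot_encode(sequence, n_features):
  # encode each integer in the sequence as a one-hot vector of length n_features
  encoding = np.zeros((len(sequence), n_features), dtype='float32')
  for i, value in enumerate(sequence):
    encoding[i, value] = 1.0
  return encoding

def one_hot_decode(encoded_seq):
  # invert one_hot_encode by taking the argmax of every time step
  return [np.argmax(vector) for vector in encoded_seq]

def generate_dataset(size, n_timesteps, n_features):
  # X: random integers in [1, n_features - 1] (0 is reserved for the start symbol)
  # y: the reversed sequence; both are one-hot encoded
  X, y = list(), list()
  for _ in range(size):
    seq_in = list(np.random.randint(1, n_features, n_timesteps))
    seq_out = seq_in[::-1]
    X.append(one_hot_encode(seq_in, n_features))
    y.append(one_hot_encode(seq_out, n_features))
  return np.array(X), np.array(y)

X_train, y_train = generate_dataset(train_size, n_timesteps_in, n_features)
X_test, y_test = generate_dataset(test_size, n_timesteps_in, n_features)
print('X_train.shape:', X_train.shape, 'y_train.shape:', y_train.shape)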

QUICK LSTM REMINDER

[Figure: an LSTM layer unrolled over 4 time steps, showing its hidden and cell state outputs]
  • An LSTM layer accepts a series of inputs. Here, the input [1, 2, 3, 4] has four time steps.
  • The LSTM layer processes the input step by step.
  • Time step 1: input 1 is processed and 3 outputs are generated: 2 hidden states (with the same values) and 1 cell state.
  • One hidden state is output and can be used for prediction or to connect to another LSTM layer.
  • The other hidden state and the cell state are used for the next time step.
  • At the end of the time steps (here, 4), the LSTM layer outputs:
  • each time step’s hidden state
  • the last time step’s hidden state
  • the last time step’s cell state

You can think of the above figure as a simple Encoder:

  • The Encoder processes the input and provides the context vector (the last time step’s hidden state + the last time step’s cell state) for the Decoder. Let’s see the details below.
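
As a quick, self-contained sanity check (with a made-up random input tensor standing in for real data), this sketch shows the three outputs and their shapes when return_sequences=True and return_state=True:

import tensorflow as tf

# a toy batch: 1 sample, 4 time steps, 10 features
x = tf.random.uniform((1, 4, 10))
lstm = tf.keras.layers.LSTM(16, return_sequences=True, return_state=True)
all_hidden_states, last_hidden_state, last_cell_state = lstm(x)

print(all_hidden_states.shape)   # (1, 4, 16): each time step's hidden state
print(last_hidden_state.shape)   # (1, 16):    last time step's hidden state
print(last_cell_state.shape)     # (1, 16):    last time step's cell state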

A BASIC LSTM-BASED ENCODER & DECODER MODEL

[Figure: a basic LSTM-based Encoder-Decoder model]

Encoder receives the encoder input data and:

  • processes it
  • outputs its last hidden state + last cell state as the context vector
  • transfers this context vector to the Decoder

Decoder:

  • Decoder’s initial states (hidden state + cell state) are the context vector sent by the encoder
  • Decoder’s initial input is a special symbol such as ‘START’

Then, the Decoder runs in a loop. At each time step, the Decoder:

  • consumes the input and the states
  • outputs its
  • last hidden state,
  • last hidden state (yes, again! — with return_sequences=True and return_state=True, Keras returns it twice),
  • last cell state
  • uses the last hidden state as the next input for itself
  • uses the last hidden state + last cell state as the next states for itself
  • uses the last hidden state for the prediction at the current time step

For details about the Encoder-Decoder model and LSTM, you can check my YouTube playlists.

Let’s review the code

You can match the above figure with the below code.


Here is the complete code:


#@title latentSpaceDimension is the dimension of each state vector
latentSpaceDimension = 16

def create_hard_coded_decoder_input_model(batch_size):
  # The first part is the encoder
  encoder_inputs = Input(shape=(n_timesteps_in, n_features), name='encoder_inputs')
  encoder_lstm = LSTM(latentSpaceDimension, return_state=True, name='encoder_lstm')
  encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)

  # initial context vector is the states of the encoder
  states = [state_h, state_c]

  # Set up the decoder layers
  decoder_inputs = Input(shape=(1, n_features))
  decoder_lstm = LSTM(latentSpaceDimension, return_sequences=True, return_state=True, name='decoder_lstm')
  decoder_dense = Dense(n_features, activation='softmax', name='decoder_dense')

  all_outputs = []
  # Prepare decoder input data that just contains the start character (0)
  # Note that we made it a constant one-hot-encoded in the model,
  # that is, [1 0 0 0 0 0 0 0 0 0] is the initial input for each loop
  decoder_input_data = np.zeros((batch_size, 1, n_features))
  decoder_input_data[:, 0, 0] = 1

  inputs = decoder_input_data
  # decoder will only process one timestep at a time
  for _ in range(n_timesteps_in):
    # Run the decoder on one timestep
    outputs, state_h, state_c = decoder_lstm(inputs, initial_state=states)
    outputs = decoder_dense(outputs)
    # Store the current prediction (we will concatenate all predictions later)
    all_outputs.append(outputs)
    # Reinject the outputs as inputs for the next loop iteration
    # as well as update the states
    inputs = outputs
    states = [state_h, state_c]

  # Concatenate all predictions such as [batch_size, timesteps, features]
  decoder_outputs = Lambda(lambda x: K.concatenate(x, axis=1))(all_outputs)

  # Define and compile model
  model = Model(encoder_inputs, decoder_outputs, name='model_encoder_decoder')
  model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
  return model
  • Create and compile the model
batch_size = 10
model_encoder_decoder=create_hard_coded_decoder_input_model(batch_size=batch_size)
#model_encoder_decoder.summary()
time: 1.17 s

Train model

Actually, you can train the model with a simple fit method as below.

model_encoder_decoder.fit(X_train, y_train,
batch_size=batch_size,
epochs=30,
validation_split=0.2)

However, I will use my train_test function, which implements Early Stopping monitoring validation accuracy, for comparison purposes.
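
The full train_test helper is available in the Colab notebook. A minimal sketch of what it does (EarlyStopping on val_accuracy, then reporting train/test accuracy) could look like the following; the validation_split value and the reporting format here are my assumptions, not the notebook's exact code.

from tensorflow.keras.callbacks import EarlyStopping

def train_test(model, X_train, y_train, X_test, y_test,
               batch_size=32, epochs=40, patience=5, verbose=1):
  # train with early stopping that monitors validation accuracy
  early_stop = EarlyStopping(monitor='val_accuracy', patience=patience, verbose=1)
  history = model.fit(X_train, y_train,
                      batch_size=batch_size, epochs=epochs,
                      validation_split=0.1,
                      callbacks=[early_stop], verbose=verbose)
  # report prediction accuracy on the train and test sets
  _, train_acc = model.evaluate(X_train, y_train, batch_size=batch_size, verbose=0)
  _, test_acc = model.evaluate(X_test, y_test, batch_size=batch_size, verbose=0)
  print('PREDICTION ACCURACY (%):')
  print('Train: %.3f, Test: %.3f' % (train_acc * 100, test_acc * 100))
  return history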

train_test(model_encoder_decoder, X_train, y_train, X_test, y_test,
           batch_size=batch_size, epochs=40, patience=5, verbose=1)

training for 40 epochs begins with EarlyStopping(monitor= val_accuracy, patience= 5 )....
Epoch 1/40
180/180 [==============================] - 10s 14ms/step - loss: 2.2497 - accuracy: 0.2251 - val_loss: 1.9826 - val_accuracy: 0.3275
***
***
Epoch 35/40
180/180 [==============================] - 1s 6ms/step - loss: 0.0177 - accuracy: 1.0000 - val_loss: 0.0333 - val_accuracy: 0.9975
Epoch 36/40
180/180 [==============================] - 1s 6ms/step - loss: 0.0142 - accuracy: 1.0000 - val_loss: 0.0294 - val_accuracy: 0.9950
Epoch 00036: early stopping
40 epoch training finished...

PREDICTION ACCURACY (%):
Train: 99.950, Test: 100.000
[training history plots]
10 examples from test data...
Input Expected Predicted T/F
[2, 3, 1, 1] [1, 1, 3, 2] [1, 1, 3, 2] True
[7, 9, 1, 6] [6, 1, 9, 7] [6, 1, 9, 7] True
[2, 9, 3, 8] [8, 3, 9, 2] [8, 3, 9, 2] True
[7, 7, 9, 9] [9, 9, 7, 7] [9, 9, 7, 7] True
[7, 1, 1, 7] [7, 1, 1, 7] [7, 1, 1, 7] True
[8, 4, 6, 9] [9, 6, 4, 8] [9, 6, 4, 8] True
[2, 5, 9, 9] [9, 9, 5, 2] [9, 9, 5, 2] True
[2, 2, 4, 5] [5, 4, 2, 2] [5, 4, 2, 2] True
[8, 7, 5, 7] [7, 5, 7, 8] [7, 5, 7, 8] True
[4, 9, 9, 1] [1, 9, 9, 4] [1, 9, 9, 4] True
Accuracy: 1.0
time: 47.6 s

Observations

  • When the sequence size (n_timesteps_in) is 4, the Encoder-Decoder model terminates at Epoch 31 with 99% accuracy.

ATTENTION MECHANISM

Why?

According to the inventors “Neural Machine Translation by Jointly Learning to Align and Translate” by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio:

  • “One of the motivations behind the proposed approach (attention mechanism) was the use of a fixed-length context vector in the basic encoder-decoder approach. We conjectured that this limitation may make the basic encoder-decoder approach to underperform with long sentences. “

We can check the validity of these arguments by increasing the sequence size (n_timesteps_in) to 16

  • Remember that when the sequence size (n_timesteps_in) is 4, the Encoder-Decoder model terminates at Epoch 31 with 99% accuracy.
  • However, when the sequence size (n_timesteps_in) is 16, the Encoder-Decoder model runs all 40 epochs and finishes with only 36% accuracy!

That is, as argued, the Encoder-Decoder model underperforms with long sequences.

How does it work?

According to the inventors:

  • “The proposed approach provides an intuitive way to inspect the (soft-) alignment between the words in a generated translation and those in a source sentence”.

To understand how the attention mechanism works, first compare the Encoder-Decoder model we coded above with an Encoder-Decoder model that includes the attention mechanism, using the figures below.

[Figure: the basic Encoder-Decoder model (without attention)]

Note that:

In the above figure, the Encoder-Decoder model we have coded uses:

  • only the encoder’s last hidden and cell states
  • the encoder’s last states as the initial context vector, and only once

In the below figure, the Encoder-Decoder model with the attention mechanism:

  • uses not only the encoder’s last hidden and cell states but also the encoder’s hidden states generated at all the time steps
  • uses all of the encoder’s hidden states at every consecutive decoding time step

Basically:

  • First, we initialize the Decoder states by using the last states of the Encoder, as usual.
  • Then, at each decoding time step:
  • we use all the Encoder’s hidden states and the previous Decoder’s output to calculate a Context Vector by applying the Attention Mechanism
  • lastly, we concatenate the Context Vector with the previous Decoder’s output to create the input to the Decoder (see the condensed sketch right after this list)
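
As a compact sketch (using the same layer and variable names that appear in the full implementation later in this part), each decoding step looks roughly like this:

# assumed to exist already: attention, decoder_lstm, decoder_dense,
# encoder_outputs, encoder_state_h, encoder_state_c, and the start-symbol `inputs`
query = encoder_state_h                          # initial attention query: encoder's last hidden state
states = [encoder_state_h, encoder_state_c]      # initial decoder states: encoder's last states
for t in range(n_timesteps_in):
  context_vector, attention_weights = attention(query, encoder_outputs)
  step_input = tf.concat([tf.expand_dims(context_vector, 1), inputs], axis=-1)
  query, state_h, state_c = decoder_lstm(step_input, initial_state=states)
  prediction = decoder_dense(query)              # prediction for this time step
  inputs = tf.expand_dims(prediction, 1)         # re-inject the prediction as the next input
  states = [state_h, state_c]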

I will provide a more detailed explanation about the model after discussing and implementing Bahdanau attention.

[Figure: the Encoder-Decoder model with the attention mechanism]

Attention: How to calculate Context Vector

According to “Effective Approaches to Attention-based Neural Machine Translation” by Minh-Thang Luong, Hieu Pham, Christopher D. Manning, the attention mechanism above is called “Global Attention”:

“The idea of a global attentional model is to consider all the hidden states of the encoder $h_s$ when deriving the context vector $c_t$.”

That is, we attend to all the encoder’s hidden states when generating each decoder output, as follows:

Notation

$h_s$: all the hidden states of the encoder

$h_t$: the previous hidden state of the decoder (the previous time step’s output)

$c_t$: the context vector

$W$: a weight matrix that parametrizes the calculations

Calculate a score to relate all the Encoder’s hidden states to the previous Decoder’s output

There are many different scores proposed by researchers. The most important ones are:

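For reference, these are the global-attention score functions listed in the Luong et al. paper; the concat (additive) form is the Bahdanau-style score we implement below, written here with the $W_1$/$W_2$ decomposition used in the code:

$score(h_t, h_s) = h_t^T h_s$  (dot)

$score(h_t, h_s) = h_t^T W h_s$  (general)

$score(h_t, h_s) = v^T \tanh(W_1 h_t + W_2 h_s)$  (concat / additive, Bahdanau-style)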

You can think of these scores as the level of relationship between each of the Encoder’s hidden states and the previous Decoder’s output.

We use $W$ matrices to parametrize the calculations. That is, we will learn the weight values during training via backpropagation. The model will learn how to calculate better scores.

The $tanh$ activation here is applied on top of a single dense (hidden) layer.

$v$ is another dense layer, with a single output unit.

As a result, we expect that these layers ($W_1$, $W_2$, and $v$) will learn how to calculate a suitable score during training.

Calculate the Attention Weights by normalizing the scores.

These are the weights for each encoder hidden state $h_s$.

Simply, we can use softmax() to calculate the probability distribution.

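In formula form, the attention weight of encoder hidden state $h_s$ at decoding step $t$ is the softmax of its score:

$\alpha_{ts} = \dfrac{\exp(score(h_t, h_s))}{\sum_{s'} \exp(score(h_t, h_{s'}))}$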

Calculate the Context Vector by applying the Attention Weights to the encoder hidden states $h_s$.

Thus, at the end, we will have a weighted sum of the encoder hidden states $h_s$.

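In formula form, the context vector is the attention-weighted sum of the encoder hidden states:

$c_t = \sum_{s} \alpha_{ts} h_s$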

After calculating the context vector, we concatenate it with the previous decoder output (the previous prediction, or the START symbol at the first step) to generate the decoder input for the next time step.

Let’s code Bahdanau Attention Layer

First, I would like to share with you the core code snippet:

Image for post
Image for post

I borrowed the code below from the official TensorFlow website and added the comments needed to relate the above formulas to the code.

Please pay attention to each tensor’s dimensions. That is really important for understanding how it all works together!

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units, verbose=0):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)
    self.verbose = verbose

  def call(self, query, values):
    if self.verbose:
      print('\n******* Bahdanau Attention STARTS******')
      print('query (decoder hidden state): (batch_size, hidden size) ', query.shape)
      print('values (encoder all hidden state): (batch_size, max_len, hidden size) ', values.shape)

    # query hidden state shape == (batch_size, hidden size)
    # query_with_time_axis shape == (batch_size, 1, hidden size)
    # values shape == (batch_size, max_len, hidden size)
    # we are doing this to broadcast addition along the time axis to calculate the score
    query_with_time_axis = tf.expand_dims(query, 1)

    if self.verbose:
      print('query_with_time_axis: (batch_size, 1, hidden size) ', query_with_time_axis.shape)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))

    if self.verbose:
      print('score: (batch_size, max_length, 1) ', score.shape)

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)
    if self.verbose:
      print('attention_weights: (batch_size, max_length, 1) ', attention_weights.shape)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    if self.verbose:
      print('context_vector before reduce_sum: (batch_size, max_length, hidden_size) ', context_vector.shape)
    context_vector = tf.reduce_sum(context_vector, axis=1)
    if self.verbose:
      print('context_vector after reduce_sum: (batch_size, hidden_size) ', context_vector.shape)
      print('\n******* Bahdanau Attention ENDS******')
    return context_vector, attention_weights

time: 26.4 ms
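
As a quick, illustrative shape check (with random tensors standing in for the real encoder and decoder states), you can call the layer like this:

attention_layer = BahdanauAttention(latentSpaceDimension, verbose=0)
sample_query = tf.random.uniform((1, latentSpaceDimension))                   # decoder's previous hidden state
sample_values = tf.random.uniform((1, n_timesteps_in, latentSpaceDimension))  # all encoder hidden states
context_vector, attention_weights = attention_layer(sample_query, sample_values)
print(context_vector.shape)     # (1, latentSpaceDimension)
print(attention_weights.shape)  # (1, n_timesteps_in, 1)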

Integrate the attention layer into the Encoder-Decoder model


In an Encoder-Decoder with Attention Layer set-up,

Encoder provides:

  • the initial states, by sending its last hidden state + last cell state
  • the material for the context vector, by sending all of its hidden states

The decoder needs 2 inputs to generate/predict an output:

  • an input tensor
  • a state tensor

The decoder:

  • initializes its states by consuming the initial states (the encoder’s last hidden state + last cell state)
  • uses the encoder’s last hidden state as the initial attention query
  • calculates the attention weights using this query + all the encoder’s hidden states
  • applies the attention weights to all the encoder’s hidden states and finds the context vector
  • concatenates the context vector + the START symbol to generate the first decoder input
  • then runs in a loop; at each time step, it:
  • consumes the input and the states
  • outputs its last hidden state, last hidden state (yes, again!), and last cell state
  • uses the last hidden state + last cell state as the next states
  • uses the last hidden state as the next attention query
  • calculates the attention weights using this query + all the encoder’s hidden states
  • applies the attention weights to all the encoder’s hidden states and finds the context vector
  • concatenates the context vector + the previous prediction to generate the next decoder input

verbose = 0  # set to 1 to see all debug messages

batch_size = 1
if verbose:
  print('***** Model Hyper Parameters *******')
  print('latentSpaceDimension: ', latentSpaceDimension)
  print('batch_size: ', batch_size)
  print('sequence length: ', n_timesteps_in)
  print('n_features: ', n_features)
  print('\n***** TENSOR DIMENSIONS *******')

# The first part is the encoder
encoder_inputs = Input(shape=(n_timesteps_in, n_features), name='encoder_inputs')
encoder_lstm = LSTM(latentSpaceDimension, return_sequences=True, return_state=True, name='encoder_lstm')
encoder_outputs, encoder_state_h, encoder_state_c = encoder_lstm(encoder_inputs)

if verbose:
  print('Encoder output shape: (batch size, sequence length, latentSpaceDimension) {}'.format(encoder_outputs.shape))
  print('Encoder Hidden state shape: (batch size, latentSpaceDimension) {}'.format(encoder_state_h.shape))
  print('Encoder Cell state shape: (batch size, latentSpaceDimension) {}'.format(encoder_state_c.shape))

# initial context vector is the states of the encoder
encoder_states = [encoder_state_h, encoder_state_c]
if verbose:
  print(encoder_states)

# Set up the attention layer
attention = BahdanauAttention(latentSpaceDimension, verbose=verbose)

# Set up the decoder layers
decoder_inputs = Input(shape=(1, (n_features + latentSpaceDimension)), name='decoder_inputs')
decoder_lstm = LSTM(latentSpaceDimension, return_state=True, name='decoder_lstm')
decoder_dense = Dense(n_features, activation='softmax', name='decoder_dense')

all_outputs = []

# 1 initial decoder's input data
# Prepare initial decoder input data that just contains the start character
# Note that we made it a constant one-hot-encoded in the model
# that is, [1 0 0 0 0 0 0 0 0 0] is the first input for each loop
# one-hot encoded zero (0) is the start symbol
inputs = np.zeros((batch_size, 1, n_features))
inputs[:, 0, 0] = 1

# 2 initial decoder's state
# encoder's last hidden state + last cell state
decoder_outputs = encoder_state_h
states = encoder_states
if verbose:
  print('initial decoder inputs: ', inputs.shape)

# decoder will only process one time step at a time
for _ in range(n_timesteps_in):

  # 3 pay attention
  # create the context vector by applying attention to
  # decoder_outputs (last hidden state) + encoder_outputs (all hidden states)
  context_vector, attention_weights = attention(decoder_outputs, encoder_outputs)
  if verbose:
    print("Attention context_vector: (batch size, units) {}".format(context_vector.shape))
    print("Attention weights : (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
    print('decoder_outputs: (batch_size, latentSpaceDimension) ', decoder_outputs.shape)

  context_vector = tf.expand_dims(context_vector, 1)
  if verbose:
    print('Reshaped context_vector: ', context_vector.shape)

  # 4. concatenate the context vector + the current input to form the next decoder input
  inputs = tf.concat([context_vector, inputs], axis=-1)

  if verbose:
    print('After concat inputs: (batch_size, 1, n_features + hidden_size): ', inputs.shape)

  # 5. pass the concatenated vector to the LSTM
  # Run the decoder on one timestep with the attended input and the previous states
  decoder_outputs, state_h, state_c = decoder_lstm(inputs, initial_state=states)

  # 6. Use the last hidden state to predict the output
  # save the current prediction; we will concatenate all predictions later
  outputs = decoder_dense(decoder_outputs)
  outputs = tf.expand_dims(outputs, 1)
  all_outputs.append(outputs)

  # 7. Reinject the output (prediction) as inputs for the next loop iteration
  # as well as update the states
  inputs = outputs
  states = [state_h, state_c]

# 8. After running the Decoder for the max time steps,
# we have created a prediction list for the output sequence.
# Convert the list to an output array by concatenating all predictions,
# such as [batch_size, timesteps, features]
decoder_outputs = Lambda(lambda x: K.concatenate(x, axis=1))(all_outputs)

# 9. Define and compile the model
model_encoder_decoder_Bahdanau_Attention = Model(encoder_inputs, decoder_outputs, name='model_encoder_decoder')
model_encoder_decoder_Bahdanau_Attention.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

TRAIN THE MODEL WITH ATTENTION

train_test(model_encoder_decoder_Bahdanau_Attention, X_train, y_train , X_test,
y_test, batch_size=batch_size,epochs=40, patience=3, verbose=1)
training for 40 epochs begins with EarlyStopping(monitor= val_accuracy, patience= 3 )....
Epoch 1/40
1800/1800 [==============================] - 19s 7ms/step - loss: 2.0012 - accuracy: 0.2725 - val_loss: 1.4980 - val_accuracy: 0.4025
***
Epoch 7/40
1800/1800 [==============================] - 11s 6ms/step - loss: 5.6494e-04 - accuracy: 0.9999 - val_loss: 1.1552e-06 - val_accuracy: 1.0000
Epoch 8/40
1800/1800 [==============================] - 11s 6ms/step - loss: 7.0385e-05 - accuracy: 1.0000 - val_loss: 1.7178e-07 - val_accuracy: 1.0000
Epoch 00008: early stopping
40 epoch training finished...

PREDICTION ACCURACY (%):
Train: 100.000, Test: 100.000
[training history plots]
10 examples from test data...
Input Expected Predicted T/F
[2, 3, 1, 1] [1, 1, 3, 2] [1, 1, 3, 2] True
[7, 9, 1, 6] [6, 1, 9, 7] [6, 1, 9, 7] True
[2, 9, 3, 8] [8, 3, 9, 2] [8, 3, 9, 2] True
[7, 7, 9, 9] [9, 9, 7, 7] [9, 9, 7, 7] True
[7, 1, 1, 7] [7, 1, 1, 7] [7, 1, 1, 7] True
[8, 4, 6, 9] [9, 6, 4, 8] [9, 6, 4, 8] True
[2, 5, 9, 9] [9, 9, 5, 2] [9, 9, 5, 2] True
[2, 2, 4, 5] [5, 4, 2, 2] [5, 4, 2, 2] True
[8, 7, 5, 7] [7, 5, 7, 8] [7, 5, 7, 8] True
[4, 9, 9, 1] [1, 9, 9, 4] [1, 9, 9, 4] True
Accuracy: 1.0
time: 1min 38s

Observations

When the sequence size (n_timesteps_in) is 4

  • Encoder-Decoder model terminates at Epoch 31 with 99% accuracy.
  • Encoder-Decoder model with Attention terminates at Epoch 9 with 100% accuracy.

However, when the sequence size (n_timesteps_in) is 16

  • Encoder-Decoder model runs all 40 epochs and finishes with only 36% accuracy!
  • Encoder-Decoder model with Attention terminates at Epoch 16 with 99% accuracy.

We can conclude that the Encoder-Decoder model with Attention is much more scalable in terms of sequence length.

PREDICT WITH THE TRAINED MODEL

pred=model_encoder_decoder_Bahdanau_Attention.predict(X_test[0].reshape(1,n_timesteps_in,n_features), batch_size=1)
print('input', one_hot_decode(X_test[0]))
print('expected', one_hot_decode(y_test[0]))
print('predicted', one_hot_decode(pred[0]))
input [2, 3, 1, 1]
expected [1, 1, 3, 2]
predicted [1, 1, 3, 2]
time: 46.1 ms

BONUS 1: DEPICT THE ATTENTION

  • For a sample input, we will record the attention values for each time step
  • Then, we will use the color codes to depict the relation between inputs and outputs
  • The lighter colors indicate higher values (attention)
  • That is, the model pays more attention to some inputs when creating each output (prediction)
  • The model learns where to look when creating each output!
def evaluate(seq_in):
  attention_plot = np.zeros((n_timesteps_in, n_timesteps_in))
  print('attention_plot shape: (n_timesteps_in, n_timesteps_in) {}'.format(attention_plot.shape))

  sequence = one_hot_encode(seq_in, n_features)
  encoder_inputs = array(sequence).reshape(1, n_timesteps_in, n_features)
  encoder_inputs = tf.convert_to_tensor(encoder_inputs, dtype=tf.float32)
  print('Encoder input shape: (batch size, sequence length, n_features) {}'.format(encoder_inputs.shape))

  # we reuse the already-trained encoder_lstm, attention, decoder_lstm and decoder_dense layers
  encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)

  print('Encoder output shape: (batch size, sequence length, latentSpaceDimension) {}'.format(encoder_outputs.shape))
  print('Encoder Hidden state shape: (batch size, latentSpaceDimension) {}'.format(state_h.shape))
  print('Encoder Cell state shape: (batch size, latentSpaceDimension) {}'.format(state_c.shape))

  # initial context vector is the states of the encoder
  states = [state_h, state_c]

  all_outputs = []

  # INIT DECODER
  # Prepare decoder input data that just contains the start character 0
  # that is, [1 0 0 0 0 0 0 0 0 0] is the first input for each loop
  decoder_input_data = np.zeros((1, 1, n_features))
  decoder_input_data[:, 0, 0] = 1
  inputs = decoder_input_data
  # initial hidden state
  decoder_outputs = state_h

  print('initial decoder inputs: ', inputs.shape)

  # decoder will only process one timestep at a time
  for t in range(n_timesteps_in):

    # pay attention
    context_vector, attention_weights = attention(decoder_outputs, encoder_outputs)

    print("Attention context_vector: (batch size, units) {}".format(context_vector.shape))
    print("Attention weights : (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

    # storing the attention weights to plot later on
    attention_weights = tf.reshape(attention_weights, (-1,))
    attention_plot[t] = attention_weights.numpy()

    decoder_outputs = tf.expand_dims(decoder_outputs, 1)
    print('decoder_outputs: (batch_size, 1, latentSpaceDimension) ', decoder_outputs.shape)

    context_vector = tf.expand_dims(context_vector, 1)
    print('Reshaped context_vector: ', context_vector.shape)
    inputs = tf.concat([context_vector, inputs], axis=-1)
    print('After concat inputs: (batch_size, 1, n_features + hidden_size): ', inputs.shape)

    # passing the concatenated vector to the LSTM
    # Run the decoder on one timestep
    decoder_outputs, state_h, state_c = decoder_lstm(inputs, initial_state=states)

    outputs = decoder_dense(decoder_outputs)
    # Store the current prediction (we will concatenate all predictions later)
    outputs = tf.expand_dims(outputs, 1)
    all_outputs.append(outputs)
    # Reinject the outputs as inputs for the next loop iteration
    # as well as update the states
    inputs = outputs
    states = [state_h, state_c]

  # Concatenate all predictions such as [batch_size, timesteps, features]
  decoder_outputs = Lambda(lambda x: K.concatenate(x, axis=1))(all_outputs)
  seq_out = one_hot_decode(decoder_outputs[0])

  return seq_in, seq_out, attention_plot

time: 63.8 ms

# function for plotting the attention weights
def plot_attention(attention, sequence, predicted_sequence):
  fig = plt.figure(figsize=(8, 8))
  ax = fig.add_subplot(1, 1, 1)
  ax.matshow(attention, cmap='viridis')

  fontdict = {'fontsize': 14}

  ax.set_xticklabels([''] + sequence, fontdict=fontdict, rotation=90)
  ax.set_yticklabels([''] + predicted_sequence, fontdict=fontdict)

  ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
  ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

  plt.show()

time: 10.3 ms

def translate(seq_in):
  seq_in, seq_out, attention_plot = evaluate(seq_in)

  print('Input: %s' % (seq_in))
  print('Predicted translation: {}'.format(seq_out))

  attention_plot = attention_plot[:len(seq_out), :len(seq_in)]
  plot_attention(attention_plot, seq_in, seq_out)

time: 3.59 ms

#translate([1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 7, 6])
translate([1, 2, 3, 4])

Input: [1, 2, 3, 4]
Predicted translation: [4, 3, 2, 1]
[Attention plot: lighter cells show which inputs the model attends to for each predicted output]
time: 176 ms

BONUS 2: LUONG ATTENTION

As we discussed, the only difference between Bahdanau and Luong Attention is the calculation of the score.

Here we use the dot product to find the score, as suggested by Luong et al.:

# LUONG Dot-product
score = tf.transpose(tf.matmul(query_with_time_axis, values_transposed), perm=[0, 2, 1])

class LuongAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(LuongAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    print('\n******* Luong Attention STARTS******')
    print('query (decoder hidden state): (batch_size, hidden size) ', query.shape)
    print('values (encoder all hidden state): (batch_size, max_len, hidden size) ', values.shape)

    # query hidden state shape == (batch_size, hidden size)
    # query_with_time_axis shape == (batch_size, 1, hidden size)
    # values shape == (batch_size, max_len, hidden size)
    # we add a time axis to the query so that it can be matrix-multiplied with the values
    query_with_time_axis = tf.expand_dims(query, 1)
    print('query_with_time_axis: (batch_size, 1, hidden size) ', query_with_time_axis.shape)

    values_transposed = tf.transpose(values, perm=[0, 2, 1])
    print('values_transposed: (batch_size, hidden size, max_len) ', values_transposed.shape)

    # score shape == (batch_size, max_length, 1)
    # BAHDANAU ADDITIVE:
    # score = self.V(tf.nn.tanh(
    #     self.W1(query_with_time_axis) + self.W2(values)))

    # LUONG Dot-product
    score = tf.transpose(tf.matmul(query_with_time_axis, values_transposed), perm=[0, 2, 1])
    print('score: (batch_size, max_length, 1) ', score.shape)

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)
    print('attention_weights: (batch_size, max_length, 1) ', attention_weights.shape)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    print('context_vector before reduce_sum: (batch_size, max_length, hidden_size) ', context_vector.shape)
    context_vector = tf.reduce_sum(context_vector, axis=1)
    print('context_vector after reduce_sum: (batch_size, hidden_size) ', context_vector.shape)

    print('\n******* Luong Attention ENDS******')
    return context_vector, attention_weights
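
To re-run the experiments with Luong attention instead of Bahdanau attention, the only change needed in the model-building code above (assuming everything else stays the same) would be swapping the attention layer:

# before: attention = BahdanauAttention(latentSpaceDimension, verbose=verbose)
attention = LuongAttention(latentSpaceDimension)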

WRAP UP

  • We have discussed the Encoder-Decoder model and observed that, when the input sequence gets longer, the Encoder-Decoder model cannot relate the outputs to the inputs successfully.
  • We have seen that we can relate each output with every input by using an attention mechanism.
  • In the experiments, the positive effect of the attention mechanism on performance is clear.
  • We implemented Bahdanau-style (additive) attention, which is a global attention mechanism.
  • Another global attention mechanism is Luong (multiplicative) attention, in which only the calculation of the score values differs.
  • If only the dot product is used in Luong (multiplicative) attention, the score calculation is more efficient, since there are no additional weights to learn.
  • We can summarize the methods for calculating the score values as the dot, general, and concat (additive) functions given in the Luong et al. paper and listed earlier in this part.

DO IT YOURSELF

  • Add Teacher Forcing
  • Add Beam Search
  • Convert Bahdanau Attention to Luong Attention and re-run the experiments
  • Use the Keras Model Subclassing API for creating the Encoder and Decoder

You can follow Murat Karakaya Akademi on these social networks:

YouTube

Facebook

Instagram

LinkedIn

Github

Kaggle

Medium

Written by Murat Karakaya, Assoc. Prof., Computer Engineering
