LeDL Week 5: Meeting the Transformer
Week 16 of documenting my AI/ML learning journey (Jan 26 - Feb 1)
What was discussed last week…
I learned how to identify and avoid “artifacts”, the various types of errors that can show up in transpose convolutions.
I also learned about the capabilities of the loss (loss functions) parameter in the .compile() function.
Today is a bit of a special post! All of this week’s work falls under one time range (Tuesday to Saturday, as listed below) due to the complexity of this topic and how busy this week was.
Also, I’m going to start referring to “Learning Deep Learning” as “LeDL”, because I felt the titles of these posts needed to be a bit more “lively”, if you know what I mean.
Tuesday, January 28th - Saturday, February 1st
This week I got into transformers! (not the movie series)
What are transformers? Transformers are a type of deep learning architecture based on self-attention mechanisms, and they have changed the world of natural language processing and other machine learning tasks. They’re the backbone of many state-of-the-art models, such as BERT and GPT (e.g., ChatGPT). They can also analyze sequential data other than text, such as time series and audio.
Transformers are better than other traditional models for sequential data, such as RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks), in terms of holding long-term dependencies (a long-term memory of important information to bring up at a later point) and parallelization (dividing work across multiple processing units, making processing more efficient).
“A self-attention mechanism is a key component of transformer models in machine learning, particularly in natural language processing tasks. It allows the model to weigh the importance of different elements in an input sequence relative to each other, enabling it to capture contextual relationships and long-range dependencies.”
In self-attention, each word is given three vectors, which are stacked into three matrices (2D arrays/lists): query, key, and value. After the dot product of the query and key matrices is calculated, the resulting matrix of scores is scaled (meaning it’s multiplied by 1/sqrt(d_k), where d_k is the dimensionality of the key vectors) and optionally masked (only used in certain cases), meaning the matrix hides most of its data/tokens in the first row and then slowly “widens its focus” by steadily revealing the other data/tokens row by row for the model to analyze. Finally, a softmax function is applied to the scaled scores, producing the attention weights. Then, the attention weights are used to weigh the value vectors.
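To make that concrete, here’s a tiny sketch of the whole computation in plain NumPy (the numbers and shapes are completely made up for illustration; in a real transformer the Q, K, and V matrices come from learned weights):

import numpy as np

# Two toy tokens: 4-dimensional queries/keys and 2-dimensional values
Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
K = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])

d_k = K.shape[-1]                # dimensionality of the key vectors
scores = Q @ K.T / np.sqrt(d_k)  # dot products, scaled by 1/sqrt(d_k)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax = attention weights
context = weights @ V            # the weights are used to weigh the value vectors
print(weights)   # each row sums to 1
print(context)   # one context vector per token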
A transformer consists of three parts: first the encoder, then the hidden layer, and then the decoder:
The Encoder (feat. self-attention)
The encoder focuses on the “self-attention mechanism” part, where it decides which parts of the input to give greater “attention” relative to other parts, i.e., capturing dependencies and sensing the context and relationships. For example, in an NLP (natural language processing) model, if the transformer is given the following sentence as input:
What is @!d 4+4?
A well-trained transformer could figure out that the “@!d” part of the input should be given less “attention”.
In code, a self-attention mechanism can be implemented in TensorFlow as a custom Keras Layer (often given a name like SelfAttention); Keras also provides built-in attention layers such as tf.keras.layers.MultiHeadAttention.
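As a quick aside (not from the course), here’s a minimal sketch of using that built-in layer for self-attention, with made-up dimensions; self-attention just means passing the same tensor as the query, key, and value:

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)
x = tf.random.uniform((1, 5, 32))   # (batch, sequence length, embedding size)
out = mha(query=x, value=x, key=x)  # q, k, and v all come from the same input
print(out.shape)                    # (1, 5, 32): same shape as the input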

“MatMul” is dot product (matrix multiplication) computation. As you can see, the query “Q” and key “K” matrices are first multiplied together (a dot product), the result is scaled (and optionally masked) and run through a softmax to become the attention weights, and those weights are then multiplied with the value “V” matrix to make the context matrix: the combination of all of the matrices.
But there is also a more advanced version, MultiHeadSelfAttention, which splits the Q, K, and V matrices into multiple parallel attention operations, called “heads”, does the attention computation for each head, and then combines all of the heads at the end of the layer to capture diverse relationships. The example code below uses MultiHeadSelfAttention.

This is MultiHeadSelfAttention, baby!
Part 1: The __init__ method (MultiHeadSelfAttention)
# imports used throughout this post's snippets
import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Dropout

class MultiHeadSelfAttention(Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        # each head works on a smaller slice (projection) of the embedding
        self.projection_dim = embed_dim // num_heads
        # dense layers that produce the query, key, and value matrices
        self.query_dense = Dense(embed_dim)
        self.key_dense = Dense(embed_dim)
        self.value_dense = Dense(embed_dim)
        # dense layer that combines the heads' outputs back together at the end
        self.combine_heads = Dense(embed_dim)
Part 2: The attention and split heads methods (MultiHeadSelfAttention)
    # continuation from last code snippet (these methods live inside the MultiHeadSelfAttention class)
    def attention(self, query, key, value):
        # The first matrix multiplication (MatMul): dot products of the queries and keys
        score = tf.matmul(query, key, transpose_b=True)
        # ".cast" is TensorFlow's way of typeCASTing variables
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        # This is where the scaling comes in: divide by the square root of the key dimensionality, i.e. 1/sqrt(d_k)
        scaled_score = score / tf.math.sqrt(dim_key)
        # Softmax turns the scaled scores into the attention weights
        weights = tf.nn.softmax(scaled_score, axis=-1)
        # The second matrix multiplication (MatMul), aka weighing the "Value" matrix using the attention weights
        output = tf.matmul(weights, value)
        return output, weights

    def split_heads(self, x, batch_size):
        # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, projection_dim)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])
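If it helps, here’s a quick shape check of what split_heads is doing (a sketch with made-up dimensions, assuming the class above has been defined):

layer = MultiHeadSelfAttention(embed_dim=64, num_heads=8)
x = tf.random.uniform((2, 10, 64))                         # (batch, seq_len, embed_dim)
q = layer.split_heads(layer.query_dense(x), batch_size=2)
print(q.shape)                                             # (2, 8, 10, 8): (batch, num_heads, seq_len, projection_dim)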
Part 3: The call method (MultiHeadSelfAttention)
    # continuation from last code snippet (this method also lives inside MultiHeadSelfAttention)
    def call(self, inputs):
        # tf.shape(inputs)[0] is the size of the first dimension of "inputs" (zero-indexing), i.e. the batch size
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)
        key = self.key_dense(inputs)
        value = self.value_dense(inputs)
        # Split the q, k, and v tensors into multiple heads to enable joint attention across different subspaces
        query = self.split_heads(query, batch_size)
        key = self.split_heads(key, batch_size)
        value = self.split_heads(value, batch_size)
        # Apply (scaled dot-product) self-attention to the tensors
        attention, _ = self.attention(query, key, value)
        # Transpose the attention output back to (batch, seq_len, num_heads, projection_dim)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        # Reshape to (batch, seq_len, embed_dim) to prepare for combining the heads
        concat_attention = tf.reshape(attention, (batch_size, -1, self.embed_dim))
        # Combine the heads by applying a linear transformation
        output = self.combine_heads(concat_attention)
        return output
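With all three parts pasted together, a quick smoke test (again with made-up dimensions) shows that the layer keeps the input’s shape, which is exactly what lets us stack it inside bigger blocks later:

mhsa = MultiHeadSelfAttention(embed_dim=64, num_heads=8)
dummy = tf.random.uniform((2, 10, 64))  # (batch, seq_len, embed_dim)
print(mhsa(dummy).shape)                # (2, 10, 64): same shape in, same shape out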
After coding the (multi-head) self-attention mechanism, its output is put through a feedforward network, along with layer normalization and dropout layers, inside the TransformerBlock class, sometimes called the EncoderLayer class. Both of these classes are identical in functionality; the naming convention just depends on whether the class will be used in BOTH the encoder and the decoder, or ONLY in the encoder part of the transformer, respectively. They’re named based on their context. Nevertheless, when using TensorFlow, a TransformerBlock or an equivalent class is essential when creating a transformer.
Part 4: The __init__ method (TransformerBlock/EncoderLayer)
# assume all the libraries and modules imported in the past snippets are still available
class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadSelfAttention(embed_dim, num_heads)
        # ffn = feed-forward (neural) network
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim),
        ])
        # two layer-normalization layers and two dropout layers, as per the standard Transformer encoder layer design
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)
Part 5: The call method (TransformerBlock/EncoderLayer)
    # continuation from last code snippet (this method lives inside TransformerBlock)
    def call(self, inputs, training=False):
        # The first set of layer norm and dropout layers is applied after the self-attention...
        attn_output = self.att(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)  # residual (skip) connection + layer norm
        # ...while the second set is applied after the ffn.
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)  # second residual connection + layer norm
Then, we can duplicate the TransformerBlock, aka EncoderLayer, however many times we need to (using a for loop) in a list/array, and then slap on a final dropout layer and/or a final normalization layer (in this example, there’s a final dropout layer but no final normalization layer), and that’s our transformer encoder completed!
Part 6: The TransformerEncoder (whole model)
class TransformerEncoder(Layer):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerEncoder, self).__init__()
        self.num_layers = num_layers
        self.embed_dim = embed_dim
        # "for" loop (as a list comprehension) for duplicating the encoder layers
        self.enc_layers = [TransformerBlock(embed_dim, num_heads, ff_dim, rate) for _ in range(num_layers)]
        # final dropout layer
        self.dropout = Dropout(rate)

    def call(self, inputs, training=False):
        x = inputs
        # run the input through each stacked encoder layer in turn
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training=training)
        # apply the final dropout layer defined above
        return self.dropout(x, training=training)
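And to wrap up, a small usage sketch (the hyperparameters are made up, and in a real model the inputs would come from an embedding layer rather than random numbers):

encoder = TransformerEncoder(num_layers=2, embed_dim=64, num_heads=8, ff_dim=128)
fake_embeddings = tf.random.uniform((2, 10, 64))  # (batch, seq_len, embed_dim)
encoded = encoder(fake_embeddings, training=False)
print(encoded.shape)                              # (2, 10, 64)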
Lessons Learned
I learned what the heck a “transformer” is (btw, the “T” in ChatGPT stands for “transformer”).
Also, I learned how transformers, because they leverage parallelization, are better than RNNs and LSTMs in some cases.
Transformer models have three parts to them: the encoder, the hidden layer, and the decoder. Edit: this is not entirely true
Each layer in the encoder has two basic parts: (multi-head) self-attention, and the feed-forward neural network.
Resources
Course I followed:
Really good video that explains Multihead Attention: