LeDL Week 5: Meeting the Transformer
Week 16 of documenting my AI/ML learning journey (Jan 26 - Feb 1)
What was discussed last week…
I learned how to identify and avoid “artifacts”, the various types of errors that can show up in transpose convolutions.
I also learned about the capabilities of the loss (loss functions) parameter in the .compile() function.
Today is a bit of a special post! All of this week’s work falls under one time range (Tuesday to Saturday, as listed below) due to the complexity of this topic and how busy this week was.
Also, I’m going to start referring to “Learning Deep Learning” as “LeDL”, because I felt the titles of these posts needed to be a bit more “lively”, if you know what I mean.
Tuesday, January 28th - Saturday, February 1st
This week I got into transformers! (not the movie series)
What are transformers? Transformers are a type of deep learning architecture based on self-attention mechanisms, and they have changed the world of natural language processing and other machine learning tasks. They’re the backbone of many state-of-the-art models, such as BERT and GPT (e.g., ChatGPT). They can also analyze sequential data other than text, such as time series and audio.
Transformers are better than other traditional models for sequential data, such as RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks), in terms of holding long-term dependencies (a long-term memory of important information to bring up at a later point) and parallelization (dividing work across multiple processing units, making processing more efficient).
“A self-attention mechanism is a key component of transformer models in machine learning, particularly in natural language processing tasks. It allows the model to weigh the importance of different elements in an input sequence relative to each other, enabling it to capture contextual relationships and long-range dependencies.”
In self-attention, each word is given three vectors, which are stacked into three matrices (2D arrays/lists): query, key, and value. After the dot product of the query and key matrices is calculated, the resulting matrix of scores is scaled (meaning it’s multiplied by 1/sqrt(d_k), where d_k is the dimensionality of the key vectors) and optionally masked (only used in certain cases), meaning the matrix hides most of its data/tokens in the first row and then slowly “widens its focus” by steadily revealing the other data/tokens row by row for the model to analyze. Finally, a softmax function is applied to the scaled scores, producing the attention weights. Then, the attention weights are used to weigh the value vectors.
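To make that concrete, here’s a tiny sketch of the whole computation in plain NumPy (the numbers and shapes are completely made up for illustration; in a real transformer the Q, K, and V matrices come from learned weights):

import numpy as np

# Two toy tokens: 4-dimensional queries/keys and 2-dimensional values
Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
K = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])

d_k = K.shape[-1]                # dimensionality of the key vectors
scores = Q @ K.T / np.sqrt(d_k)  # dot products, scaled by 1/sqrt(d_k)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax = attention weights
context = weights @ V            # the weights are used to weigh the value vectors
print(weights)   # each row sums to 1
print(context)   # one context vector per token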
A transformer consists of three parts: first the encoder, then the hidden layer, and then the decoder:
The Encoder (feat. self-attention)
The encoder focuses on the “self-attention mechanism” part, where it decides which parts of the input to give greater “attention” relative to other parts, i.e., capturing dependencies and sensing the context and relationships. For example, in an NLP (natural language processing) model, if the transformer is given the following sentence as input:
What is @!d 4+4?
A well-trained transformer could figure out that the “@!d” part of the input should be given less “attention”.
In code, a self-attention mechanism can be implemented in TensorFlow as a custom Keras Layer (often given a name like SelfAttention); Keras also provides built-in attention layers such as tf.keras.layers.MultiHeadAttention.
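As a quick aside (not from the course), here’s a minimal sketch of using that built-in layer for self-attention, with made-up dimensions; self-attention just means passing the same tensor as the query, key, and value:

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)
x = tf.random.uniform((1, 5, 32))   # (batch, sequence length, embedding size)
out = mha(query=x, value=x, key=x)  # q, k, and v all come from the same input
print(out.shape)                    # (1, 5, 32): same shape as the input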

“MatMul” is dot product (matrix multiplication) computation. As you can see, the query “Q” and key “K” matrices are first multiplied together (a dot product), the result is scaled (and optionally masked) and run through a softmax to become the attention weights, and those weights are then multiplied with the value “V” matrix to make the context matrix: the combination of all of the matrices.
But there is also a more advanced version, MultiHeadSelfAttention, which splits the Q, K, and V matrices into multiple parallel attention operations, called “heads”, does the attention computation for each head, and then combines all of the heads at the end of the layer to capture diverse relationships. The example code below uses MultiHeadSelfAttention.

This is MultiHeadSelfAttention, baby!
Part 1: The __init__ method (MultiHeadSelfAttention)
# imports used throughout this post's snippets
import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Dropout

class MultiHeadSelfAttention(Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        # each head works on a smaller slice (projection) of the embedding
        self.projection_dim = embed_dim // num_heads
        # dense layers that produce the query, key, and value matrices
        self.query_dense = Dense(embed_dim)
        self.key_dense = Dense(embed_dim)
        self.value_dense = Dense(embed_dim)
        # dense layer that combines the heads' outputs back together at the end
        self.combine_heads = Dense(embed_dim)
Part 2: The attention and split heads methods (MultiHeadSelfAttention)
    # continuation from last code snippet (these methods live inside the MultiHeadSelfAttention class)
    def attention(self, query, key, value):
        # The first matrix multiplication (MatMul): dot products of the queries and keys
        score = tf.matmul(query, key, transpose_b=True)
        # ".cast" is TensorFlow's way of typeCASTing variables
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        # This is where the scaling comes in: divide by the square root of the key dimensionality, i.e. 1/sqrt(d_k)
        scaled_score = score / tf.math.sqrt(dim_key)
        # Softmax turns the scaled scores into the attention weights
        weights = tf.nn.softmax(scaled_score, axis=-1)
        # The second matrix multiplication (MatMul), aka weighing the "Value" matrix using the attention weights
        output = tf.matmul(weights, value)
        return output, weights

    def split_heads(self, x, batch_size):
        # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, projection_dim)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])
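If it helps, here’s a quick shape check of what split_heads is doing (a sketch with made-up dimensions, assuming the class above has been defined):

layer = MultiHeadSelfAttention(embed_dim=64, num_heads=8)
x = tf.random.uniform((2, 10, 64))                         # (batch, seq_len, embed_dim)
q = layer.split_heads(layer.query_dense(x), batch_size=2)
print(q.shape)                                             # (2, 8, 10, 8): (batch, num_heads, seq_len, projection_dim)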
Part 3: The call method (MultiHeadSelfAttention)
    # continuation from last code snippet (this method also lives inside MultiHeadSelfAttention)
    def call(self, inputs):
        # tf.shape(inputs)[0] is the size of the first dimension of "inputs" (zero-indexing), i.e. the batch size
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)
        key = self.key_dense(inputs)
        value = self.value_dense(inputs)
        # Split the q, k, and v tensors into multiple heads to enable joint attention across different subspaces
        query = self.split_heads(query, batch_size)
        key = self.split_heads(key, batch_size)
        value = self.split_heads(value, batch_size)
        # Apply (scaled dot-product) self-attention to the tensors
        attention, _ = self.attention(query, key, value)
        # Transpose the attention output back to (batch, seq_len, num_heads, projection_dim)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        # Reshape to (batch, seq_len, embed_dim) to prepare for combining the heads
        concat_attention = tf.reshape(attention, (batch_size, -1, self.embed_dim))
        # Combine the heads by applying a linear transformation
        output = self.combine_heads(concat_attention)
        return output
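With all three parts pasted together, a quick smoke test (again with made-up dimensions) shows that the layer keeps the input’s shape, which is exactly what lets us stack it inside bigger blocks later:

mhsa = MultiHeadSelfAttention(embed_dim=64, num_heads=8)
dummy = tf.random.uniform((2, 10, 64))  # (batch, seq_len, embed_dim)
print(mhsa(dummy).shape)                # (2, 10, 64): same shape in, same shape out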
After coding the (multi-head) self-attention mechanism, its output is put through a feedforward network, along with layer normalization and dropout layers, inside the TransformerBlock class, sometimes called the EncoderLayer class. Both of these classes are identical in functionality; the naming convention just depends on whether the class will be used in BOTH the encoder and the decoder, or ONLY in the encoder part of the transformer, respectively. They’re named based on their context. Nevertheless, when using TensorFlow, a TransformerBlock or an equivalent class is essential when creating a transformer.
Part 4: The __init__ method (TransformerBlock/EncoderLayer)
# assume all the libraries and modules imported in the past snippets are still available
class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadSelfAttention(embed_dim, num_heads)
        # ffn = feed-forward (neural) network
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim),
        ])
        # two layer-normalization layers and two dropout layers, as per the standard Transformer encoder layer design
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)
Part 5: The call method (TransformerBlock/EncoderLayer)
    # continuation from last code snippet (this method lives inside TransformerBlock)
    def call(self, inputs, training=False):
        # The first set of layer norm and dropout layers is applied after the self-attention...
        attn_output = self.att(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)  # residual (skip) connection + layer norm
        # ...while the second set is applied after the ffn.
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)  # second residual connection + layer norm
Then, we can duplicate the TransformerBlock, aka EncoderLayer, however many times we need to (using a for loop) in a list/array, and then slap on a final dropout layer and/or a final normalization layer (in this example, there’s a final dropout layer but no final normalization layer), and that’s our transformer encoder completed!
Part 6: The TransformerEncoder (whole model)
class TransformerEncoder(Layer):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerEncoder, self).__init__()
        self.num_layers = num_layers
        self.embed_dim = embed_dim
        # "for" loop (as a list comprehension) for duplicating the encoder layers
        self.enc_layers = [TransformerBlock(embed_dim, num_heads, ff_dim, rate) for _ in range(num_layers)]
        # final dropout layer
        self.dropout = Dropout(rate)

    def call(self, inputs, training=False):
        x = inputs
        # run the input through each stacked encoder layer in turn
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training=training)
        # apply the final dropout layer defined above
        return self.dropout(x, training=training)
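And to wrap up, a small usage sketch (the hyperparameters are made up, and in a real model the inputs would come from an embedding layer rather than random numbers):

encoder = TransformerEncoder(num_layers=2, embed_dim=64, num_heads=8, ff_dim=128)
fake_embeddings = tf.random.uniform((2, 10, 64))  # (batch, seq_len, embed_dim)
encoded = encoder(fake_embeddings, training=False)
print(encoded.shape)                              # (2, 10, 64)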
Lessons Learned
I learned what the heck a “transformer” is (btw, the “T” in ChatGPT stands for “transformer”).
Also, I learned how transformers, because they leverage parallelization, are better than RNNs and LSTMs in some cases.
Transformer models have three parts to them: the encoder, the hidden layer, and the decoder. Edit: this is not entirely true
Each layer in the encoder has two basic parts: (multi-head) self-attention, and the feed-forward neural network.
Resources
Course I followed:
Really good video that explains Multihead Attention: