LeDL Week 7: The Technicality of Transformers

Week 18 of documenting my AI/ML learning journey (Feb 9 - Feb 15)

What was discussed last week…

  • I learned that transformers can be very diverse in their structure and usage; no wonder they’re applicable in many different fields!

  • Transformers, no matter their purpose, always have a TransformerBlock class with a MultiHeadAttention mechanism (or just any self-attention mechanism) and an ffn (feedforward neural network).

  • There are many processes that transformers use to help convert different kinds of data into data they can understand e.g. tokenization and patch embedding.

  • Tokens are integer representations (“IDs”) of pieces of text data, which could represent words, characters, or phrases (most likely words).

Sunday, February 9th

Today, the course had me take a step back from the power of transformers to review other models that can handle sequential data, such as:

  • RNNs (Recurrent Neural Networks)

  • LSTMs (Long Short-Term Memory networks)

  • GRUs (Gated Recurrent Units)

To clarify, sequential data is data whose order matters, such as music (audio recordings) and text.
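
Since the course only reviewed these at a high level, here is a minimal Keras sketch of my own (not from the course; the layer sizes are arbitrary, illustrative values) showing how the three recurrent layer types slot into an otherwise identical model:

import tensorflow as tf
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, GRU, Dense

# Minimal sketch: the three recurrent layer types are interchangeable here.
def build_recurrent_model(cell="lstm", vocab_size=10000, embed_dim=64, units=128):
    recurrent_layer = {"rnn": SimpleRNN(units), "lstm": LSTM(units), "gru": GRU(units)}[cell]
    return tf.keras.Sequential([
        Embedding(vocab_size, embed_dim),   # token IDs -> dense vectors
        recurrent_layer,                    # reads the sequence one step at a time
        Dense(1, activation="sigmoid"),     # e.g. a simple binary prediction head
    ])

rnn_model = build_recurrent_model("gru")
rnn_model.summary()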

Many of the processes used to handle data more effectively are shared across sequential models, not just transformers. One of them is tokenization, which breaks input text into tokens; it is followed by text vectorization, which converts those tokens into a numerical format. A toy example of both steps is shown below.
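
To make those two steps concrete, here is a tiny hand-made illustration (the vocabulary and token IDs below are made up; real IDs depend on the vocabulary the model builds):

# Toy illustration only; real token IDs depend on the learned vocabulary.
sample_text = "the cat sat on the mat"

# Tokenization: break the text into tokens (here, simple whitespace-separated words)
tokens = sample_text.split()   # ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Text vectorization: map each token to an integer ID using a vocabulary
vocab = {"": 0, "[UNK]": 1, "the": 2, "cat": 3, "sat": 4, "on": 5, "mat": 6}
token_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]   # [2, 3, 4, 5, 2, 6]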

Wednesday, February 12th

Now we get back into transformers, this time implementing one for text-generation tasks. For you people who like diagrams (me included), the structure boils down to an embedding layer, positional encoding, a stack of transformer blocks, and a final output layer over the vocabulary.

First, I looked at the TextVectorization class, the preprocessing layer that performs the text vectorization step:

import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Preprocess the dataset
vocab_size = 10000
seq_length = 100

# Adapt TextVectorization to the full text 

# An object instance of TextVectorization is being made
vectorizer = TextVectorization(max_tokens=vocab_size, output_mode='int') 

# from_tensor_slices() slices the one-element list (i.e. [text]) along its first dimension,
# giving a dataset with a single element; .batch(1) then wraps it into batches of one.
# No tokenization happens here yet; the vectorizer does that when it scans this dataset:
text_ds = tf.data.Dataset.from_tensor_slices([text]).batch(1) 

vectorizer.adapt(text_ds) 

# Vectorizing the text
vectorized_text = vectorizer([text])[0]

The TextVectorization class

Objects of the TextVectorization class are preprocessing layers for text input, primarily used to map words and sentences to integer sequences.

.adapt()

This TensorFlow method analyzes the tokens in the dataset, counts their occurrences, and, based on these counts, builds a record of the tokens it has seen (whether they represent characters or words), known as a vocabulary.

Note that if a max_tokens parameter is specified when a TextVectorization object/model is initialized (this model, named vectorizer, does specify one), then the layer’s vocabulary will only keep the max_tokens most frequent unique tokens.
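
To see what adapt() actually produces, here is a tiny example of my own on a made-up corpus (the exact vocabulary ordering among equally frequent tokens may differ, but frequent tokens come first, after the padding and out-of-vocabulary entries):

import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Tiny made-up corpus, just to show the mechanics of adapt() and the vocabulary.
toy_texts = ["the cat sat", "the cat ran", "a dog sat"]

toy_vectorizer = TextVectorization(max_tokens=8, output_mode="int")
toy_vectorizer.adapt(tf.data.Dataset.from_tensor_slices(toy_texts).batch(1))

# Index 0 is reserved for padding and index 1 for the out-of-vocabulary token "[UNK]";
# the remaining slots hold the most frequent tokens first.
print(toy_vectorizer.get_vocabulary())
# e.g. ['', '[UNK]', 'the', 'sat', 'cat', 'ran', 'dog', 'a']

# Calling the layer maps words to their vocabulary indices.
print(toy_vectorizer(["the dog sat"]))   # e.g. [[2, 6, 3]]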

Then, after generating input and target sequences for training the transformer model (code not shown), we build the model:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, LayerNormalization, Dropout
from tensorflow.keras.models import Model

# TransformerBlock (the anatomy of a single layer in the model)
class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim),
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=False):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# The TransformerModel
class TransformerModel(Model):  # Model is now properly imported
    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim, num_layers, seq_length):
        super(TransformerModel, self).__init__()
        # The embedding layer maps sparse, high-dimensional categorical data (like word indices) to dense, lower-dimensional vectors
        self.embedding = Embedding(vocab_size, embed_dim)
        self.pos_encoding = self.positional_encoding(seq_length, embed_dim)
        self.transformer_blocks = [TransformerBlock(embed_dim, num_heads, ff_dim) for _ in range(num_layers)]
        self.dense = Dense(vocab_size)

    # Positional Encoding code starts here
    def positional_encoding(self, seq_length, embed_dim):
        angle_rads = self.get_angles(np.arange(seq_length)[:, np.newaxis], np.arange(embed_dim)[np.newaxis, :], embed_dim)
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
        pos_encoding = angle_rads[np.newaxis, ...]
        return tf.cast(pos_encoding, dtype=tf.float32)

    def get_angles(self, pos, i, embed_dim):
        angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(embed_dim))
        return pos * angle_rates
    # Positional Encoding code ends here

    # Assembly and execution of the model; "x" is the variable that's "running through the model", so to speak
    def call(self, inputs, training=False):
        seq_len = tf.shape(inputs)[1]
        x = self.embedding(inputs)
        x += self.pos_encoding[:, :seq_len, :]
        for transformer_block in self.transformer_blocks:
            x = transformer_block(x, training=training)  # Pass training argument correctly
        output = self.dense(x)
        return output
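
To round this out, here is a minimal usage sketch of the class above; the hyperparameter values and compile settings are my own illustrative choices, not necessarily the course’s exact code.

# Illustrative hyperparameters (not necessarily the course's values).
embed_dim = 256
num_heads = 4
ff_dim = 512
num_layers = 2

model = TransformerModel(vocab_size, embed_dim, num_heads, ff_dim, num_layers, seq_length)

# The final Dense layer outputs raw logits over the vocabulary,
# so the loss is told to expect logits rather than probabilities.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
# model.fit(input_sequences, target_sequences, ...) would then train it on the
# input/target sequences generated earlier (code not shown in this post).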

However, I never gave a thorough explanation of positional encoding, so I’ll explain it later.

Friday, February 14th

In fact, I never gave a full explanation for everything involved in a transformer that deals with text input!

To clarify, here are all of the factors that come into play for a text transformer (a short sketch of how they fit together follows this list):

  • Tokens (an int): The ID of a piece of the text

  • Embeddings (a vector of floats): The meaning of the word, definition-wise

  • Positional Encoding: The order of the words in the sentence/text (which word comes before or after)

  • Self-Attention: The context the word is in (e.g. “financial bank” vs “river bank”)
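
Here is that short sketch: a hedged trace of a placeholder batch of token IDs through the model defined above, reusing my illustrative hyperparameters (seq_length=100 from the preprocessing code, embed_dim=256 from the usage sketch):

# Placeholder batch of token IDs; shapes are illustrative.
token_ids = tf.zeros((1, seq_length), dtype=tf.int32)   # tokens: (1, 100) integers

x = model.embedding(token_ids)                  # embeddings: (1, 100, 256) float vectors
x += model.pos_encoding[:, :seq_length, :]      # positional encoding: order information added in
x = model.transformer_blocks[0](x)              # self-attention: each token mixes in context
logits = model.dense(x)                         # (1, 100, vocab_size) scores for the next token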

What is Positional Encoding?

A sentence can mean two completely different things depending on the arrangement and order of its words; take “no that is good” vs “that is no good”.

Transformers see all of the words at once but have no built-in sense of order, so to capture the meaning carried by word order, positional encoding provides a relative position for each token in the input sequence (of words).

Positional encoding works by assigning figurative “coordinates” that represent a “position” for each token in an input sequence, and “moving” them in directional shifts in a high-dimensional space, akin to how a point would move on a graph (but imagine a more multidimensional space like 3D, 4D, 5D, etc.). These “movements” follow a specific, simple pattern in direction and degree (how far they move) and differ for each token depending on where it comes in the input sequence.

For example, in the sentence “it is no good”, the first word “it” might go right, the second word “is” up, the third word “no” down, and the fourth word “good” left.

It’s also worth mentioning that these “movements” are typically determined using sinusoidal functions (i.e. sine and cosine), which ensure that each position has a distinct but smoothly varying representation and help the model process sequences without needing recurrence (see the video at the bottom of this post for a more thorough explanation).

Here is the positional encoding code again (reformatted):

    def positional_encoding(self, seq_length, embed_dim):
        angle_rads = self.get_angles(
            # Column vector for positions: "[ [0], [1], [2]...]"
            np.arange(seq_length)[:, np.newaxis], 
            # Row vector for the embedding vector dimensions: "[ [0, 1, 2]...]"
            np.arange(embed_dim)[np.newaxis, :], 
            embed_dim
        )
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
        pos_encoding = angle_rads[np.newaxis, ...]
        return tf.cast(pos_encoding, dtype=tf.float32)

    def get_angles(self, pos, i, embed_dim):
        # The calculation was broken up into two lines for cleanliness
        angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(embed_dim))
        return pos * angle_rates
        # could also be written as:
        # return pos / np.power(10000, (2 * (i // 2)) / np.float32(embed_dim))

The get_angles function calculates the angle used for each (position, embedding dimension) pair with the formula:

angle(pos, i) = pos / 10000^((2 * (i // 2)) / d_model)

where pos is the token’s position in the sequence, i = np.arange(embed_dim)[np.newaxis, :] indexes the embedding dimensions, and d_model = embed_dim.

embed_dim stands for “embedding dimension”, and is the number of numerical values (or features) used in an embedding for each token in a model.
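
To make the formula concrete, here is a small standalone NumPy check of my own (with tiny, made-up sizes) of the same angle computation followed by the sine/cosine step:

import numpy as np

# Tiny, made-up sizes just to inspect the result.
seq_length, embed_dim = 4, 6

pos = np.arange(seq_length)[:, np.newaxis]    # positions, shape (4, 1)
i = np.arange(embed_dim)[np.newaxis, :]       # embedding-dimension indices, shape (1, 6)

angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(embed_dim))
angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])   # sine on the even dimensions
angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])   # cosine on the odd dimensions

print(angle_rads.shape)   # (4, 6): one distinct 6-dimensional "coordinate" per position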

Multi-Dimensional Madness: Deep Learning Loves These Visualizations

Confused by learning how there can be 3D, 10D, or even 100D representations in models?

It’s confusing to me too, but this dimensionality madness exists for a good reason: over the years, high-dimensional representations have proven effective as data scientists and AI engineers have searched for ways to organize, visualize, and compute over the complexities inside machine learning and deep learning models.

This is due in part to how complex life is in general: whether something is as simple as an image or as complex as a tense situation in court, situations, entities, and phenomena have many different components and aspects to them. Models may therefore need a high number of dimensions to analyze all of the important factors in a situation or entity with efficiency and precision.

This, however, comes with a catch: the Curse of Dimensionality. This is when there are too many dimensions for a given entity, which can lead to:

  • Overfitting

  • Higher computational/hardware requirements

  • Sparse data due to the volume of the data space growing exponentially

So “techniques like PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), and autoencoders are used to reduce the number of dimensions. This mitigates the effects of the curse to help simplify models and improve interpretability” (perplexity.ai, 2025).
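
For a taste of what that looks like in practice, here is a minimal scikit-learn sketch of PCA (my own example, assuming scikit-learn is available; not from the course):

# Reducing 50 features down to 10 principal components with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # 200 samples with 50 features each

pca = PCA(n_components=10)            # keep the 10 directions with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (200, 10)
print(pca.explained_variance_ratio_.sum())    # fraction of variance the 10 components retain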

Lessons Learned

  • I learned how to implement transformers in generating text.

  • For transformers to do this, I also learned how text vectorization turns raw text into tokens the model can use, and how self-attention then differentiates the contexts and meanings of those words/tokens!

  • I also delved into why topics in ML and DL talk so much about dimensionality and working with so many dimensions.

Resources

Course I followed:

Great video about positional encoding: