LeDL Week 6: Transformers, Part 2!
Week 17 of documenting my AI/ML learning journey (Feb 2 - Feb 8)
What was discussed last week…
I learned how a transformer roughly works and the three parts that make one: the encoder, the hidden layer, and the decoder.
I also learned how transformers compare with other architectures, such as RNNs and LSTMs, when it comes to handling sequential data.
Thursday, February 6th
I was mistaken last week; NOT ALL transformers have the two parts that I mentioned: an encoder and a decoder. With that said, here is an explanation of the decoder:
The Decoder (Transformer Decoder)
The decoder is similar to the encoder in the sense that both use FFNNs (feed-forward neural networks) and self-attention mechanisms; the main difference is that the decoder's self-attention is masked (causal), so each position can only attend to itself and earlier tokens. Decoders generate output sequences based on the encoder's output or on previously generated tokens, so they are used in tasks that require generating sequences, such as text generation or translation.
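Here's a minimal sketch of that masked self-attention in Keras (my own illustration, not from the course; it assumes TensorFlow 2.10+, where `MultiHeadAttention` supports `use_causal_mask`):

```python
import tensorflow as tf

# Illustrative shapes: a batch of 2 sequences, 5 positions, 64-dimensional states.
batch, seq_len, d_model = 2, 5, 64
decoder_inputs = tf.random.normal((batch, seq_len, d_model))

masked_self_attention = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)

# use_causal_mask=True means position i can only attend to positions 0..i,
# which is what lets a decoder generate output one token at a time.
out = masked_self_attention(
    query=decoder_inputs,
    value=decoder_inputs,
    use_causal_mask=True,
)
print(out.shape)  # (2, 5, 64)
```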
To clarify, a transformer doesn't need both an encoder and a decoder to function as a model. Some transformers use only encoders, some use only decoders, and some use both. Here are the use cases for each type of transformer:
| Type of model | Usage (what they’re best at) | Example models |
|---|---|---|
| Encoder | Analysis tasks, such as text classification and sentiment analysis. | BERT |
| Decoder | Generation; “autoregressive” text generation to be more specific. | GPT (as in ChatGPT) |
| Encoder-Decoder | Sequence-to-sequence tasks, which is when the model analyzes input data (encoder) and then generates output data based on the input data (decoder). | T5 (Text-to-Text Transfer Transformer) |
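As a quick, concrete illustration of the three families in the table (this assumes the Hugging Face `transformers` library, which isn't part of the course; the model names are just well-known examples of each type):

```python
# pip install transformers
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only: analysis/understanding tasks (classification, sentiment, etc.)
bert = AutoModel.from_pretrained("bert-base-uncased")

# Decoder-only: autoregressive text generation
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder: sequence-to-sequence tasks such as translation or summarization
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```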
Encoder-Decoder transformers are special…

Encoder-Decoder transformer models also include a cross-attention mechanism in the decoder, which attends to the encoder's output. A cross-attention mechanism operates on two different sequences, the source sequence (the encoder's output) and the target sequence (the decoder's input), to find relationships between elements of the two. As a result, the model's output stays contextually accurate and relevant to the input.
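Here's a minimal sketch of cross-attention in Keras (my own illustration; the random tensors stand in for real encoder and decoder states):

```python
import tensorflow as tf

batch, src_len, tgt_len, d_model = 2, 7, 5, 64
encoder_output = tf.random.normal((batch, src_len, d_model))  # source sequence
decoder_states = tf.random.normal((batch, tgt_len, d_model))  # target sequence

cross_attention = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)

# Queries come from the decoder; keys and values come from the encoder,
# so every target position can pull in information from the source sequence.
out = cross_attention(query=decoder_states, value=encoder_output, key=encoder_output)
print(out.shape)  # (2, 5, 64): one context-aware vector per target position
```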
What is a “token”?
You might’ve heard this term if you’ve been around a lot of transformers (e.g. ChatGPT).
When transformer models deal with text data, they look at the text in units called “tokens”: chunks of text (anything from a single character to a whole word) that are mapped to integer IDs.
For example, a “token” in ChatGPT’s 4o mini model could represent a word, while a “token” in Microsoft’s Copilot model could represent a few characters, regardless of whether it forms a word or not. (I’m extrapolating; these token measurements may not be true)
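To show the idea, here's a toy word-level tokenizer I made up for illustration (real models use subword schemes such as BPE, and their vocabularies look nothing like this):

```python
# Toy vocabulary: every known word maps to an integer ID.
vocab = {"<unk>": 0, "transformers": 1, "are": 2, "very": 3, "cool": 4}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and look each word up in the vocabulary."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("Transformers are very cool"))   # [1, 2, 3, 4]
print(tokenize("Transformers are very weird"))  # [1, 2, 3, 0] (unknown word)
```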
Friday, February 7th
ViTs!
What is a ViT? ViT stands for “Vision Transformer”, a type of transformer applied to computer vision. A ViT treats an image as if it were a sequence (so that the transformer can process it) by dividing it into patches.
To format images in a way that ViTs can understand, a PatchEmbedding layer has to be implemented first:
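Here's a minimal sketch of such a layer in Keras (my own reconstruction; the course's PatchEmbedding may differ in details such as parameter names and defaults):

```python
import tensorflow as tf

class PatchEmbedding(tf.keras.layers.Layer):
    """Splits an image into patches, linearly projects each patch,
    and adds a learned position embedding (illustrative defaults)."""
    def __init__(self, patch_size=16, embed_dim=64, num_patches=196, **kwargs):
        super().__init__(**kwargs)
        self.patch_size = patch_size
        self.num_patches = num_patches
        self.projection = tf.keras.layers.Dense(embed_dim)
        self.position_embedding = tf.keras.layers.Embedding(num_patches, embed_dim)

    def call(self, images):
        batch_size = tf.shape(images)[0]
        # Cut each image into non-overlapping patch_size x patch_size patches.
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        # Flatten the grid of patches into a sequence of flattened patches.
        patches = tf.reshape(patches, [batch_size, -1, patches.shape[-1]])
        # Project each patch and add its position embedding.
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return self.projection(patches) + self.position_embedding(positions)

# A 224x224 RGB image with 16x16 patches gives 14 * 14 = 196 patches.
dummy_images = tf.random.normal((1, 224, 224, 3))
print(PatchEmbedding()(dummy_images).shape)  # (1, 196, 64)
```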

Transformers can also be used on time-series data: a series of data points collected or recorded at successive points in time. A common example is tracking stock market prices, which can be recorded at set intervals or in real time. A transformer's ability to capture long-term dependencies (a large memory for context) proves useful when analyzing this kind of sequential data, and transformers can also handle variable-length sequences and missing data.

The general format of a transformer for time-series prediction
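Here's a minimal sketch of that general format in Keras (my own illustration; the layer sizes, the single encoder-style block, and the pooling-plus-Dense head are assumptions, not the course's exact model):

```python
import tensorflow as tf

# Illustrative setup: a batch of 8 series, 30 time steps each, 1 feature (e.g. a price).
batch, seq_len, d_model, num_heads = 8, 30, 64, 4
series = tf.random.normal((batch, seq_len, 1))

# 1) Project each time step into the model dimension and add position information.
x = tf.keras.layers.Dense(d_model)(series)
positions = tf.range(seq_len)
x = x + tf.keras.layers.Embedding(seq_len, d_model)(positions)

# 2) One encoder-style block: self-attention + feed-forward, each with a residual.
attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)(x, x)
x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + attn)
ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(d_model * 4, activation="relu"),
    tf.keras.layers.Dense(d_model),
])
x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + ffn(x))

# 3) Pool over the time axis and predict the next value of each series.
pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
prediction = tf.keras.layers.Dense(1)(pooled)
print(prediction.shape)  # (8, 1): one predicted next value per series
```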
Lessons Learned
I learned that transformers can be very diverse in their structure and usage; no wonder they’re applicable in many different fields!
Transformer implementations, no matter their purpose, are generally built around a TransformerBlock-style class with a MultiHeadAttention mechanism (or another self-attention mechanism) and an ffn (feed-forward neural network); see the sketch after this list.
There are many processes that transformers use to convert different kinds of data into data they can understand, e.g. tokenization and patch embedding.
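For reference, here's a minimal sketch of that kind of TransformerBlock in Keras (modelled on the common Keras-example pattern; the dropout rate and layer sizes are illustrative, not the course's exact values):

```python
import tensorflow as tf

class TransformerBlock(tf.keras.layers.Layer):
    """Self-attention + feed-forward network, each wrapped with a
    residual connection and layer normalization."""
    def __init__(self, embed_dim=64, num_heads=4, ff_dim=128, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.att = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation="relu"),
            tf.keras.layers.Dense(embed_dim),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, inputs, training=False):
        # Self-attention sub-layer with a residual connection.
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        # Feed-forward sub-layer with a residual connection.
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

block = TransformerBlock()
print(block(tf.random.normal((2, 10, 64))).shape)  # (2, 10, 64)
```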
Resources
Course I followed: