LeDL Week 6: Transformers, Part 2!
Week 17 of documenting my AI/ML learning journey (Feb 2 - Feb 8)
What was discussed last week…
I learned how a transformer roughly works and the three parts that make one: the encoder, the hidden layer, and the decoder.
I also learned how transformers compare with other architectures, such as RNNs and LSTMs, when it comes to handling sequential data.
Thursday, February 6th
I was mistaken last week; NOT ALL transformers have the two parts that I mentioned: an encoder and a decoder. With that said, here is an explanation of the decoder:
The Decoder (Transformer Decoder)
The decoder is similar to the encoder in the sense that both use FFNNs (feed-forward neural networks) and self-attention mechanisms; the main difference is that the decoder's self-attention is masked (causal), so each position can only attend to itself and earlier tokens. Decoders generate output sequences based on the encoder's output or on previously generated tokens, so they are used in tasks that require generating sequences, such as text generation or translation.
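Here's a minimal sketch of that masked self-attention in Keras (my own illustration, not from the course; it assumes TensorFlow 2.10+, where `MultiHeadAttention` supports `use_causal_mask`):

```python
import tensorflow as tf

# Illustrative shapes: a batch of 2 sequences, 5 positions, 64-dimensional states.
batch, seq_len, d_model = 2, 5, 64
decoder_inputs = tf.random.normal((batch, seq_len, d_model))

masked_self_attention = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)

# use_causal_mask=True means position i can only attend to positions 0..i,
# which is what lets a decoder generate output one token at a time.
out = masked_self_attention(
    query=decoder_inputs,
    value=decoder_inputs,
    use_causal_mask=True,
)
print(out.shape)  # (2, 5, 64)
```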
To clarify, a transformer doesn't need both an encoder and a decoder to function as a model. Some transformers use only encoders, some use only decoders, and some use both. Here are the use cases for each type of transformer:
| Type of model | Usage (what they’re best at) | Example models |
|---|---|---|
| Encoder | Analysis tasks, such as text classification and sentiment analysis. | BERT |
| Decoder | Generation; “autoregressive” text generation to be more specific. | GPT (as in ChatGPT) |
| Encoder-Decoder | Sequence-to-sequence tasks, which is when the model analyzes input data (encoder) and then generates output data based on the input data (decoder). | T5 (Text-to-Text Transfer Transformer) |
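As a quick, concrete illustration of the three families in the table (this assumes the Hugging Face `transformers` library, which isn't part of the course; the model names are just well-known examples of each type):

```python
# pip install transformers
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only: analysis/understanding tasks (classification, sentiment, etc.)
bert = AutoModel.from_pretrained("bert-base-uncased")

# Decoder-only: autoregressive text generation
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder: sequence-to-sequence tasks such as translation or summarization
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```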
Encoder-Decoder transformers are special…

Encoder-Decoder transformer models also include a cross-attention mechanism in the decoder, which attends to the encoder's output. A cross-attention mechanism operates on two different sequences, the source sequence (the encoder's output) and the target sequence (the decoder's input), to find relationships between elements of the two. As a result, the model's output stays contextually accurate and relevant to the input.
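Here's a minimal sketch of cross-attention in Keras (my own illustration; the random tensors stand in for real encoder and decoder states):

```python
import tensorflow as tf

batch, src_len, tgt_len, d_model = 2, 7, 5, 64
encoder_output = tf.random.normal((batch, src_len, d_model))  # source sequence
decoder_states = tf.random.normal((batch, tgt_len, d_model))  # target sequence

cross_attention = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)

# Queries come from the decoder; keys and values come from the encoder,
# so every target position can pull in information from the source sequence.
out = cross_attention(query=decoder_states, value=encoder_output, key=encoder_output)
print(out.shape)  # (2, 5, 64): one context-aware vector per target position
```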
What is a “token”?
You might’ve heard this term if you’ve been around a lot of transformers (e.g. ChatGPT).
When transformer models deal with text data, they look at the text in units called “tokens”: chunks of text (anything from a single character to a whole word) that are mapped to integer IDs.
For example, a “token” in ChatGPT’s 4o mini model could represent a word, while a “token” in Microsoft’s Copilot model could represent a few characters, regardless of whether it forms a word or not. (I’m extrapolating; these token measurements may not be true)
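To show the idea, here's a toy word-level tokenizer I made up for illustration (real models use subword schemes such as BPE, and their vocabularies look nothing like this):

```python
# Toy vocabulary: every known word maps to an integer ID.
vocab = {"<unk>": 0, "transformers": 1, "are": 2, "very": 3, "cool": 4}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and look each word up in the vocabulary."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("Transformers are very cool"))   # [1, 2, 3, 4]
print(tokenize("Transformers are very weird"))  # [1, 2, 3, 0] (unknown word)
```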
Friday, February 7th
ViTs!
What is a ViT? ViT stands for “Vision Transformer”, a type of transformer applied to computer vision. A ViT treats an image as if it were a sequence (so that the transformer can process it) by dividing it into patches.
To format images in a way that ViTs can understand, a PatchEmbedding layer has to be implemented first:
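Here's a minimal sketch of such a layer in Keras (my own reconstruction; the course's PatchEmbedding may differ in details such as parameter names and defaults):

```python
import tensorflow as tf

class PatchEmbedding(tf.keras.layers.Layer):
    """Splits an image into patches, linearly projects each patch,
    and adds a learned position embedding (illustrative defaults)."""
    def __init__(self, patch_size=16, embed_dim=64, num_patches=196, **kwargs):
        super().__init__(**kwargs)
        self.patch_size = patch_size
        self.num_patches = num_patches
        self.projection = tf.keras.layers.Dense(embed_dim)
        self.position_embedding = tf.keras.layers.Embedding(num_patches, embed_dim)

    def call(self, images):
        batch_size = tf.shape(images)[0]
        # Cut each image into non-overlapping patch_size x patch_size patches.
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        # Flatten the grid of patches into a sequence of flattened patches.
        patches = tf.reshape(patches, [batch_size, -1, patches.shape[-1]])
        # Project each patch and add its position embedding.
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return self.projection(patches) + self.position_embedding(positions)

# A 224x224 RGB image with 16x16 patches gives 14 * 14 = 196 patches.
dummy_images = tf.random.normal((1, 224, 224, 3))
print(PatchEmbedding()(dummy_images).shape)  # (1, 196, 64)
```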

Transformers can also be used on time-series data: a series of data points collected or recorded at successive points in time. A common example is tracking stock market prices, which can be recorded at set intervals or in real time. A transformer's ability to capture long-term dependencies (a large memory for context) proves useful when analyzing this kind of sequential data, and transformers can also handle variable-length sequences and missing data.

The general format of a transformer for time-series prediction
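Here's a minimal sketch of that general format in Keras (my own illustration; the layer sizes, the single encoder-style block, and the pooling-plus-Dense head are assumptions, not the course's exact model):

```python
import tensorflow as tf

# Illustrative setup: a batch of 8 series, 30 time steps each, 1 feature (e.g. a price).
batch, seq_len, d_model, num_heads = 8, 30, 64, 4
series = tf.random.normal((batch, seq_len, 1))

# 1) Project each time step into the model dimension and add position information.
x = tf.keras.layers.Dense(d_model)(series)
positions = tf.range(seq_len)
x = x + tf.keras.layers.Embedding(seq_len, d_model)(positions)

# 2) One encoder-style block: self-attention + feed-forward, each with a residual.
attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)(x, x)
x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + attn)
ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(d_model * 4, activation="relu"),
    tf.keras.layers.Dense(d_model),
])
x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + ffn(x))

# 3) Pool over the time axis and predict the next value of each series.
pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
prediction = tf.keras.layers.Dense(1)(pooled)
print(prediction.shape)  # (8, 1): one predicted next value per series
```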
Lessons Learned
I learned that transformers can be very diverse in their structure and usage; no wonder they’re applicable in many different fields!
Transformer implementations, no matter their purpose, are generally built around a TransformerBlock-style class with a MultiHeadAttention mechanism (or another self-attention mechanism) and an ffn (feed-forward neural network); see the sketch after this list.
There are many processes that transformers use to convert different kinds of data into data they can understand, e.g. tokenization and patch embedding.
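For reference, here's a minimal sketch of that kind of TransformerBlock in Keras (modelled on the common Keras-example pattern; the dropout rate and layer sizes are illustrative, not the course's exact values):

```python
import tensorflow as tf

class TransformerBlock(tf.keras.layers.Layer):
    """Self-attention + feed-forward network, each wrapped with a
    residual connection and layer normalization."""
    def __init__(self, embed_dim=64, num_heads=4, ff_dim=128, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.att = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(ff_dim, activation="relu"),
            tf.keras.layers.Dense(embed_dim),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, inputs, training=False):
        # Self-attention sub-layer with a residual connection.
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        # Feed-forward sub-layer with a residual connection.
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

block = TransformerBlock()
print(block(tf.random.normal((2, 10, 64))).shape)  # (2, 10, 64)
```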
Resources
Course I followed: