A Gentle Introduction to LLM Architectures - Encoder, Decoder, and Encoder-Decoder Models

Large Language Models (LLMs) have become a cornerstone of modern AI, powering applications ranging from chatbots and translators to code assistants and search engines. At the heart of most LLMs lies the Transformer architecture, introduced by Vaswani et al. in 2017 as an encoder-decoder model.

But not all LLMs are structured the same way. In this post, we’ll outline the three main types of Transformer architectures:

  • Encoder-only
  • Decoder-only
  • Encoder-Decoder

Understanding these structures helps explain why some models excel at classification, while others are better suited for generation or translation.

🔍 Encoder-only Models

Encoder-only models process input sequences to generate contextualized representations. These are typically used for understanding tasks, such as:

  • Text classification
  • Embedding generation (e.g., for retrieval)

Input-Output Structure

  • The entire input sequence is processed simultaneously.
  • Each token attends to all others (bidirectional attention).
  • Outputs are contextual embeddings for each token or the full sentence.
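
As a concrete illustration, here is a minimal sketch of pulling contextual embeddings out of an encoder-only model with the Hugging Face transformers library (assumed to be installed); bert-base-uncased and the mean-pooling step are just illustrative choices:

```python
# Minimal sketch: contextual embeddings from an encoder-only model.
# Assumes the Hugging Face `transformers` library and PyTorch are installed;
# "bert-base-uncased" is just an illustrative checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token: shape (batch, seq_len, hidden_size).
token_embeddings = outputs.last_hidden_state
# A simple (if crude) sentence embedding: mean-pool over the token embeddings.
sentence_embedding = token_embeddings.mean(dim=1)
print(token_embeddings.shape, sentence_embedding.shape)
```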

Examples

  • BERT (Bidirectional Encoder Representations from Transformers)
  • RoBERTa, DistilBERT, E5

Use Cases

  • Semantic search
  • Sentiment analysis
  • Question answering (retrieval-based)

✍️ Decoder-only Models

Decoder-only models are causal: they generate text one token at a time, attending only to past tokens (left-to-right). This makes them ideal for language modeling and text generation.

Input-Output Structure

  • Input is processed autoregressively (each token only sees previous tokens).
  • Only self-attention is used (no separate encoder).
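
A minimal sketch of this autoregressive loop, again using the transformers library (assumed installed); gpt2 and the prompt are just illustrative:

```python
# Minimal sketch: greedy decoding with a decoder-only (causal) model.
# Assumes `transformers` and PyTorch are installed; "gpt2" is illustrative.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The Transformer architecture was introduced in", return_tensors="pt")

# generate() appends one token at a time; the causal mask ensures each new
# token can only attend to the tokens that came before it.
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```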

Examples

  • GPT (Generative Pretrained Transformer) family
  • LLaMA, Mistral, Gemma

Use Cases

  • Text generation
  • Code synthesis
  • Autocomplete
  • Instruction-following (e.g., ChatGPT)

🔁 Encoder-Decoder Models

Encoder-decoder models are often used in sequence-to-sequence tasks, where input and output are both important but distinct. Think of translating a sentence from English to French: the input must be understood, and a new sequence must be generated.

Structure

  • The encoder processes the input sequence into a set of contextual embeddings.
  • The decoder generates the output sequence, attending to both previous output tokens (causal) and the encoder output (cross-attention).
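
A minimal sketch of this encode-then-decode flow with the transformers library (assumed installed); t5-small and the task prefix are illustrative:

```python
# Minimal sketch: sequence-to-sequence generation with an encoder-decoder model.
# Assumes `transformers` and PyTorch are installed; "t5-small" is illustrative.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The encoder reads the whole (prefixed) input bidirectionally; the decoder
# then generates the output token by token, cross-attending to the encoder states.
inputs = tokenizer("translate English to French: The house is wonderful.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```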

Examples

  • T5 (Text-To-Text Transfer Transformer)
  • BART
  • mT5, MarianMT

Use Cases

  • Machine translation
  • Summarization
  • Text rewriting
  • Instruction-based generation (with better controllability)

Choosing the right architecture depends on the task. For embedding-based retrieval, encoder-only models work best. For free-form generation, decoder-only models dominate. For structured input-output tasks like translation or summarization, encoder-decoder models shine.


Some Questions

1. Why does ChatGPT (and most LLMs like GPT-3/4, Claude, LLaMA, etc.) use decoder-only structures instead of encoder-decoder structures?

Encoder-decoder models like T5, BART, and mT5 work very well on QA tasks, especially in supervised settings. But they are less commonly used for interactive generation tasks like chat, because:

  • They impose a fixed split between an input to encode and an output to decode, which fits single-shot tasks (translate this, summarize that) better than a conversation that keeps growing turn by turn.
  • They aren't naturally autoregressive across an open-ended conversation, whereas a decoder-only model simply treats the whole dialogue as one token stream and keeps predicting the next token.

On top of that, the decoder-only recipe (a single uniform stack of blocks trained with plain next-token prediction) has proven very simple to scale, which is a large part of why GPT-style models dominate chat applications.

2. How are these models trained?

  • Encoder-only: masked language modeling (MLM) pretraining with bidirectional attention, then a task-specific head (e.g., a classifier) is added and fine-tuned on labeled data for downstream tasks.
  • Decoder-only: causal language modeling (CLM, i.e., autoregressive next-token prediction) pretraining, then instruction tuning on (instruction, output) pairs and often RLHF.
  • Encoder-Decoder: pretrained end-to-end on a denoising / span-corruption objective (e.g., T5's span corruption, BART's text infilling), then fine-tuned on labeled (input, output) pairs.
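
To make the first two objectives concrete, here is a minimal PyTorch sketch (not a full training pipeline) of how the labels are built for masked vs. causal language modeling; the token IDs, the 15% mask rate, and the [MASK] id are illustrative:

```python
# Minimal sketch: label construction for MLM vs. CLM (illustrative token IDs).
import torch

input_ids = torch.tensor([[101, 2023, 2003, 1037, 7953, 102]])  # hypothetical IDs

# Masked Language Modeling (encoder-only): hide ~15% of tokens and predict them.
mlm_inputs = input_ids.clone()
mlm_labels = input_ids.clone()
mask = torch.rand(input_ids.shape) < 0.15  # in practice, special tokens are excluded
mlm_inputs[mask] = 103                     # e.g. BERT's [MASK] token id
mlm_labels[~mask] = -100                   # -100 is ignored by PyTorch's cross-entropy

# Causal Language Modeling (decoder-only): predict the next token at every position,
# i.e. the labels are the inputs shifted one step to the left.
clm_inputs = input_ids[:, :-1]
clm_labels = input_ids[:, 1:]
```

The encoder-decoder objective follows the same pattern, except the decoder's targets are the corrupted-out spans during pretraining (or the full target sequence during fine-tuning) while it cross-attends to the encoder's output.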

We will dive into the structure and training of these models in more detail in the next post!