A Gentle Introduction to LLM Architectures - Encoder, Decoder, and Encoder-Decoder Models

Large Language Models (LLMs) have become a cornerstone of modern AI, powering applications ranging from chatbots and translators to code assistants and search engines. At the heart of most LLMs lies the Transformer architecture, introduced by Vaswani et al. in 2017 as an encoder-decoder model.

But not all LLMs are structured the same way. In this post, we’ll outline the three main types of Transformer architectures:

  • Encoder-only
  • Decoder-only
  • Encoder-Decoder

Understanding these structures helps explain why some models excel at classification, while others are better suited for generation or translation.

🔍 Encoder-only Models

Encoder-only models process input sequences to generate contextualized representations. These are typically used for understanding tasks, such as:

  • Text classification
  • Embedding generation (e.g., for retrieval)

Input-Output Structure

  • The entire input sequence is processed simultaneously.
  • Each token attends to all others (bidirectional attention).
  • Outputs are contextual embeddings for each token or the full sentence.
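
As a concrete illustration, here is a minimal sketch of pulling contextual embeddings out of an encoder-only model with the Hugging Face transformers library (assumed to be installed); bert-base-uncased and the mean-pooling step are just illustrative choices:

```python
# Minimal sketch: contextual embeddings from an encoder-only model.
# Assumes the Hugging Face `transformers` library and PyTorch are installed;
# "bert-base-uncased" is just an illustrative checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token: shape (batch, seq_len, hidden_size).
token_embeddings = outputs.last_hidden_state
# A simple (if crude) sentence embedding: mean-pool over the token embeddings.
sentence_embedding = token_embeddings.mean(dim=1)
print(token_embeddings.shape, sentence_embedding.shape)
```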

Examples

  • BERT (Bidirectional Encoder Representations from Transformers)
  • RoBERTa, DistilBERT, E5

Use Cases

  • Semantic search
  • Sentiment analysis
  • Question answering (retrieval-based)

✍️ Decoder-only Models

Decoder-only models are causal: they generate text one token at a time, attending only to past tokens (left-to-right). This makes them ideal for language modeling and text generation.

Input-Output Structure

  • Input is processed autoregressively (each token only sees previous tokens).
  • Only self-attention is used (no separate encoder).
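
A minimal sketch of this autoregressive loop, again using the transformers library (assumed installed); gpt2 and the prompt are just illustrative:

```python
# Minimal sketch: greedy decoding with a decoder-only (causal) model.
# Assumes `transformers` and PyTorch are installed; "gpt2" is illustrative.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The Transformer architecture was introduced in", return_tensors="pt")

# generate() appends one token at a time; the causal mask ensures each new
# token can only attend to the tokens that came before it.
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```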

Examples

  • GPT (Generative Pretrained Transformer) family
  • LLaMA, Mistral, Gemma

Use Cases

  • Text generation
  • Code synthesis
  • Autocomplete
  • Instruction-following (e.g., ChatGPT)

🔁 Encoder-Decoder Models

Encoder-decoder models are often used in sequence-to-sequence tasks, where input and output are both important but distinct. Think of translating a sentence from English to French: the input must be understood, and a new sequence must be generated.

Structure

  • The encoder processes the input sequence into a set of contextual embeddings.
  • The decoder generates the output sequence, attending to both previous output tokens (causal) and the encoder output (cross-attention).
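
A minimal sketch of this encode-then-decode flow with the transformers library (assumed installed); t5-small and the task prefix are illustrative:

```python
# Minimal sketch: sequence-to-sequence generation with an encoder-decoder model.
# Assumes `transformers` and PyTorch are installed; "t5-small" is illustrative.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The encoder reads the whole (prefixed) input bidirectionally; the decoder
# then generates the output token by token, cross-attending to the encoder states.
inputs = tokenizer("translate English to French: The house is wonderful.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```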

Examples

  • T5 (Text-To-Text Transfer Transformer)
  • BART
  • mT5, MarianMT

Use Cases

  • Machine translation
  • Summarization
  • Text rewriting
  • Instruction-based generation (with better controllability)

Choosing the right architecture depends on the task. For embedding-based retrieval, encoder-only models work best. For free-form generation, decoder-only models dominate. For structured input-output tasks like translation or summarization, encoder-decoder models shine.


Some Questions

1. Why does ChatGPT (and most LLMs like GPT-3/4, Claude, LLaMA, etc.) use decoder-only structures instead of encoder-decoder structures?

Encoder-decoder models like T5, BART, and mT5 work very well on QA tasks, especially in supervised settings. But they are less commonly used for interactive generation tasks like chat, because:

  • They impose a fixed split between an input to encode and an output to decode, which fits single-shot tasks (translate this, summarize that) better than a conversation that keeps growing turn by turn.
  • They aren't naturally autoregressive across an open-ended conversation, whereas a decoder-only model simply treats the whole dialogue as one token stream and keeps predicting the next token.

On top of that, the decoder-only recipe (a single uniform stack of blocks trained with plain next-token prediction) has proven very simple to scale, which is a large part of why GPT-style models dominate chat applications.

2. How are these models trained?

  • Encoder-only: masked language modeling (MLM) pretraining with bidirectional attention, then a task-specific head (e.g., a classifier) is added and fine-tuned on labeled data for downstream tasks.
  • Decoder-only: causal language modeling (CLM, i.e., autoregressive next-token prediction) pretraining, then instruction tuning on (instruction, output) pairs and often RLHF.
  • Encoder-Decoder: pretrained end-to-end on a denoising / span-corruption objective (e.g., T5's span corruption, BART's text infilling), then fine-tuned on labeled (input, output) pairs.
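
To make the first two objectives concrete, here is a minimal PyTorch sketch (not a full training pipeline) of how the labels are built for masked vs. causal language modeling; the token IDs, the 15% mask rate, and the [MASK] id are illustrative:

```python
# Minimal sketch: label construction for MLM vs. CLM (illustrative token IDs).
import torch

input_ids = torch.tensor([[101, 2023, 2003, 1037, 7953, 102]])  # hypothetical IDs

# Masked Language Modeling (encoder-only): hide ~15% of tokens and predict them.
mlm_inputs = input_ids.clone()
mlm_labels = input_ids.clone()
mask = torch.rand(input_ids.shape) < 0.15  # in practice, special tokens are excluded
mlm_inputs[mask] = 103                     # e.g. BERT's [MASK] token id
mlm_labels[~mask] = -100                   # -100 is ignored by PyTorch's cross-entropy

# Causal Language Modeling (decoder-only): predict the next token at every position,
# i.e. the labels are the inputs shifted one step to the left.
clm_inputs = input_ids[:, :-1]
clm_labels = input_ids[:, 1:]
```

The encoder-decoder objective follows the same pattern, except the decoder's targets are the corrupted-out spans during pretraining (or the full target sequence during fine-tuning) while it cross-attends to the encoder's output.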

We will dive into the structure and training of these models in more detail in the next post!