Transformer architecture diagram: a guide to how transformers work

A transformer architecture diagram shows how text, images, audio, or other input data becomes predictions inside a transformer model. The useful part is not the boxes alone. It is the flow: input tokens become vectors, vectors pass through attention and feed forward blocks, and the final linear layer turns the result into a probability distribution over possible outputs.

Transformers can look intimidating because the diagrams stack many ideas at once. Once you know what each arrow means, the transformer architecture becomes much easier to read.

Quick answer: how to read a transformer architecture diagram

A transformer architecture diagram is a visual map of how tokens move from raw text to output predictions. It shows how an input sequence is converted into vectors, refined through repeated layers, and turned into an output sequence or next token prediction.

Most diagrams show input embeddings, positional encoding, stacked encoder or decoder layers, attention mechanism blocks, feed forward layers, residual connections, layer normalization, and an output probability head. The arrows represent information flow through the model:

input sequence
embedding layer
positional encoding
repeated attention layer and feed forward network blocks
output logits
softmax probabilities

There are three common kinds of transformer diagrams. The full encoder-decoder architecture from the 2017 paper was designed for machine translation. Encoder-only diagrams, such as those for BERT (Bidirectional Encoder Representations from Transformers), are common for language understanding tasks. Decoder-only diagrams describe many modern large language models, including GPT-style systems that generate text by predicting the next token.

The sections below walk through each box in the diagram step by step, so you can reconstruct or understand almost any transformer architecture diagram you see.


What is a transformer architecture diagram?

A transformer architecture diagram is a schematic that shows the key components of a transformer network and the flow of vectors through it. It usually starts with input text or another kind of input data, shows how that data becomes numerical vectors, and follows those vectors through multiple layers of attention and feed forward computation.

The canonical reference is the “Attention Is All You Need” figure from Vaswani et al. (2017), the paper that introduced the original transformer model for machine translation between languages such as English and German. The paper marked a major shift in artificial intelligence and natural language processing because it removed recurrence and made attention the central operation. You can read the original paper on arXiv.

The original transformer architecture consists of an encoder and a decoder, each made up of multiple layers that use self-attention mechanisms and feed-forward neural networks. That structure allows parallel processing of input data, unlike recurrent models that read tokens one step at a time.

A typical diagram separates the model into blocks: input embeddings, encoder layers, decoder layers, attention mechanism sublayers, position-wise feed forward networks, and a final projection plus softmax. A good diagram also shows what changes at each step: sequence length, vector dimension, masking, residual paths, and layer normalization.

Modern diagrams for large language models often look simpler because many LLMs use only the decoder stack. The conceptual building blocks are still there: embeddings, positional information, masked self attention, multi head attention, feed forward computation, residual connections, normalization, and an output head.

Why transformers matter in deep learning and NLP

Transformers are now central to deep learning systems for machine translation, summarization, question answering, code generation, search, speech recognition, image recognition, audio generation, protein structure prediction, game playing, and multimodal reasoning. Their main strength is that they can model relationships across an entire sequence without processing each item strictly in order.

Before transformers, many natural language processing systems used recurrent neural networks, especially RNNs and LSTMs. These models were built for sequential data, but they had clear limits. RNNs process data sequentially, which slows training and makes it harder to capture long range dependencies across many tokens. Transformers were developed to address those limits, especially the difficulty of remembering relationships between distant parts of a sequence.

The transformer architecture processes input sequences in parallel, which significantly improves training efficiency compared with recurrent neural networks. Because self-attention relates every position to every other position in a single step, transformers also capture long-range dependencies more directly than RNNs, which must pass information along one token at a time. This combination is one reason transformers became useful at large scale.

By 2018, the introduction of transformer models was widely treated as a watershed moment in natural language processing. Models such as BERT used the transformer architecture for improved contextual understanding and complex language understanding. The BERT paper, officially titled “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” is available on arXiv.

Modern large language models, including GPT-style systems and Llama-family models, can be understood as very deep stacks of transformer block layers. For a broader conceptual introduction to how these models work in practice, you can read an accessible beginner's guide to the transformer architecture in deep learning. Understanding the diagram helps explain why these systems are powerful and why they are computationally heavy: they rely on large embeddings, many matrix multiplications, repeated attention operations, and multiple layers of neural networks. Those same architectural choices also explain common performance issues in production, which is why resources on fixing slow LLMs and throughput bottlenecks focus on optimizing attention, batching, and memory usage during inference.

The classic encoder–decoder transformer diagram explained

The classic transformer architecture diagram from “Attention Is All You Need” is organized into two towers. The transformer encoder is usually shown on the left, and the transformer decoder is shown on the right. This layout fits machine translation: the encoder reads a source sentence, and the decoder produces a target sentence.

In an English-to-German translation model, the encoder processes the entire sequence of English input tokens. The encoder consists of multiple identical encoder layers. Each encoder layer includes two main sub-layers: a multi-head self-attention mechanism and a feed-forward neural network. Each sub-layer is wrapped in a residual connection and followed by layer normalization, which improves stability and convergence during training.

The encoder’s primary function is to transform input tokens into a high-dimensional representation that the decoder can use to generate the output sequence. In other words, the final encoder layer contains contextualized vectors that capture the meaning of each token with respect to the entire input sequence.

The decoder works differently. It receives the previously generated target tokens, such as the German words produced so far. Each decoder layer contains a causally masked self-attention mechanism, a cross-attention mechanism, and a feed-forward network.

The first attention block in the decoder uses masked self-attention to prevent positions from attending to subsequent positions. This ensures that predictions for a particular position depend only on known outputs at previous positions. Without that mask, the model could “see” during training the answer it is supposed to predict.

The next block is the encoder-decoder attention mechanism, also called cross-attention. The decoder creates a query vector from its current state, while the key vectors and value vectors come from the encoder output. The decoder then uses attention weights to decide which source tokens are most relevant to the current output step. The value vectors of those source tokens are mixed into the decoder representation in proportion to their weights.
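The cross-attention step can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it uses a single head, and the identity matrices stand in for learned projections. The names `cross_attention`, `decoder_states`, and `encoder_output` are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Turn one row of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    q = matmul(decoder_states, w_q)  # queries: one per target position
    k = matmul(encoder_output, w_k)  # keys: one per source position
    v = matmul(encoder_output, w_v)  # values: one per source position
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]  # assumption: stand-in for learned projections
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens
decoder_states = [[2.0, 0.0], [0.0, 2.0]]  # 2 target tokens so far
mixed, weights = cross_attention(decoder_states, encoder_output,
                                 identity, identity, identity)
print(len(weights), len(weights[0]))  # 2 3
```

The weight matrix has one row per target position and one column per source position, which is why cross-attention boxes in a diagram take two incoming arrows from the encoder and one from the decoder.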

In short, the encoder turns the input tokens into contextualized representations, and the decoder generates the output sequence from those representations plus the tokens it has already produced. The encoder processes the whole input in parallel, and during training the decoder processes all target positions in parallel as well. At inference time the decoder still generates one token at a time.

From text to vectors: input tokens, embeddings, and positional information

The left-most part of a transformer architecture diagram explains how input text becomes numbers. A model cannot work directly with words, so raw text such as “Transformers changed AI.” is split into smaller units called tokens. These may be words, parts of words, punctuation marks, or byte-level pieces.

Tokenization maps each token to an integer ID from a fixed vocabulary learned before training. GPT-2, for example, used a vocabulary of more than 50,000 tokens. After tokenization, the sentence is no longer a string. It is a list of token IDs.
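As a rough sketch of this step, the example below uses a toy whitespace tokenizer and a tiny invented vocabulary. Real models use learned subword tokenizers, so `TOY_VOCAB`, `toy_tokenize`, and `encode` are illustrative assumptions only.

```python
# Toy vocabulary (hypothetical); real vocabularies hold tens of thousands of subword tokens.
TOY_VOCAB = {"<unk>": 0, "transformers": 1, "changed": 2, "ai": 3, ".": 4}

def toy_tokenize(text):
    """Lowercase the text, separate periods, and split on whitespace."""
    return text.lower().replace(".", " .").split()

def encode(text, vocab=TOY_VOCAB):
    """Map each token to its integer ID, using <unk> for tokens outside the vocabulary."""
    return [vocab.get(token, vocab["<unk>"]) for token in toy_tokenize(text)]

token_ids = encode("Transformers changed AI.")
print(token_ids)  # [1, 2, 3, 4]
```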

The embedding layer turns each token ID into a dense vector. These input embeddings may have dimensions such as 512, 768, 1024, or 4096, depending on the model. The vector is meant to capture semantic meaning: tokens used in similar contexts tend to be closer in embedding space.
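An embedding layer is essentially a lookup table with one row per vocabulary entry. The sketch below assumes a randomly initialized table and tiny sizes chosen for readability. In a trained model the rows are learned, and the helper names are hypothetical.

```python
import random

def make_embedding_table(vocab_size, d_model, seed=0):
    """Create one d_model-sized vector per vocabulary entry (random here, learned in practice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up the row of the table that belongs to each token ID."""
    return [table[token_id] for token_id in token_ids]

table = make_embedding_table(vocab_size=5, d_model=8)
x = embed([1, 2, 3, 4], table)
print(len(x), len(x[0]))  # 4 8  (sequence length, embedding dimension)
```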

At this point, the model knows what tokens are present, but it does not automatically know their order. Transformers utilize a positional encoding mechanism to provide information about the position of tokens in the input sequence, which is crucial because the architecture does not inherently capture sequential order like recurrent neural networks do.

A transformer diagram may show positional encoding as a separate block added to the token embeddings. The original transformer used sinusoidal positional encoding. Other models use learned positional embeddings. Many modern large language models use rotary position embeddings, often called RoPE, which apply position information inside the attention calculation rather than simply adding a vector.
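For the sinusoidal variant, the original paper defines even dimensions as sin(pos / 10000^(2i/d_model)) and odd dimensions as the matching cosine. A minimal plain-Python sketch of that formula, with hypothetical function names, could look like this:

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Build a (seq_len, d_model) table of sinusoidal position vectors."""
    table = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for even in range(0, d_model, 2):
            # 'even' plays the role of 2i in the paper's formula.
            angle = pos / (10000 ** (even / d_model))
            table[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                table[pos][even + 1] = math.cos(angle)
    return table

def add_positions(embeddings, positions):
    """Add the position vector to the token embedding at each position."""
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, positions)]

pe = sinusoidal_positions(seq_len=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Learned position embeddings would replace the formula with a second lookup table, and RoPE would instead rotate query and key vectors inside attention, so this sketch covers only the additive sinusoidal case.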

The result is usually drawn as a matrix with two dimensions:

Dimension	Meaning
Sequence length	Number of tokens in the input sequence
Embedding dimension	Size of each token vector

That matrix flows into the first encoder layer, decoder layer, or decoder-only transformer block for further processing.


Inside a transformer block: attention, feed forward, residuals, and norms

Every encoder layer or decoder layer expands into a transformer block with a repeated internal pattern. The exact layout varies, but the key components are attention, feed forward computation, residual connections, and layer normalization.

The self attention mechanism is the heart of the block, but it is not the whole model. In self-attention, each input token generates three vectors: query, key, and value. These are created by learned linear transformations of the token representation.

The query vector asks, “What am I looking for?” The key vectors say, “What information do I contain?” The value vectors carry the information that will be mixed into the output. The model compares queries with keys, converts those scores into attention weights, and uses those weights to combine values from other positions.

The self attention mechanism enables each token to look across the entire sequence and update its representation based on other relevant tokens. For example, in the sentence “The animal didn’t cross the street because it was tired,” self attention can help connect “it” with “animal” instead of “street.”
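The query, key, and value flow described above corresponds to scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. The following is a minimal single-head sketch in plain Python. The identity matrices stand in for learned projections, and the function names are illustrative assumptions.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Turn one row of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # Queries, keys, and values are all projections of the same sequence x.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    # Compare every query with every key, then normalize each row.
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    # Each output vector is a weighted mix of the value vectors.
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]  # assumption: stand-in for learned projections
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, 2 dimensions each
out, weights = self_attention(x, identity, identity, identity)
print(len(out), len(out[0]))  # 3 2
```

The output has the same shape as the input, which is what lets diagrams stack attention layers without reshaping between them.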

Multi-head attention extends self-attention by applying it several times in parallel, each time with different learned projections. This lets the model attend to different parts of the input sequence at once and capture different kinds of relationships among tokens, which improves its representational power.

In a diagram, this is often called multi head attention or multi head self attention. Each of the multiple attention heads has its own learned projections. One head may track syntax, another may track coreference, and another may focus on local phrase structure. The outputs of the attention heads are concatenated and passed through a linear layer.
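The split, attend, concatenate, and project pattern can be sketched as follows. This version makes one common implementation assumption: Q, K, and V are projected once and then sliced into equal per-head chunks. Identity matrices again stand in for learned weights, and all names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def split_heads(matrix, num_heads):
    """Slice the feature dimension into num_heads equal chunks."""
    d_head = len(matrix[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in matrix]
            for h in range(num_heads)]

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    q, k, v = (split_heads(matmul(x, w), num_heads) for w in (w_q, w_k, w_v))
    heads = [attention(qh, kh, vh) for qh, kh, vh in zip(q, k, v)]
    # Concatenate the head outputs position by position, then apply the output projection.
    concat = [[value for head in heads for value in head[t]] for t in range(len(x))]
    return matmul(concat, w_o)

d_model = 4
eye = [[1.0 if i == j else 0.0 for j in range(d_model)] for i in range(d_model)]
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
y = multi_head_attention(x, eye, eye, eye, eye, num_heads=2)
print(len(y), len(y[0]))  # 3 4
```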

After attention, the representation passes through a position-wise feed forward neural network. This is usually a two-layer MLP applied independently to each token position. A common pattern is to expand from d_model to 4 × d_model, apply an activation function, and project back to d_model.
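A minimal sketch of the position-wise pattern is shown below, assuming ReLU as the activation and hand-picked toy weights so the arithmetic is easy to follow. The sizes (d_model = 2, hidden size 8) keep the 4× expansion but are otherwise arbitrary.

```python
def linear(vec, weight, bias):
    """Compute vec @ weight + bias, with weight stored as d_in rows of d_out values."""
    return [sum(a * w for a, w in zip(vec, col)) + b
            for col, b in zip(zip(*weight), bias)]

def relu(vec):
    return [max(0.0, a) for a in vec]

def feed_forward(token, w1, b1, w2, b2):
    """Expand to the hidden size, apply the activation, and project back."""
    return linear(relu(linear(token, w1, b1)), w2, b2)

d_model, d_ff = 2, 8  # toy sizes with a 4x expansion
w1 = [[1.0] * d_ff, [-1.0] * d_ff]        # shape (d_model, d_ff)
b1 = [0.0] * d_ff
w2 = [[0.125, 0.0] for _ in range(d_ff)]  # shape (d_ff, d_model)
b2 = [0.0] * d_model

sequence = [[3.0, 1.0], [1.0, 3.0]]
# The same weights are applied to every position independently.
outputs = [feed_forward(token, w1, b1, w2, b2) for token in sequence]
print(outputs)  # [[2.0, 0.0], [0.0, 0.0]]
```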

Residual connections are the arrows that skip around sublayers. They add the input of a sublayer back to its output, which helps information and gradients flow through deep stacks. This design also helps reduce the vanishing gradient problem that made very deep neural networks harder to train.

Layer normalization stabilizes the values flowing through the model. Some diagrams show normalization after each sublayer, which matches the original post-norm layout. Many modern models use pre-norm, where normalization happens before attention or feed forward computation.
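The difference between post-norm and pre-norm, and the way the residual arrow adds the sublayer input back in, can be sketched as below. This simplified layer normalization omits the learned scale and shift parameters, and `double` is only a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((a - mean) ** 2 for a in vec) / len(vec)
    return [(a - mean) / math.sqrt(var + eps) for a in vec]

def post_norm(x, sublayer):
    """Original layout: add the residual, then normalize."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Common modern layout: normalize first, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def double(vec):
    """Stand-in sublayer used only for illustration."""
    return [2.0 * a for a in vec]

x = [1.0, 2.0, 3.0, 4.0]
post = post_norm(x, double)
pre = pre_norm(x, double)
```

In the pre-norm version the input `x` reaches the output unchanged through the residual path, which is one reason that layout is often described as easier to train in very deep stacks.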

Self-attention and masked self-attention in the diagram

Self-attention blocks in encoder layers are unmasked. That means every position can attend to every other position in the input sequence because the full source sentence is already known.

Masked self attention appears in decoder layers and decoder-only LLM diagrams. The mask blocks future positions by setting illegal attention scores to negative infinity before softmax. After softmax, those future positions receive zero attention probability.
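The masking step is small enough to show directly. In this sketch the raw scores are all zero so the effect of the mask is easy to see. The function names are illustrative.

```python
import math

def causal_mask(scores):
    """Replace scores for future positions (column > row) with negative infinity."""
    n = len(scores)
    return [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.0] * 3 for _ in range(3)]
weights = [softmax(row) for row in causal_mask(scores)]
print(weights[0])  # [1.0, 0.0, 0.0]
print(weights[1])  # [0.5, 0.5, 0.0]
```

Row i is the attention distribution for position i, so the zeros above the diagonal are the "shaded triangle" that many diagrams draw.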

Diagrams may show this with a “mask” label, a shaded triangle, or the phrase “masked multi-head attention.” The visual detail matters because masking changes what information can flow through the model.

Masking is what lets a transformer decoder generate text left to right. When predicting the next token, the model can condition on previous tokens, but it cannot inspect the correct answer in a future position.

Feed forward networks and activation functions

The feed forward network in each layer processes each token vector independently. It does not mix information across positions. Attention handles token-to-token communication; the feed forward block transforms each token’s representation after that communication has happened.

A typical example is a model with embedding size 768 and feed forward dimension 3072. That 4× expansion makes the feed forward block a major contributor to the parameter count.
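The arithmetic behind that claim can be checked quickly. The sketch below assumes two linear layers with biases in the feed forward block and four d_model × d_model projections with biases (query, key, value, output) in the attention block. Actual models may differ in these details.

```python
d_model, d_ff = 768, 3072

# Feed forward: expand (d_model -> d_ff) and project back (d_ff -> d_model), with biases.
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

# Attention: four square projections (Q, K, V, output), with biases.
attention_params = 4 * (d_model * d_model + d_model)

print(ffn_params)        # 4722432
print(attention_params)  # 2362368
print(round(ffn_params / attention_params, 2))  # 2.0
```

Under these assumptions the feed forward block holds roughly twice as many parameters as the attention block in the same layer.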

The original transformer used ReLU activation. Later large language models often use GELU, SwiGLU, or GEGLU. High-level diagrams often hide this detail and draw a single rectangle labeled “Feed Forward” or “Position-wise FFN.”

Encoder layers vs decoder layers: reading the structural differences

Encoder layers and decoder layers share multi-head self-attention, feed forward sublayers, residual connections, and layer normalization. The difference is that decoder layers add causal masking, and in the full encoder-decoder design, they also add cross-attention.

In encoder diagrams, the encoder stack is often shown vertically on the left. Each encoder layer contains self-attention followed by a feed forward neural network. There is no causal mask because the full input sequence is available from the start.

A decoder layer is usually drawn with three sublayers: masked self attention over the generated prefix, encoder-decoder attention over encoder outputs, and a feed forward block. This structure lets the transformer decoder use both the target tokens generated so far and the source sequence encoded by the encoder.

The phrase “both the encoder and decoder” can cause confusion because not every transformer model has both. The original transformer did. Many current systems do not.

Encoder-only models keep the encoder stack and remove the decoder. BERT-style transformers use bidirectional self-attention, so each token can attend to tokens on both sides. These models are useful for classification, named entity recognition, embeddings, retrieval, and other language processing tasks that require full context rather than left-to-right generation.

Decoder-only models remove the encoder and the encoder decoder attention mechanism. GPT-style large language models stack masked self-attention and feed forward blocks. Their primary function is next-token prediction: given previous tokens, predict the next token.

Diagram type	Main attention pattern	Common use
Encoder-only	Bidirectional self-attention	Classification, embeddings, search, understanding
Decoder-only	Causal masked self-attention	Text generation, chat, code generation
Encoder-decoder	Encoder self-attention plus decoder cross-attention	Translation, summarization, sequence-to-sequence tasks

Decoder-only LLM diagrams: how modern large language models are drawn

Most contemporary LLMs used in chatbots and code assistants are decoder-only transformers trained for next-token prediction on large text corpora. Their diagrams look simpler than the 2017 encoder-decoder figure because there is one tall stack instead of two towers.

A typical decoder-only transformer architecture diagram looks like this:

Tokenization splits the input text into tokens.
Token IDs become input embeddings.
Positional encoding, learned position embeddings, RoPE, or another method adds order information.
A stack of transformer blocks repeats masked self attention, MLP or feed forward computation, residual connections, and layer normalization.
The final hidden state passes through a final linear layer.
Softmax converts logits into a probability distribution over the vocabulary.
The model samples or selects the next token.

Masked self-attention is the only attention mechanism needed because the model learns the probability of each token given all previous tokens in the sequence. This is why decoder-only models can generate text, code, or other symbolic output step by step.

These models train on massive text corpora. During training, the model predicts the next token at each position. During generation, it feeds the chosen output token back into the context and repeats the process. At inference time, techniques such as continuous batching for LLM inference help keep GPUs busy while many of these next-token predictions are computed in parallel for different requests.
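The feed-back loop used during generation can be sketched as follows. The function `toy_logits` is a hypothetical stand-in for a full transformer forward pass: it simply favors the token ID that follows the last one. The loop uses greedy selection for simplicity.

```python
def toy_logits(context, vocab_size=5):
    """Hypothetical stand-in for a model: favors (last token + 1) modulo vocab_size."""
    target = (context[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

def generate(prompt_ids, max_new_tokens, eos_id=None):
    """Repeatedly predict the next token and append it to the context."""
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_logits(context)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy choice
        context.append(next_id)
        if next_id == eos_id:
            break
    return context

print(generate([1, 2], max_new_tokens=4))  # [1, 2, 3, 4, 0, 1]
```

A real system would replace `toy_logits` with the model's forward pass and usually cache keys and values from earlier steps rather than recomputing them.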

Modern decoder-only diagrams often label the output of the final layer as logits. Logits are raw scores over the vocabulary. Softmax turns those scores into probabilities, and decoding rules decide which token comes next. If you want to connect these diagram elements to real-world NLP systems, a beginner's guide to the transformer architecture shows how self-attention, embeddings, and positional encoding appear in deployed transformer models.
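A minimal sketch of the logits-to-token step is shown below, assuming a three-token vocabulary and two simple decoding rules, greedy selection and sampling. The temperature parameter is a common addition that is not shown in most diagrams.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; higher temperature flattens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Pick the most likely token ID."""
    return max(range(len(probs)), key=probs.__getitem__)

def sample(probs, rng):
    """Draw a token ID at random, in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(greedy(probs))  # 0
sampled = sample(probs, random.Random(0))
```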

Although decoder-only diagrams look cleaner, they still contain the same core concepts: attention mechanism, feed forward networks, residual connections, layer normalization, linear transformations, and stacked transformer blocks.


How to read any transformer architecture diagram (practical checklist)

Use the diagram as a map of information flow. The goal is to trace what enters, what transforms, what connects, and what comes out.

Ask these questions in order:

Where does the input sequence enter?
Where does tokenization happen, if it is shown?
Where are input embeddings created?
Where is positional encoding added or applied?
Is the diagram showing an encoder-only, decoder-only, or encoder-decoder architecture?
Which boxes are self attention, masked self attention, or cross-attention?
Where are the feed forward blocks?
Are residual connections drawn as arrows around sublayers?
Where is layer normalization placed?
Does the model produce an output sequence, a classification label, or the next token?
Where are logits converted into a probability distribution?
Are any details hidden, such as dropout, KV caching, activation functions, or tensor shapes?

A practical way to read any transformer architecture diagram is to trace one token vector. Start with a token in the input text. Follow it into the embedding layer, then through positional encoding, then through an attention layer, then through feed forward computation. Repeat this mentally through multiple layers until it reaches the output head.

For encoder-decoder diagrams, trace one source token through the encoder, then watch how the decoder attends to the final encoder layer through cross-attention. For decoder-only diagrams, trace one generated token and check that it can attend only to previous tokens.

Why transformer diagrams vary and what they hide or reveal

“Transformer architecture” describes a family of models, not one fixed layout. The original transformer model, BERT, GPT-style systems, T5, vision transformers, speech models, and multimodal models all use related ideas but arrange them differently.

Vision transformers adapt the transformer architecture for computer vision by breaking input images into patches and treating those patches as sequences of tokens. This has worked well for tasks such as image classification and object detection. The original Vision Transformer paper is available on arXiv.
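The patch step can be sketched on a tiny single-channel image. This example assumes the image height and width are multiples of the patch size and skips the learned linear projection that would follow. The function name is hypothetical.

```python
def image_to_patches(image, patch_size):
    """Cut a 2D image into non-overlapping square patches and flatten each one."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A 4x4 image whose pixel values are 0..15, row by row.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)
print(len(patches), patches[0])  # 4 [0, 1, 4, 5]
```

Each flattened patch then plays the role of a token. For instance, a 224 × 224 image cut into 16 × 16 patches yields 14 × 14 = 196 patch tokens.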

Transformers have also been applied beyond natural language processing, including audio generation, image recognition, protein structure prediction, and game playing. This broad use comes from the same basic pattern: turn input data into token-like vectors, process entire sequences with attention, and predict useful outputs.

Multimodal transformers, such as those used in models like GPT-5 and Gemini 3, can process multiple types of input data, including text, images, audio, and video, within a single architecture. This allows complex reasoning across different modalities, even though the high-level diagram may still look like stacked transformer blocks.

Simplified diagrams are helpful because they emphasize the main flow. They may show only embeddings, attention, feed forward blocks, and outputs. That makes them easier to read, especially for beginners.

Research and engineering diagrams are more detailed. They may show tensor shapes, attention masks, pre-layer normalization, post-layer normalization, dropout, KV caching, rotary embeddings, RMSNorm, SwiGLU, and parameter sharing. These details matter when you implement or debug a model, but they can overwhelm a first read.

The trade-off is simple: minimal diagrams clarify the path of information, while full diagrams reveal the exact computation. Once you understand embeddings, positional encoding, self-attention (and multi-head attention, which runs several attention heads in parallel), feed forward blocks, residual connections, and normalization, you can interpret most variations.

A transformer architecture diagram is useful when it answers one question clearly: how does information move from input to prediction? If you can follow that path, the boxes stop looking like a wall of jargon and start working as a practical guide.

