
A transformer architecture diagram shows how text, images, audio, or other input data becomes predictions inside a transformer model. The useful part is not the boxes alone. It is the flow: input tokens become vectors, vectors pass through attention and feed forward blocks, and the final linear layer turns the result into a probability distribution over possible outputs.
Transformers can look intimidating because the diagrams stack many ideas at once. Once you know what each arrow means, the transformer architecture becomes much easier to read.
A transformer architecture diagram is a visual map of how tokens move from raw text to output predictions. It shows how an input sequence is converted into vectors, refined through repeated layers, and turned into an output sequence or next token prediction.
Most diagrams show input embeddings, positional encoding, stacked encoder or decoder layers, attention mechanism blocks, feed forward layers, residual connections, layer normalization, and an output probability head. The arrows represent information flow through the model.
There are three common kinds of transformer diagrams. The full encoder-decoder architecture from the 2017 paper was designed for machine translation. Encoder-only diagrams, such as those for BERT (Bidirectional Encoder Representations from Transformers), are common for language understanding tasks. Decoder-only diagrams describe many modern large language models, including GPT-style systems that generate text by predicting the next token.
The sections below walk through each box in the diagram step by step, so you can reconstruct or understand almost any transformer architecture diagram you see.

A transformer architecture diagram is a schematic that shows the key components of a transformer network and the flow of vectors through it. It usually starts with input text or another kind of input data, shows how that data becomes numerical vectors, and follows those vectors through multiple layers of attention and feed forward computation.
The canonical reference is the “Attention Is All You Need” figure from Vaswani et al. (2017), the paper that introduced the original transformer model for machine translation between languages such as English and German. The paper marked a major shift in artificial intelligence and natural language processing because it removed recurrence and made attention the central operation. You can read the original paper on arXiv.
The original transformer architecture consists of an encoder and a decoder, each made up of multiple layers that use self-attention mechanisms and feed-forward neural networks. That structure allows parallel processing of input data, unlike recurrent models that read tokens one step at a time.
A typical diagram separates the model into blocks: input embeddings, encoder layers, decoder layers, attention mechanism sublayers, position-wise feed forward networks, and a final projection plus softmax. A good diagram also shows what changes at each step: sequence length, vector dimension, masking, residual paths, and layer normalization.
Modern diagrams for large language models often look simpler because many LLMs use only the decoder stack. The conceptual building blocks are still there: embeddings, positional information, masked self attention, multi head attention, feed forward computation, residual connections, normalization, and an output head.
Transformers are now central to deep learning systems for machine translation, summarization, question answering, code generation, search, speech recognition, image recognition, audio generation, protein structure prediction, game playing, and multimodal reasoning. Their main strength is that they can model relationships across an entire sequence without processing each item strictly in order.
Before transformers, many natural language processing systems used recurrent neural networks, especially RNNs and LSTMs. These models were built for sequential data, but they had clear limits. RNNs process data sequentially, which slows training and makes it harder to capture long range dependencies across many tokens. Transformers were developed to address those limits, especially the difficulty of remembering relationships between distant parts of a sequence.
The transformer architecture processes all positions of an input sequence in parallel during training, which uses modern hardware far better than a recurrent network that must finish one step before starting the next. Self-attention also connects any two positions directly, so a long-range dependency does not have to survive a long chain of recurrent updates. This is one reason transformers became useful at large scale.
By 2018, the introduction of transformer models was widely treated as a watershed moment in natural language processing. Models such as BERT used the transformer encoder to build contextual representations for language understanding tasks. The BERT paper, officially titled “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” is available on arXiv.
Modern large language models, including GPT-style systems and Llama-family models, can be understood as very deep stacks of transformer block layers. For a broader conceptual introduction to how these models work in practice, you can read an accessible beginner's guide to the transformer architecture in deep learning. Understanding the diagram helps explain why these systems are powerful and why they are computationally heavy: they rely on large embeddings, many matrix multiplications, repeated attention operations, and multiple layers of neural networks. Those same architectural choices also explain common performance issues in production, which is why resources on fixing slow LLMs and throughput bottlenecks focus on optimizing attention, batching, and memory usage during inference.
The classic transformer architecture diagram from “Attention Is All You Need” is organized into two towers. The transformer encoder is usually shown on the left, and the transformer decoder is shown on the right. This layout fits machine translation: the encoder reads a source sentence, and the decoder produces a target sentence.
In an English-to-German translation model, the encoder processes the entire sequence of English input tokens. The encoder consists of multiple identical encoder layers. Each layer has two main sub-layers: a multi-head self-attention mechanism and a feed-forward neural network. Each sub-layer is wrapped in a residual connection and followed by layer normalization, which improves stability and convergence during training.
The encoder’s primary function is to transform input tokens into a high-dimensional representation that the decoder can use to generate the output sequence. In other words, the final encoder layer outputs contextualized vectors that capture the meaning of each token with respect to the entire input sequence.
The decoder works differently. It receives the previously generated target tokens, such as the German words produced so far. Each decoder layer contains a causally masked self-attention mechanism, a cross-attention mechanism, and a feed forward network.
The first attention block in each decoder layer is masked so that a position cannot attend to later positions. Predictions for a given position therefore depend only on known outputs at earlier positions. Without that mask, the model could “see” during training the answer it is supposed to predict.
The next block is the encoder decoder attention mechanism, also called cross-attention. The decoder creates a query vector from its current state, while the key vectors and value vectors come from the encoder output. The decoder then uses attention weights to decide which source tokens are most relevant to the current output step. The corresponding value is mixed into the decoder representation.
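The cross-attention step described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the tensors are hypothetical toy values, the learned query, key, and value projections are assumed to have been applied already, and the scaling by the square root of the key dimension follows the original paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_queries, encoder_keys, encoder_values):
    """Mix encoder values into each decoder position, weighted by query-key match."""
    d_k = len(encoder_keys[0])
    outputs, all_weights = [], []
    for q in decoder_queries:
        # One score per source token: scaled dot product of query and key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in encoder_keys]
        weights = softmax(scores)
        # Weighted sum of the encoder value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, encoder_values))
                 for j in range(len(encoder_values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy tensors: 3 source tokens from the encoder, 2 decoder positions.
encoder_keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
encoder_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
decoder_queries = [[4.0, 0.0], [0.0, 4.0]]

outputs, weights = cross_attention(decoder_queries, encoder_keys, encoder_values)
print(len(weights), len(weights[0]))  # target length, source length
```

The weight matrix has one row per decoder position and one column per source token, which is why cross-attention is often drawn as an arrow from the encoder output into the middle of each decoder layer.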
The encoder-decoder architecture stacks multiple layers on each side: the encoder turns input tokens into contextualized representations, and the decoder generates the output sequence from those representations and the previously generated tokens. The encoder processes the whole input sequence in parallel, and during training the decoder can too, because masking hides future target tokens. At inference time the decoder still generates one token at a time.
The left-most part of a transformer architecture diagram explains how input text becomes numbers. A model cannot work directly with words, so raw text such as “Transformers changed AI.” is split into smaller units called tokens. These may be words, parts of words, punctuation marks, or byte-level pieces.
Tokenization maps each token to an integer ID from a fixed vocabulary learned before training. GPT-2, for example, used a vocabulary of more than 50,000 tokens. After tokenization, the sentence is no longer a string. It is a list of token IDs.
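As a rough illustration of the text-to-ID step, here is a toy tokenizer. The vocabulary and the regular-expression splitting rule are hypothetical; production tokenizers such as the one used by GPT-2 rely on learned subword merges rather than a hand-written table.

```python
import re

# Hypothetical toy vocabulary; real vocabularies are learned and far larger.
VOCAB = {"<unk>": 0, "transformers": 1, "changed": 2, "ai": 3, ".": 4}

def tokenize(text):
    """Split text into lowercase word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(text):
    """Map each token to its integer ID, falling back to <unk>."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokenize(text)]

tokens = tokenize("Transformers changed AI.")
token_ids = encode("Transformers changed AI.")
print(tokens)
print(token_ids)
```

Anything outside the vocabulary collapses to the `<unk>` ID in this sketch, which is one reason real systems prefer subword or byte-level vocabularies.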
The embedding layer turns each token ID into a dense vector. These input embeddings may have dimensions such as 512, 768, 1024, or 4096, depending on the model. The vector is meant to capture semantic meaning: tokens used in similar contexts tend to be closer in embedding space.
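An embedding layer is essentially a lookup table with one row per token ID. The sketch below assumes a toy vocabulary of 5 tokens and an embedding dimension of 8, and it fills the table with random numbers where a trained model would have learned values.

```python
import random

random.seed(0)
VOCAB_SIZE = 5   # hypothetical toy vocabulary size
D_MODEL = 8      # hypothetical embedding dimension

# Embedding table: one row of D_MODEL floats per token ID.
# Random here; in a trained model these values are learned.
embedding_table = [
    [random.gauss(0.0, 0.02) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up the embedding row for each token ID."""
    return [embedding_table[i] for i in token_ids]

x = embed([1, 2, 3, 4])
print(len(x), len(x[0]))  # sequence length, embedding dimension
```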
At this point, the model knows what tokens are present, but it does not automatically know their order. Transformers utilize a positional encoding mechanism to provide information about the position of tokens in the input sequence, which is crucial because the architecture does not inherently capture sequential order like recurrent neural networks do.
A transformer diagram may show positional encoding as a separate block added to the token embeddings. The original transformer used sinusoidal positional encoding. Other models use learned positional embeddings. Many modern large language models use rotary position embeddings, often called RoPE, which apply position information inside the attention calculation rather than simply adding a vector.
The result is usually drawn as a matrix with shape (sequence length, embedding dimension), with one row per token.
That matrix flows into the first encoder layer, decoder layer, or decoder-only transformer block for further processing.
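The sinusoidal scheme from the original paper can be sketched as follows. Even dimensions use a sine and odd dimensions use a cosine, and each pair shares one frequency. The function names are illustrative, and learned embeddings or RoPE would look different.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position encoding to each token embedding, element-wise."""
    pe = sinusoidal_positions(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(erow, prow)]
            for erow, prow in zip(embeddings, pe)]

pe = sinusoidal_positions(seq_len=3, d_model=4)
print(pe[0])  # position 0: sine terms are 0, cosine terms are 1
```

Because the encoding is added rather than concatenated, the matrix keeps the same shape before and after this step.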

Every encoder layer or decoder layer expands into a transformer block with a repeated internal pattern. The exact layout varies, but the key components are attention, feed forward computation, residual connections, and layer normalization.
The self attention mechanism is the heart of the block, but it is not the whole model. In self-attention, each input token generates three vectors: query, key, and value. These are created by learned linear transformations of the token representation.
The query vector asks, “What am I looking for?” The key vectors say, “What information do I contain?” The value vectors carry the information that will be mixed into the output. The model compares queries with keys, converts those scores into attention weights, and uses those weights to combine values from other positions.
The self attention mechanism enables each token to look across the entire sequence and update its representation based on other relevant tokens. For example, in the sentence “The animal didn’t cross the street because it was tired,” self attention can help connect “it” with “animal” instead of “street.”
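The query, key, and value flow can be written out for a single head. This is a minimal sketch using nested lists instead of a tensor library, with hypothetical identity matrices standing in for the learned projections. The division by the square root of the key dimension follows the original paper.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token vectors x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # Compare every query with every key, scaled by sqrt(d_k).
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    # Each output row is a weighted mix of the value vectors.
    return matmul(weights, v), weights

# Hypothetical toy input: 3 tokens, 2 dimensions, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Row `i` of `weights` shows how much token `i` draws from every token in the sequence, including itself.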
Multi-head attention extends self-attention by running it several times in parallel, each time with different learned projections. This lets the model focus on different parts of the input sequence simultaneously and capture different kinds of relationships among tokens, which improves its representational power.
In a diagram, this is often called multi head attention or multi head self attention. Each of the multiple attention heads has its own learned projections. One head may track syntax, another may track coreference, and another may focus on local phrase structure. The outputs of the attention heads are concatenated and passed through a linear layer.
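The split, attend, concatenate, and project pattern can be sketched as below. The sizes (3 tokens, model width 8, 2 heads) and the random weights are hypothetical stand-ins for learned parameters.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """Run each head with its own projections, concatenate, then project."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs feature-wise for each token position.
    concat = [[value for head in head_outputs for value in head[t]]
              for t in range(len(x))]
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, n_heads = 3, 8, 2  # hypothetical toy sizes
d_head = d_model // n_heads

x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
w_o = random_matrix(d_model, d_model, rng)

out = multi_head_attention(x, heads, w_o)
print(len(out), len(out[0]))  # sequence length, model width
```

Each head works in a narrower space (`d_head`), so the concatenated result returns to the model width before the final linear layer.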
After attention, the representation passes through a position-wise feed forward neural network. This is usually a two-layer MLP applied independently to each token position. A common pattern is to expand from d_model to 4 × d_model, apply an activation function, and project back to d_model.
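A position-wise feed forward block with the expand, activate, and project pattern might look like this. The width of 2 and the deterministic weights are hypothetical, and ReLU is used because that is what the original transformer used.

```python
def linear(vec, weight, bias):
    """Compute vec @ weight + bias for one token vector."""
    return [sum(v * w for v, w in zip(vec, col)) + b
            for col, b in zip(zip(*weight), bias)]

def relu(vec):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in vec]

def feed_forward(x, w1, b1, w2, b2):
    """Apply the same two-layer MLP to every token position independently."""
    return [linear(relu(linear(tok, w1, b1)), w2, b2) for tok in x]

d_model = 2          # hypothetical toy width
d_ff = 4 * d_model   # the common 4x expansion

# Deterministic stand-ins for learned weights.
w1 = [[0.1 * (i + 1) * (1 if j % 2 == 0 else -1) for j in range(d_ff)]
      for i in range(d_model)]
b1 = [0.0] * d_ff
w2 = [[0.05 * (j + 1) for _ in range(d_model)] for j in range(d_ff)]
b2 = [0.0] * d_model

x = [[1.0, 2.0], [3.0, -1.0]]
y = feed_forward(x, w1, b1, w2, b2)
print(len(y), len(y[0]))  # same shape as the input

# Running one token alone gives the same vector as running it in the sequence.
print(feed_forward([x[0]], w1, b1, w2, b2)[0] == y[0])
```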
Residual connections are the arrows that skip around sublayers. They add the input of a sublayer back to its output, which helps information and gradients flow through deep stacks. This design also helps reduce the vanishing gradient problem that made very deep neural networks harder to train.
Layer normalization stabilizes the values flowing through the model. Some diagrams show normalization after each sublayer, which matches the original post-norm layout. Many modern models use pre-norm, where normalization happens before attention or feed forward computation.
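The difference between the two layouts comes down to where the normalization sits relative to the residual addition. The sketch below omits layer normalization's learned gain and bias for brevity, and `double` is a hypothetical stand-in for an attention or feed forward sublayer.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def post_norm_sublayer(x, sublayer):
    """Original layout: LayerNorm(x + Sublayer(x))."""
    return [layer_norm([a + b for a, b in zip(tok, out)])
            for tok, out in zip(x, sublayer(x))]

def pre_norm_sublayer(x, sublayer):
    """Pre-norm layout: x + Sublayer(LayerNorm(x))."""
    normed = [layer_norm(tok) for tok in x]
    return [[a + b for a, b in zip(tok, out)]
            for tok, out in zip(x, sublayer(normed))]

def double(seq):
    """Hypothetical stand-in for an attention or feed forward sublayer."""
    return [[2.0 * v for v in tok] for tok in seq]

x = [[1.0, 2.0, 3.0, 4.0]]
post = post_norm_sublayer(x, double)
pre = pre_norm_sublayer(x, double)
print([round(v, 3) for v in post[0]])
print([round(v, 3) for v in pre[0]])
```

In the pre-norm version the residual path carries `x` through unnormalized, which is the skip arrow that diagrams draw around each sublayer.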
Self-attention blocks in encoder layers are unmasked. That means every position can attend to every other position in the input sequence because the full source sentence is already known.
Masked self attention appears in decoder layers and decoder-only LLM diagrams. The mask blocks future positions by setting illegal attention scores to negative infinity before softmax. After softmax, those future positions receive zero attention probability.
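The masking step can be shown with uniform scores, which makes the effect easy to see. This is a small sketch with hypothetical function names; real implementations apply the same idea to batched tensors.

```python
import math

def causal_mask(scores):
    """Set scores for future positions (column > row) to negative infinity."""
    return [[s if j <= i else -math.inf for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def softmax(row):
    """Softmax over one row; exp(-inf) evaluates to 0.0."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Uniform raw scores for a 3-token sequence.
scores = [[0.0] * 3 for _ in range(3)]
weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print(row)
```

The weights form a lower triangle: the first token attends only to itself, and each later token spreads its attention over itself and earlier tokens. This is the shaded triangle that some diagrams draw.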
Diagrams may show this with a “mask” label, a shaded triangle, or the phrase “masked multi-head attention.” The visual detail matters because masking changes what information can flow through the model.
Masking is what lets a transformer decoder generate text left to right. When predicting the next token, the model can condition on previous tokens, but it cannot inspect the correct answer in a future position.
The feed forward network in each layer processes each token vector independently. It does not mix information across positions. Attention handles token-to-token communication; the feed forward block transforms each token’s representation after that communication has happened.
A typical example is a model with embedding size 768 and feed forward dimension 3072. That 4× expansion makes the feed forward block a major contributor to the parameter count.
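The arithmetic behind that claim is short enough to check directly. The calculation assumes both feed forward layers have biases and that the attention block uses four `d_model × d_model` projections (query, key, value, output) with biases. Exact counts vary by model.

```python
d_model = 768
d_ff = 3072

# Feed forward block: two linear layers, each with a weight matrix and a bias.
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

# Attention block (assumed layout): query, key, value, and output projections,
# each d_model x d_model with a bias.
attn_params = 4 * (d_model * d_model + d_model)

print(f"feed forward: {ffn_params:,}")
print(f"attention:    {attn_params:,}")
print(f"ratio:        {ffn_params / attn_params:.2f}")
```

Under these assumptions the feed forward block holds about 4.7 million parameters per layer, roughly twice the attention block's 2.4 million.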
The original transformer used ReLU activation. Later large language models often use GELU, SwiGLU, or GEGLU. High-level diagrams often hide this detail and draw a single rectangle labeled “Feed Forward” or “Position-wise FFN.”
Encoder layers and decoder layers share multi-head self-attention, feed forward sublayers, residual connections, and layer normalization. The difference is that decoder layers add causal masking, and in the full encoder-decoder design, they also add cross-attention.
In encoder diagrams, the encoder stack is often shown vertically on the left. Each encoder layer contains self-attention followed by a feed forward neural network. There is no causal mask because the full input sequence is available from the start.
A decoder layer is usually drawn with three sublayers: masked self attention over the generated prefix, encoder-decoder attention over encoder outputs, and a feed forward block. This structure lets the transformer decoder use both the target tokens generated so far and the source sequence encoded by the encoder.
The phrase “both the encoder and decoder” can cause confusion because not every transformer model has both. The original transformer did. Many current systems do not.
Encoder-only models keep the encoder stack and remove the decoder. BERT-style transformers use bidirectional self-attention, so each token can attend to tokens on both sides. These models are useful for classification, named entity recognition, embeddings, retrieval, and other language processing tasks that require full context rather than left-to-right generation.
Decoder-only models remove the encoder and the encoder decoder attention mechanism. GPT-style large language models stack masked self-attention and feed forward blocks. Their primary function is next-token prediction: given previous tokens, predict the next token.
Most contemporary LLMs used in chatbots and code assistants are decoder-only transformers trained for next-token prediction on large text corpora. Their diagrams look simpler than the 2017 encoder-decoder figure because there is one tall stack instead of two towers.
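A typical decoder-only transformer architecture diagram shows a single stack: token embeddings with positional information, then repeated blocks of masked self-attention and feed forward layers, then a final linear layer with softmax.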
Masked self-attention is the only attention mechanism needed because the model learns the probability of each token given all previous tokens in the sequence. This is why decoder-only models can generate text, code, or other symbolic output step by step.
These models train on massive amounts of data. During training, the model predicts the next token at each position. During generation, it feeds the chosen output token back into the context and repeats the process. At inference time, techniques such as continuous batching for LLM inference help keep GPUs busy while many of these next-token predictions are computed in parallel for different requests.
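The generation loop itself is simple once the model is treated as a function from context to scores. In the sketch below, `toy_next_token_logits` is a hypothetical stand-in for a trained decoder-only model, and the loop uses greedy selection, which is only one of several decoding rules.

```python
def toy_next_token_logits(context, vocab_size=5):
    """Hypothetical stand-in for a trained decoder-only model.

    It scores the token (last + 1) mod vocab_size highest.
    """
    target = (context[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]

def generate(prompt_ids, max_new_tokens, eos_id=None):
    """Greedy autoregressive generation: predict, append, repeat."""
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(context)
        # Greedy decoding: pick the highest-scoring token.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        context.append(next_id)  # feed the choice back into the context
        if next_id == eos_id:
            break
    return context

print(generate([0], max_new_tokens=4))
print(generate([3], max_new_tokens=5, eos_id=0))
```

A real model recomputes attention over the growing context at each step, which is why engineering diagrams often add a KV cache to this loop.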
Modern decoder-only diagrams often label the output of the final layer as logits. Logits are raw scores over the vocabulary. Softmax turns those scores into probabilities, and decoding rules decide which token comes next. If you want to connect these diagram elements to real-world NLP systems, a beginner's guide to transformer architecture shows how self-attention, embeddings, and positional encoding appear in deployed transformer models.
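The path from logits to a chosen token can be sketched with a three-token vocabulary. The logit values are made up, and the temperature parameter is a common decoding knob that the diagram itself usually does not show.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores over a 3-token vocabulary
probs = softmax(logits)

# Decoding rule 1: greedy, always take the most probable token.
greedy = probs.index(max(probs))

# Decoding rule 2: sample in proportion to the probabilities.
rng = random.Random(0)
sampled = rng.choices(range(len(probs)), weights=probs, k=1)[0]

print([round(p, 3) for p in probs])
print(greedy, sampled)
```

Greedy decoding is deterministic, while sampling can return any token with nonzero probability, so the same logits can lead to different continuations.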
Although decoder-only diagrams look cleaner, they still contain the same core concepts: attention mechanism, feed forward networks, residual connections, layer normalization, linear transformations, and stacked transformer blocks.

Use the diagram as a map of information flow. The goal is to trace what enters, what transforms, what connects, and what comes out.
Ask these questions in order: What enters the model? How is it transformed? What connects one block to the next? What comes out?
A practical way to read any transformer architecture diagram is to follow one token vector. Start with a token in the input text. Follow it into the embedding layer, then through positional encoding, then through an attention layer, then through the feed forward computation. Repeat that mentally across multiple layers until it reaches the output head.
For encoder-decoder diagrams, follow a source token through the encoder, then look at how the decoder attends to the final encoder layer through cross-attention. For decoder-only diagrams, follow a generated token and check that it can attend only to previous tokens.
“Transformer architecture” describes a family of models, not one fixed layout. The original transformer model, BERT, GPT-style systems, T5, vision transformers, speech models, and multimodal models all use related ideas but arrange them differently.
Vision transformers adapt the transformer architecture for computer vision by breaking input images into patches and treating those patches as sequences of tokens. This has worked well for tasks such as image classification and object detection. The original Vision Transformer paper is available on arXiv.
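The patching step can be sketched for a tiny single-channel image. The function assumes the height and width are multiples of the patch size, and it stops at flattened patches. A real vision transformer would then apply a learned linear projection and add position information, which are omitted here.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (nested lists) into flattened square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Read the patch row by row and flatten it into one vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Hypothetical 4 x 4 image whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)
print(len(patches), len(patches[0]))  # number of patch tokens, values per patch
print(patches[0])
```

Each flattened patch plays the role that a token embedding plays for text, so the rest of the diagram can stay the same.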
Transformers have also been applied beyond natural language processing, including audio generation, image recognition, protein structure prediction, and game playing. This broad use comes from the same basic pattern: turn input data into token-like vectors, process entire sequences with attention, and predict useful outputs.
Multimodal transformers, such as those used in models like GPT-5 and Gemini 3, can process multiple types of input data, including text, images, audio, and video, within a single architecture. This enables complex reasoning across modalities, even though the high-level diagram may still look like stacked transformer blocks.
Simplified diagrams are useful because they emphasize the main flow. They may show only embeddings, attention, feed forward blocks, and outputs. That makes them easier to read, especially for beginners.
Research and engineering diagrams are more detailed. They may show tensor shapes, attention masks, pre-layer normalization, post-layer normalization, dropout, KV caching, rotary embeddings, RMSNorm, SwiGLU, and parameter sharing. These details matter when implementing or debugging a model, but they can be overwhelming on a first read.
The trade-off is simple: minimal diagrams clarify the path of information, while full diagrams reveal the exact computation. Once you understand embeddings, positional encoding, self-attention (called multi-head attention when repeated in parallel), feed forward blocks, residual connections, and normalization, you can interpret most variations.
A transformer architecture diagram is useful when it clearly answers one question: how does information get from input to prediction? If you can follow that path, the boxes stop looking like a wall of jargon and start working as a practical guide.