arXiv:2502.XXXXX [cs.CL]
Simulating the Attention Mechanism in Large Language Models Based on the GPT-2 Architecture
Independent AI Researcher
Correspondence: rominur@gmail.com
February 2025
Interactive simulation: https://simulasillm.vercel.app
Abstract
This paper presents an educational study of the attention mechanism in Large Language Models (LLMs) based on the Transformer architecture, with a particular focus on GPT-2 Small. Through an interactive web-based simulation, we demonstrate the complete text processing pipeline, including token embedding, Query, Key, and Value (Q/K/V) projections, masked self-attention with causal masking, softmax-based probability distribution, and autoregressive text generation. The simulation exposes tunable parameters such as temperature, top-k sampling, and generation length, thereby enabling an intuitive understanding of how language models produce text token by token. Our results suggest that interactive visualization can effectively clarify abstract concepts within the Transformer architecture that are otherwise difficult to grasp from mathematical descriptions alone.
Keywords: Large Language Model, Transformer, Attention Mechanism, GPT-2, Self-Attention, Token Embedding, Autoregressive Generation
1. Introduction
The rapid advancement of Large Language Models (LLMs) has fundamentally transformed the field of Natural Language Processing (NLP). Models such as GPT-2 [2], BERT [3], and their successors have demonstrated remarkable capabilities in both understanding and generating coherent text. The foundation underlying these models is the Transformer architecture [1], introduced by Vaswani et al. in 2017.
At the core of the Transformer architecture lies the self-attention mechanism, which enables each token in a sequence to attend to all other tokens simultaneously. This mechanism represents a fundamental departure from sequential architectures such as RNNs and LSTMs, which process tokens one at a time and are therefore inherently limited in capturing long-range dependencies.
Although the concept of attention has been extensively discussed in the literature, developing an intuitive understanding of how each internal component operates remains a significant challenge. This paper aims to elucidate the attention mechanism through an interactive simulation approach [6], with a focus on the GPT-2 Small architecture comprising 12 Transformer layers.
The principal contributions of this work are threefold: (1) a systematic explanation of the token processing pipeline within the GPT-2 architecture, (2) interactive visualizations for each stage of the attention computation, and (3) a demonstration of the effect of sampling parameters on the quality of generated text.
2. Token Embeddings
The first stage in text processing by an LLM involves converting raw input text into numerical representations amenable to computation. This process consists of two principal steps: tokenization and embedding.
2.1 Tokenization
Tokenization is the process of decomposing text into smaller units called tokens. GPT-2 employs Byte Pair Encoding (BPE), which segments text at the subword level rather than strictly at word boundaries. For example, the word "understanding" may be decomposed into the tokens ["under", "standing"]. Each token is mapped to a unique index within the model's vocabulary.
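For illustration, the following sketch (assuming the `tiktoken` package, which bundles GPT-2's BPE vocabulary) encodes a short string and prints each token index next to the subword it maps to; the exact splits depend on GPT-2's learned BPE merges.

```python
# Sketch: GPT-2 BPE tokenization, assuming the tiktoken package is installed.
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # GPT-2's BPE vocabulary (50,257 entries)
token_ids = enc.encode("The cat sat on the mat")

# Each id indexes the vocabulary; decode_single_token_bytes shows the subword it maps to.
for tid in token_ids:
    print(tid, enc.decode_single_token_bytes(tid))
```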
2.2 Embedding Vectors
Each indexed token is subsequently transformed into a high-dimensional vector via an embedding lookup table. In GPT-2 Small, each token is represented as a vector of dimension 768. Formally, for a token with index $i$, the embedding vector is obtained as:

$$e_i = W_e[i]$$

where $W_e$ denotes the embedding matrix of size $|V| \times d_{\text{model}}$, and $|V| = 50{,}257$ for GPT-2.
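A minimal numpy sketch of the lookup follows; the embedding matrix is randomly initialized here as a stand-in for GPT-2's learned weights, and the token indices are hypothetical.

```python
import numpy as np

V, d_model = 50257, 768                            # GPT-2 Small vocabulary size and model dimension
rng = np.random.default_rng(0)
W_e = rng.normal(scale=0.02, size=(V, d_model))    # stand-in for the learned embedding matrix

token_ids = [464, 3797, 3332]                      # hypothetical token indices from the tokenizer
E = W_e[token_ids]                                 # embedding lookup: one 768-d row per token
print(E.shape)                                     # (3, 768)
```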
2.3 Positional Encoding
Since the attention mechanism has no inherent notion of token ordering, positional information must be injected explicitly. GPT-2 uses learned positional embeddings, where each position has a trainable embedding vector:

$$h_i^{(0)} = e_i + p_i$$

where $p_i$ is the positional embedding for position $i$, and $h_i^{(0)}$ is the resulting input representation passed to the first Transformer layer.
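Continuing the sketch, the learned positional embeddings are added element-wise to the token embeddings before the first block (again with random stand-in weights):

```python
import numpy as np

n_ctx, d_model = 1024, 768                              # GPT-2 Small context length and model dimension
rng = np.random.default_rng(0)
W_p = rng.normal(scale=0.02, size=(n_ctx, d_model))     # stand-in learned positional embeddings

E = rng.normal(size=(3, d_model))                       # token embeddings for a 3-token input (illustrative)
h0 = E + W_p[: E.shape[0]]                              # h_i^(0) = e_i + p_i, fed to the first layer
print(h0.shape)                                         # (3, 768)
```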
3. Query, Key, and Value Projections
Once tokens have been represented as embedding vectors, the next critical step is to compute self-attention. Each input vector is projected into three distinct vector spaces: Query (Q), Key (K), and Value (V).
3.1 Linear Projections
For each attention head $h$, projections are computed using learned weight matrices:

$$Q_h = X W_h^Q, \qquad K_h = X W_h^K, \qquad V_h = X W_h^V$$

where $X$ is the input matrix (token sequence), and the projection matrices map from $d_{\text{model}}$ to $d_k$. For GPT-2 Small with 12 attention heads, $d_k = 768/12 = 64$.
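A numpy sketch of the projections for a single head, with random stand-in weight matrices of the shapes given above:

```python
import numpy as np

d_model, n_heads = 768, 12
d_k = d_model // n_heads                                # 64 per head in GPT-2 Small
rng = np.random.default_rng(0)

X = rng.normal(size=(5, d_model))                       # 5 input token representations
W_Q = rng.normal(scale=0.02, size=(d_model, d_k))       # learned in the real model; random here
W_K = rng.normal(scale=0.02, size=(d_model, d_k))
W_V = rng.normal(scale=0.02, size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                     # each of shape (5, 64) for one head
print(Q.shape, K.shape, V.shape)
```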
3.2 Intuition Behind Q/K/V
Intuitively, the three projections can be understood as follows:
- Query (Q) — represents the question posed by a token: "What information do I need?"
- Key (K) — represents the identity offered by a token: "What information do I have?"
- Value (V) — represents the actual content carried by a token: "Here is the information I carry."
The compatibility between a Query and a Key determines the attention weight assigned, and the Value provides the information aggregated according to those weights.
4. Masked Self-Attention
Self-attention allows each token to access information from the entire sequence. However, in autoregressive models such as GPT-2, a token must only attend to tokens that precede it. This constraint is enforced through causal masking.
4.1 Scaled Dot-Product Attention
Attention scores are computed as the dot product of Query and Key vectors, normalized by the square root of the key dimension:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
The scaling factor prevents the dot products from growing excessively large in magnitude, which would cause the softmax gradients to become vanishingly small.
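The computation can be sketched in a few lines of numpy (the causal mask is introduced in the next subsection):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, without masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, n) attention logits
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of Values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)      # (5, 64)
```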
4.2 Causal Mask
The mask matrix $M$ is a strictly upper-triangular matrix with $-\infty$ at positions where the model is not permitted to attend:

$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
When −∞ is added to the attention logits, the softmax function yields a probability approaching zero for those positions, effectively preventing the model from attending to future tokens.
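The mask can be built with a single numpy call; adding it to the logits before the softmax drives the disallowed attention weights to zero:

```python
import numpy as np

n = 4
# -inf strictly above the diagonal; zeros elsewhere (np.triu zeroes the lower triangle).
M = np.triu(np.full((n, n), -np.inf), k=1)
print(M)

# Adding M to the attention logits and applying softmax sends the masked weights to zero.
logits = np.zeros((n, n)) + M                           # toy logits: all zeros before masking
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))                                 # row i is uniform over positions 0..i, zero afterwards
```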
4.3 Multi-Head Attention
GPT-2 Small employs 12 attention heads operating in parallel. Each head computes attention independently, enabling the model to capture diverse types of inter-token relationships simultaneously:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_{12})\, W^O$$

where each $\text{head}_i = \text{Attention}(Q_i, K_i, V_i)$, and $W^O$ is the output projection matrix.
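A compact sketch with random stand-in weights, reusing the masked attention above and concatenating the 12 head outputs before the output projection:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked scaled dot-product attention for one head."""
    n, d_k = Q.shape
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    scores = Q @ K.T / np.sqrt(d_k) + mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

d_model, n_heads = 768, 12
d_k = d_model // n_heads
rng = np.random.default_rng(0)

X = rng.normal(size=(5, d_model))                                 # 5 token representations
W_qkv = rng.normal(scale=0.02, size=(n_heads, 3, d_model, d_k))   # per-head stand-in weights
W_O = rng.normal(scale=0.02, size=(d_model, d_model))             # output projection

heads = [causal_attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in W_qkv]
out = np.concatenate(heads, axis=-1) @ W_O                        # (5, 768): Concat(head_1..head_12) W^O
print(out.shape)
```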
5. Softmax and Probability Distribution
After passing through all Transformer layers (12 in GPT-2 Small), the final hidden representation of the last token is used to predict a probability distribution over the entire vocabulary.
5.1 Language Modeling Head
The output representation from the final layer is projected into the vocabulary space via a matrix multiplication:

$$z = h_n^{(L)} W_e^\top$$

where $h_n^{(L)}$ is the hidden state at the last position from layer $L$, and $W_e^\top$ is the transpose of the embedding matrix (weight tying). The resulting vector $z$ is referred to as the logits.
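A sketch of the projection with weight tying, again using a random stand-in for the embedding matrix:

```python
import numpy as np

V, d_model = 50257, 768
rng = np.random.default_rng(0)
W_e = rng.normal(scale=0.02, size=(V, d_model))     # (tied) embedding matrix stand-in

h_last = rng.normal(size=(d_model,))                # final hidden state of the last token
z = h_last @ W_e.T                                  # logits over the whole vocabulary
print(z.shape)                                      # (50257,)
```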
5.2 Softmax Function
The logits are converted into a probability distribution using the softmax function:

$$P(w_i) = \frac{\exp(z_i / T)}{\sum_{j=1}^{|V|} \exp(z_j / T)}$$

where $T$ denotes the temperature parameter. In the simulation, this probability distribution is visualized in real time, displaying the candidate next tokens along with their associated probabilities.
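A minimal implementation of the temperature-scaled softmax:

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Convert logits z into a probability distribution, sharpened or flattened by T."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                                    # numerical stability; does not change the result
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.1])                  # toy logits for three candidate tokens
print(softmax_with_temperature(logits, T=1.0).round(3))
```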
6. Autoregressive Sampling
Text generation in autoregressive models proceeds token by token. Each generated token is appended to the input sequence and used to predict the subsequent token. This process continues until either the desired sequence length is reached or a special end-of-sequence token is produced.
6.1 Temperature
The temperature parameter (T) controls the sharpness of the probability distribution:
- T → 0: the distribution becomes increasingly peaked (deterministic), favoring the highest-probability token.
- T = 1.0: the probability distribution is unmodified (default behavior).
- T > 1.0: the distribution becomes flatter (more stochastic), increasing output diversity.
In the simulation, temperature is set to a default value of 0.8, which provides a balance between coherence and diversity.
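The effect is easy to verify on toy logits: lowering $T$ concentrates nearly all probability mass on the top token, while raising it flattens the distribution:

```python
import numpy as np

def softmax_with_temperature(z, T):
    z = np.asarray(z, dtype=float) / T
    p = np.exp(z - z.max())
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.1])                  # toy logits for three candidate tokens
for T in (0.2, 0.8, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
```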
6.2 Top-k Sampling
Top-k sampling [5] restricts token selection to the $k$ tokens with the highest probabilities. Tokens outside the top-k set $V_k$ are eliminated by setting their probabilities to zero, after which the distribution is renormalized:

$$P'(w_i) = \begin{cases} \dfrac{P(w_i)}{\sum_{w_j \in V_k} P(w_j)} & \text{if } w_i \in V_k \\ 0 & \text{otherwise} \end{cases}$$
With k = 5 in the simulation, only the five highest-probability tokens are considered as candidates, thereby reducing the risk of generating irrelevant or incoherent tokens.
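A sketch of top-k filtering and renormalization over a toy distribution:

```python
import numpy as np

def top_k_filter(probs, k=5):
    """Keep only the k most probable tokens and renormalize."""
    probs = np.asarray(probs, dtype=float)
    cutoff = np.sort(probs)[-k]                          # k-th largest probability
    filtered = np.where(probs >= cutoff, probs, 0.0)     # zero out everything below the cutoff
    return filtered / filtered.sum()

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(20))                           # toy distribution over 20 "tokens"
p_k = top_k_filter(p, k=5)
print(np.count_nonzero(p_k), p_k.sum().round(3))         # 5 nonzero entries, sums to 1
```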
6.3 The Autoregressive Process
Formally, autoregressive generation produces a sequence of tokens $(w_1, w_2, \ldots, w_N)$, where each token is sampled from the conditional distribution:

$$w_t \sim P(w_t \mid w_1, \ldots, w_{t-1})$$
The selected token is appended to the input sequence and the process repeats for the next position. In the simulation, this process is visualized under the "Generated Continuation" panel.
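Putting the pieces together, the loop below sketches autoregressive sampling with temperature and top-k filtering. The function `next_token_probs` is a hypothetical stand-in for a forward pass through the full model (here it returns a random distribution), not GPT-2 itself.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50257

def next_token_probs(context):
    """Hypothetical stand-in for the model forward pass: returns P(w_t | w_1..w_{t-1})."""
    return rng.dirichlet(np.ones(VOCAB))                 # random distribution; ignores the context

def generate(prompt_ids, n_new=6, top_k=5, temperature=0.8):
    ids = list(prompt_ids)
    for _ in range(n_new):
        p = next_token_probs(ids)
        p = p ** (1.0 / temperature)                     # equivalent to dividing the logits by T
        p /= p.sum()
        cutoff = np.sort(p)[-top_k]                      # top-k filtering and renormalization
        p = np.where(p >= cutoff, p, 0.0)
        p /= p.sum()
        ids.append(int(rng.choice(VOCAB, p=p)))          # sample the next token id
    return ids

print(generate([464, 3797], n_new=6))                    # hypothetical prompt token ids
```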
7. GPT-2 Small Architecture
The simulation is modeled on the GPT-2 Small architecture, the smallest variant in the GPT-2 family. Table 1 summarizes the architectural specifications.
Table 1. GPT-2 Small architectural parameters.
| Parameter | Value | Description |
|---|---|---|
| Number of Layers | 12 | Transformer decoder blocks |
| Model Dimension (dmodel) | 768 | Hidden state size |
| Attention Heads | 12 | Per-layer multi-head attention |
| Key Dimension (dk) | 64 | 768 / 12 heads |
| Vocabulary Size | 50,257 | BPE tokens |
| Context Length | 1,024 | Maximum tokens per sequence |
| Causal Mask | Active | Autoregressive masking |
| Total Parameters | ~117M | 117 million trainable parameters |
Each Transformer block consists of two sub-layers: (1) masked multi-head self-attention and (2) a position-wise feed-forward network (an MLP with GELU activation). In GPT-2, layer normalization is applied at the input of each sub-layer (pre-norm), and a residual connection bridges each sub-layer's input and output.
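The block structure can be summarized in a schematic numpy sketch with random stand-in weights, using the pre-norm ordering described above (learned layer-norm gain and bias are omitted for brevity):

```python
import numpy as np

d_model, d_ff, n_heads = 768, 3072, 12
d_k = d_model // n_heads
rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)                 # gain/bias omitted for brevity

def gelu(x):
    # GELU (tanh approximation) as used in GPT-2's MLP
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def causal_mha(x):
    """Masked multi-head self-attention with random stand-in weights."""
    n = x.shape[0]
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        s = Q @ K.T / np.sqrt(d_k) + mask
        w = np.exp(s - s.max(-1, keepdims=True))
        heads.append((w / w.sum(-1, keepdims=True)) @ V)
    W_O = rng.normal(scale=0.02, size=(d_model, d_model))
    return np.concatenate(heads, -1) @ W_O

def mlp(x):
    W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
    W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
    return gelu(x @ W1) @ W2

def block(x):
    x = x + causal_mha(layer_norm(x))                    # sub-layer 1: pre-norm attention + residual
    x = x + mlp(layer_norm(x))                           # sub-layer 2: pre-norm MLP + residual
    return x

h = rng.normal(size=(5, d_model))
print(block(h).shape)                                    # (5, 768)
```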
8. Interactive Simulation
The interactive simulation deployed at simulasillm.vercel.app provides a comprehensive visualization of each processing stage in the GPT-2 Small architecture [6].
8.1 Visualization Components
The simulation comprises six principal components:
- Embedding View — displays the input token stream and their corresponding embedding vector representations.
- Attention Core — visualizes the Q/K/V maps and the masked self-attention matrix in real time.
- Probability Distribution — renders the next-token probability distribution after softmax.
- Query Token Details — shows the detailed attention weights for the currently active query token.
- Q/K/V Snapshot — presents the numerical vector values for a selected attention head.
- Generated Continuation — displays the autoregressive sampling output showing the generated text.
8.2 Tunable Parameters
Users can adjust the following parameters:
- Temperature (default: 0.8) — controls the output probability distribution sharpness.
- Top-k (default: 5) — the number of highest-probability candidate tokens to consider.
- Generation length (default: 6 tokens) — the number of tokens to generate autoregressively.
8.3 Educational Significance
The simulation bridges the gap between formal mathematical descriptions and intuitive understanding. By visualizing the internal operations of the model interactively, users can observe how changes to a single component (e.g., temperature) propagate through the entire text generation pipeline—an insight that is difficult to obtain from equations or pseudocode alone.
9. Conclusion
This paper has presented a comprehensive explanation of the attention mechanism in the GPT-2 Small architecture through an interactive simulation. We have discussed in detail each processing stage, from token embeddings and Q/K/V projections to masked self-attention and autoregressive sampling.
The key findings are as follows:
- Interactive visualization significantly enhances understanding of internal Transformer operations, particularly the concepts of causal masking and multi-head attention.
- Sampling parameters (temperature and top-k) have a substantial influence on the quality and diversity of generated text, and their effects become considerably more transparent through direct experimentation within the simulation.
- Simulation-based educational approaches can serve as an effective complement to conventional learning methods for understanding complex deep learning architectures.
The simulation is publicly available at simulasillm.vercel.app. Future work may include extending the visualization to cover layer normalization, residual connections, and comparative analysis with other Transformer variants.
References
- [1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
- [2] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Technical Report. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- [3] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019. https://arxiv.org/abs/1810.04805
- [4] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 26. https://arxiv.org/abs/1310.4546
- [5] Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration. Proceedings of ICLR 2020. https://arxiv.org/abs/1904.09751
- [6] Simulasi LLM (2025). LLM Attention Simulation (GPT-2 Style). Interactive Web Simulation. https://simulasillm.vercel.app/