arXiv:2502.XXXXX [cs.CL]

Simulating the Attention Mechanism in Large Language Models Based on the GPT-2 Architecture

Romi Nur Ismanto1

1 Independent AI Researcher

Correspondence: rominur@gmail.com

February 2025

Interactive simulation: https://simulasillm.vercel.app


Abstract

This paper presents an educational study of the attention mechanism in Large Language Models (LLMs) based on the Transformer architecture, with a particular focus on GPT-2 Small. Through an interactive web-based simulation, we demonstrate the complete text processing pipeline, including token embedding, Query, Key, and Value (Q/K/V) projections, masked self-attention with causal masking, softmax-based probability distribution, and autoregressive text generation. The simulation exposes tunable parameters such as temperature, top-k sampling, and generation length, thereby enabling an intuitive understanding of how language models produce text token by token. Our results suggest that interactive visualization can effectively clarify abstract concepts within the Transformer architecture that are otherwise difficult to grasp from mathematical descriptions alone.

Keywords: Large Language Model, Transformer, Attention Mechanism, GPT-2, Self-Attention, Token Embedding, Autoregressive Generation


1. Introduction

The rapid advancement of Large Language Models (LLMs) has fundamentally transformed the field of Natural Language Processing (NLP). Models such as GPT-2 [2], BERT [3], and their successors have demonstrated remarkable capabilities in both understanding and generating coherent text. The foundation underlying these models is the Transformer architecture [1], introduced by Vaswani et al. in 2017.

At the core of the Transformer architecture lies the self-attention mechanism, which enables each token in a sequence to attend to all other tokens simultaneously. This mechanism represents a fundamental departure from sequential architectures such as RNNs and LSTMs, which process tokens one at a time and are therefore inherently limited in capturing long-range dependencies.

Although the concept of attention has been extensively discussed in the literature, developing an intuitive understanding of how each internal component operates remains a significant challenge. This paper aims to elucidate the attention mechanism through an interactive simulation approach [6], with a focus on the GPT-2 Small architecture comprising 12 Transformer layers.

The principal contributions of this work are threefold: (1) a systematic explanation of the token processing pipeline within the GPT-2 architecture, (2) interactive visualizations for each stage of the attention computation, and (3) a demonstration of the effect of sampling parameters on the quality of generated text.


2. Token Embeddings

The first stage in text processing by an LLM involves converting raw input text into numerical representations amenable to computation. This process consists of two principal steps: tokenization and embedding.

2.1 Tokenization

Tokenization is the process of decomposing text into smaller units called tokens. GPT-2 employs Byte Pair Encoding (BPE), which segments text at the subword level rather than strictly at word boundaries. For example, the word "understanding" may be decomposed into the tokens ["under", "standing"]. Each token is mapped to a unique index within the model's vocabulary.
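To make this step concrete, the short sketch below loads the publicly released GPT-2 BPE tokenizer through the Hugging Face transformers library (an external dependency used here only for illustration; the simulation itself does not rely on it) and prints the subword tokens and vocabulary indices for a sample string.

```python
# Illustrative sketch only: assumes the Hugging Face `transformers` package is
# installed; the exact subword split depends on the learned BPE merges.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "understanding language models"
token_ids = tokenizer.encode(text)                   # integer indices into the 50,257-entry vocabulary
tokens = tokenizer.convert_ids_to_tokens(token_ids)  # the corresponding subword strings

print(tokens)
print(token_ids)
```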

2.2 Embedding Vectors

Each indexed token is subsequently transformed into a high-dimensional vector via an embedding lookup table. In GPT-2 Small, each token is represented as a vector of dimension 768. Formally, for a token with index i, the embedding vector is obtained as:

e_i = W_e[i] \in \mathbb{R}^{d_{\text{model}}}    (1)

where W_e denotes the embedding matrix of size |V| × d_model, and |V| = 50,257 for GPT-2.
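A minimal NumPy sketch of the lookup in Equation (1) is shown below; the embedding matrix is randomly initialized as a stand-in for GPT-2's trained weights, and the token indices are hypothetical.

```python
import numpy as np

V, d_model = 50_257, 768                      # GPT-2 Small vocabulary and hidden sizes
rng = np.random.default_rng(0)
W_e = rng.normal(0, 0.02, size=(V, d_model))  # random stand-in for the trained embedding matrix

token_ids = [464, 3797, 3332]                 # hypothetical token indices, for illustration only
E = W_e[token_ids]                            # Eq. (1): row lookup, one 768-d vector per token
print(E.shape)                                # (3, 768)
```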

2.3 Positional Encoding

Since the attention mechanism has no inherent notion of token ordering, positional information must be injected explicitly. GPT-2 uses learned positional embeddings, where each position has a trainable embedding vector:

h_i^{(0)} = e_i + p_i    (2)

where p_i is the positional embedding for position i, and h_i^{(0)} is the resulting input representation passed to the first Transformer layer.
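Continuing the sketch above, the addition in Equation (2) is an element-wise sum of the token embedding and a learned positional embedding (again randomly initialized here as a stand-in for the trained table).

```python
max_len = 1_024                                     # GPT-2 Small context length
W_p = rng.normal(0, 0.02, size=(max_len, d_model))  # random stand-in for the positional table

positions = np.arange(len(token_ids))               # 0, 1, 2, ... for each input position
H0 = E + W_p[positions]                             # Eq. (2): h_i^(0) = e_i + p_i
print(H0.shape)                                     # (3, 768)
```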


3. Query, Key, and Value Projections

Once tokens have been represented as embedding vectors, the next critical step is to compute self-attention. Each input vector is projected into three distinct vector spaces: Query (Q), Key (K), and Value (V).

3.1 Linear Projections

For each attention head h, projections are computed using learned weight matrices:

Q_h = X W_h^Q, \quad K_h = X W_h^K, \quad V_h = X W_h^V    (3)

where X is the input matrix (token sequence), and the projection matrices map from d_model to d_k. For GPT-2 Small with 12 attention heads, d_k = 768/12 = 64.
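The sketch below computes the projections of Equation (3) for a single head, with random matrices standing in for the trained weights and a toy four-token input.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_heads = 768, 12
d_k = d_model // n_heads                        # 64 for GPT-2 Small

X = rng.normal(size=(4, d_model))               # toy sequence of 4 token representations

W_Q = rng.normal(0, 0.02, size=(d_model, d_k))  # W_h^Q for one head h
W_K = rng.normal(0, 0.02, size=(d_model, d_k))  # W_h^K
W_V = rng.normal(0, 0.02, size=(d_model, d_k))  # W_h^V

Q, K, V = X @ W_Q, X @ W_K, X @ W_V             # Eq. (3)
print(Q.shape, K.shape, V.shape)                # (4, 64) each
```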

3.2 Intuition Behind Q/K/V

Intuitively, the three projections can be understood as follows:

  • Query (Q) — represents the question posed by a token: "What information do I need?"
  • Key (K) — represents the identity offered by a token: "What information do I have?"
  • Value (V) — represents the actual content carried by a token: "Here is the information I carry."

The compatibility between a Query and a Key determines the attention weight assigned, and the Value provides the information aggregated according to those weights.


4. Masked Self-Attention

Self-attention allows each token to access information from the entire sequence. However, in autoregressive models such as GPT-2, a token may attend only to itself and to the tokens that precede it. This constraint is enforced through causal masking.

4.1 Scaled Dot-Product Attention

Attention scores are computed as the dot product of Query and Key vectors, normalized by the square root of the key dimension:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V    (4)

The scaling factor prevents the dot products from growing excessively large in magnitude, which would cause the softmax gradients to become vanishingly small.

4.2 Causal Mask

The mask matrix M places −∞ in its strictly upper-triangular entries, i.e., at the positions where the model is not permitted to attend:

M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}    (5)

When −∞ is added to the attention logits, the softmax function yields a probability approaching zero for those positions, effectively preventing the model from attending to future tokens.

Queries ↓ \ Keys →    The      cat      sat      on
The                   1.00     −∞       −∞       −∞
cat                   0.30     0.70     −∞       −∞
sat                   0.10     0.20     0.70     −∞
on                    0.05     0.15     0.30     0.50

Blue = high attention weight | Gray (−∞) = masked (causal mask)
Figure 1. Visualization of the attention matrix with causal masking. Blue cells indicate active attention weights, while gray cells (−∞) denote masked positions that prevent the model from attending to future tokens.
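The function below is a self-contained NumPy sketch of Equations (4) and (5): it builds the causal mask as a strictly upper-triangular boolean matrix, computes the scaled dot-product scores, and normalizes each row with a softmax. The toy Q, K, V are random and serve only to show the shapes and the lower-triangular structure of the resulting weights.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (Eqs. 4-5)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                    # QK^T / sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal (future tokens)
    scores = np.where(mask, -np.inf, scores)           # M_ij = -inf for j > i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax; masked entries become 0
    return weights @ V, weights

rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(4, 64)) for _ in range(3))
out, attn = causal_attention(Q, K, V)
print(np.round(attn, 2))                               # lower-triangular; each row sums to 1
```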

4.3 Multi-Head Attention

GPT-2 Small employs 12 attention heads operating in parallel. Each head computes attention independently, enabling the model to capture diverse types of inter-token relationships simultaneously:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_{12})\, W^O    (6)

where each head_i = Attention(Q_i, K_i, V_i), and W^O is the output projection matrix.
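The sketch below instantiates Equation (6), reusing the causal_attention function from the previous sketch and random stand-ins for the per-head and output projection matrices.

```python
d_model, n_heads = 768, 12
d_k = d_model // n_heads
rng = np.random.default_rng(3)

X = rng.normal(size=(4, d_model))                        # toy sequence of 4 tokens
W_Q = rng.normal(0, 0.02, size=(n_heads, d_model, d_k))  # stacked per-head projections
W_K = rng.normal(0, 0.02, size=(n_heads, d_model, d_k))
W_V = rng.normal(0, 0.02, size=(n_heads, d_model, d_k))
W_O = rng.normal(0, 0.02, size=(d_model, d_model))       # output projection W^O

heads = [causal_attention(X @ W_Q[h], X @ W_K[h], X @ W_V[h])[0]
         for h in range(n_heads)]                        # head_h = Attention(Q_h, K_h, V_h)
out = np.concatenate(heads, axis=-1) @ W_O               # Eq. (6): Concat(head_1..head_12) W^O
print(out.shape)                                         # (4, 768)
```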


5. Softmax and Probability Distribution

After passing through all Transformer layers (12 in GPT-2 Small), the final hidden representation of the last token is used to predict a probability distribution over the entire vocabulary.

5.1 Language Modeling Head

The output representation from the final layer is projected into the vocabulary space via a matrix multiplication:

z = h_n^{(L)}\, W_e^T    (7)

where h_n^{(L)} is the hidden state at the last position from layer L, and W_e^T is the transpose of the embedding matrix (weight tying). The resulting vector z is referred to as the logits.
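A minimal sketch of Equation (7), with a random matrix standing in for the tied embedding weights and a random vector standing in for the final hidden state:

```python
import numpy as np

rng = np.random.default_rng(4)
V, d_model = 50_257, 768
W_e = rng.normal(0, 0.02, size=(V, d_model))  # stand-in for the (tied) embedding matrix

h_last = rng.normal(size=(d_model,))          # stand-in for h_n^(L), the last position's hidden state
logits = h_last @ W_e.T                       # Eq. (7): z = h_n^(L) W_e^T
print(logits.shape)                           # (50257,) -- one logit per vocabulary entry
```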

5.2 Softmax Function

The logits are converted into a probability distribution using the softmax function:

P(w_i \mid w_{<i}) = \frac{\exp(z_i / T)}{\sum_{j=1}^{|V|} \exp(z_j / T)}    (8)

where T denotes the temperature parameter. In the simulation, this probability distribution is visualized in real time, displaying the candidate next tokens along with their associated probabilities.

Example: Next-Token Probability Distribution

  Token             Probability
  empowers users    54.7%
  create            20.8%
  see               12.1%
  make               6.3%
  other              6.1%
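The sketch below implements the temperature-scaled softmax of Equation (8) on a small toy logit vector (unrelated to the example above) and shows how lowering or raising T sharpens or flattens the resulting distribution.

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                   # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([4.0, 3.0, 2.5, 1.8, 1.7])  # toy logits for five candidate tokens
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
# Lower T concentrates mass on the top token; higher T spreads it out.
```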

6. Autoregressive Sampling

Text generation in autoregressive models proceeds token by token. Each generated token is appended to the input sequence and used to predict the subsequent token. This process continues until either the desired sequence length is reached or a special end-of-sequence token is produced.

6.1 Temperature

The temperature parameter (T) controls the sharpness of the probability distribution:

  • T → 0: the distribution becomes increasingly peaked (deterministic), favoring the highest-probability token.
  • T = 1.0: the probability distribution is unmodified (default behavior).
  • T > 1.0: the distribution becomes flatter (more stochastic), increasing output diversity.

In the simulation, temperature is set to a default value of 0.8, which provides a balance between coherence and diversity.

6.2 Top-k Sampling

Top-k sampling [5] restricts token selection to the k tokens with the highest probabilities. Tokens outside the top-k are eliminated by setting their probabilities to zero, after which the distribution is renormalized:

P'(w_i) = \begin{cases} \dfrac{P(w_i)}{\sum_{j \in \text{top-}k} P(w_j)} & \text{if } w_i \in \text{top-}k \\ 0 & \text{otherwise} \end{cases}    (9)

With k = 5 in the simulation, only the five highest-probability tokens are considered as candidates, thereby reducing the risk of generating irrelevant or incoherent tokens.
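A sketch of Equation (9) on a toy distribution; the probability values are illustrative only.

```python
import numpy as np

def top_k_filter(probs, k=5):
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]      # indices of the k most probable tokens
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]       # zero out everything outside the top-k
    return filtered / filtered.sum()   # renormalize so the kept mass sums to 1

probs = np.array([0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.03])
print(np.round(top_k_filter(probs, k=5), 3))   # only the five largest entries remain non-zero
```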

6.3 The Autoregressive Process

Formally, autoregressive generation produces a sequence of tokens (w1, w2, ..., wT) where each token is sampled from:

w_t \sim P(w \mid w_1, w_2, \ldots, w_{t-1})    (10)

The selected token is appended to the input sequence and the process repeats for the next position. In the simulation, this process is visualized under the "Generated Continuation" panel.
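The loop below sketches Equation (10) end to end with the simulation's default settings (temperature 0.8, top-k 5, six new tokens). The model_logits function is a hypothetical placeholder that returns random logits in place of a real forward pass through the 12 Transformer blocks.

```python
import numpy as np

rng = np.random.default_rng(5)
V = 50_257

def model_logits(token_ids):
    """Hypothetical placeholder for the real model: random logits over the vocabulary."""
    return rng.normal(size=V)

def generate(prompt_ids, n_new=6, temperature=0.8, k=5):
    ids = list(prompt_ids)
    for _ in range(n_new):
        z = model_logits(ids) / temperature    # temperature-scaled logits (Eq. 8)
        top = np.argsort(z)[-k:]               # top-k candidate tokens (Eq. 9)
        p = np.exp(z[top] - z[top].max())
        p = p / p.sum()                        # softmax restricted to the candidates
        ids.append(int(rng.choice(top, p=p)))  # Eq. (10): w_t ~ P(w | w_1..w_{t-1})
    return ids

print(generate([464, 3797, 3332], n_new=6))    # 3 prompt ids followed by 6 sampled ids
```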


7. GPT-2 Small Architecture

The simulation is modeled on the GPT-2 Small architecture, the smallest variant in the GPT-2 family. Table 1 summarizes the architectural specifications.

Table 1. GPT-2 Small architectural parameters.

Parameter                   Value     Description
Number of Layers            12        Transformer decoder blocks
Model Dimension (d_model)   768       Hidden state size
Attention Heads             12        Per-layer multi-head attention
Key Dimension (d_k)         64        768 / 12 heads
Vocabulary Size             50,257    BPE tokens
Context Length              1,024     Maximum tokens per sequence
Causal Mask                 Active    Autoregressive masking
Total Parameters            ~117M     117 million trainable parameters
[Figure 2 diagram: Input Tokens → Token Embedding + Position → Transformer Block (×12) {Query (Q) / Key (K) / Value (V) → Masked Self-Attention → Add & LayerNorm → Feed-Forward Network → Add & LayerNorm} → Softmax → Token Probabilities]
Figure 2. Transformer architecture diagram for GPT-2 Small. Input tokens are processed through an embedding layer, then pass through 12 Transformer blocks, each containing masked self-attention and a feed-forward network, before yielding a probability distribution over the next token.

Each Transformer block consists of: (1) masked multi-head self-attention, (2) layer normalization, (3) a position-wise feed-forward network (an MLP with GELU activation), and (4) residual connections bridging the input and output of each sub-layer. In GPT-2 the layer normalization is applied before each sub-layer (a pre-norm layout), with an additional layer normalization after the final block.
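The sketch below assembles these pieces into one block, assuming the pre-norm layout described above; the learnable scale and bias of layer normalization are omitted, and an identity function stands in for the attention sub-layer so the example stays short and runnable.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)        # learnable gain/bias omitted for brevity

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attention_fn, W1, b1, W2, b2):
    x = x + attention_fn(layer_norm(x))         # pre-norm masked self-attention + residual
    h = gelu(layer_norm(x) @ W1 + b1)           # position-wise feed-forward (768 -> 3072)
    return x + h @ W2 + b2                      # projection back to 768 + residual

rng = np.random.default_rng(6)
d_model, d_ff = 768, 3072
W1, b1 = rng.normal(0, 0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.02, size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(4, d_model))
y = transformer_block(x, lambda h: h, W1, b1, W2, b2)   # identity stands in for multi-head attention
print(y.shape)                                          # (4, 768)
```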


8. Interactive Simulation

The interactive simulation deployed at simulasillm.vercel.app provides a comprehensive visualization of each processing stage in the GPT-2 Small architecture [6].

8.1 Visualization Components

The simulation comprises six principal components:

  1. Embedding View — displays the input token stream and their corresponding embedding vector representations.
  2. Attention Core — visualizes the Q/K/V maps and the masked self-attention matrix in real time.
  3. Probability Distribution — renders the next-token probability distribution after softmax.
  4. Query Token Details — shows the detailed attention weights for the currently active query token.
  5. Q/K/V Snapshot — presents the numerical vector values for a selected attention head.
  6. Generated Continuation — displays the autoregressive sampling output showing the generated text.

8.2 Tunable Parameters

Users can adjust the following parameters:

  • Temperature (default: 0.8) — controls the output probability distribution sharpness.
  • Top-k (default: 5) — the number of highest-probability candidate tokens to consider.
  • Generation length (default: 6 tokens) — the number of tokens to generate autoregressively.

8.3 Educational Significance

The simulation bridges the gap between formal mathematical descriptions and intuitive understanding. By visualizing the internal operations of the model interactively, users can observe how changes to a single component (e.g., temperature) propagate through the entire text generation pipeline—an insight that is difficult to obtain from equations or pseudocode alone.


9. Conclusion

This paper has presented a comprehensive explanation of the attention mechanism in the GPT-2 Small architecture through an interactive simulation. We have discussed in detail each processing stage, from token embeddings and Q/K/V projections to masked self-attention and autoregressive sampling.

The key findings are as follows:

  1. Interactive visualization significantly enhances understanding of internal Transformer operations, particularly the concepts of causal masking and multi-head attention.
  2. Sampling parameters (temperature and top-k) have a substantial influence on the quality and diversity of generated text, and their effects become considerably more transparent through direct experimentation within the simulation.
  3. Simulation-based educational approaches can serve as an effective complement to conventional learning methods for understanding complex deep learning architectures.

The simulation is publicly available at simulasillm.vercel.app. Future work may include extending the visualization to cover layer normalization, residual connections, and comparative analysis with other Transformer variants.


References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
[2] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Technical Report. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[3] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019. https://arxiv.org/abs/1810.04805
[4] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 26. https://arxiv.org/abs/1310.4546
[5] Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration. Proceedings of ICLR 2020. https://arxiv.org/abs/1904.09751
[6] Simulasi LLM (2025). LLM Attention Simulation (GPT-2 Style). Interactive Web Simulation. https://simulasillm.vercel.app/