Engineering Blog
Scientific Research · 2026-04-10

Scaling in Transformer Architectures: The Mathematical Rationale behind $\sqrt{d_k}$

Scientific Research Team | Industrial Case Study

[Figure: Softmax distribution with and without scaling]

In the architecture of Transformers, the self-attention mechanism is defined by Scaled Dot-Product Attention. While the dot product measures the similarity between Query ($Q$) and Key ($K$) vectors, the division by $\sqrt{d_k}$ is not a heuristic convenience but a critical stabilization required for numerical optimization.

This article formalizes the relationship between vector dimensionality, variance explosion, and the resulting saturation of the softmax function.


1. The Variance of High-Dimensional Dot Products

Consider two vectors, a Query $q$ and a Key $k$, both of dimension $d_k$. The dot product is defined as:

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$

Under standard weight initialization schemes (e.g., Xavier or Kaiming initialization), we assume that the components $q_i$ and $k_i$ are independent random variables with zero mean and unit variance:

  • $E[q_i] = E[k_i] = 0$
  • $\mathrm{Var}(q_i) = \mathrm{Var}(k_i) = 1$

1.1 Mathematical Derivation

The variance of the product of two independent, zero-mean random variables $q_i$ and $k_i$ is:

$$\mathrm{Var}(q_i k_i) = E[q_i^2 k_i^2] - (E[q_i k_i])^2 = E[q_i^2]\,E[k_i^2] - 0 = 1 \times 1 = 1$$

Since the components of the vectors are independent, the variance of their sum (the dot product) is the sum of their variances:

$$\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k \times 1 = d_k$$

Conclusion: As the dimensionality $d_k$ increases, the variance of the dot product grows linearly with $d_k$ (and its standard deviation grows as $\sqrt{d_k}$). For modern models where $d_k = 64$ or $128$, the resulting scores often fall into extreme ranges (e.g., $+50$ or $-60$).
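
This prediction is easy to check empirically. Below is a minimal NumPy sketch; the sample count and the specific dimensions are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Verify empirically that Var(q . k) grows linearly with d_k when the
# components q_i, k_i are i.i.d. with zero mean and unit variance.
for d_k in (16, 64, 128, 512):
    n_samples = 100_000
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.einsum("ij,ij->i", q, k)  # one dot product per sample row
    print(f"d_k={d_k:4d}  Var(q.k) = {dots.var():7.1f}  (theory: {d_k})")
```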


2. The Softmax Saturation Problem

The output of the dot product is passed into the softmax function to translate similarity scores into a probability distribution:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(S)\,V, \quad \text{where } S_{ij} = \frac{q_i \cdot k_j}{\text{scale}}$$

The softmax function's behavior is dictated by the relative differences between its inputs. When the variance of the input $S$ is large, the values $e^{S_i}$ diverge exponentially.
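
To see this saturation numerically, the following sketch compares the softmax of raw, high-variance scores against the same scores divided by $\sqrt{d_k}$; the dimension $d_k = 64$ and the vector of eight scores are illustrative choices:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 64
scores = rng.standard_normal(8) * np.sqrt(d_k)  # raw scores with std = sqrt(d_k) = 8

print(softmax(scores))                  # close to one-hot: a single weight dominates
print(softmax(scores / np.sqrt(d_k)))   # smooth distribution spread over all positions
```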

2.1 Gradient Vanishing

If the scores are large, the softmax function "saturates," producing a distribution that approaches a one-hot vector. In this saturated state, the local gradients of the softmax function become extremely small:

  • When $x$ is large, $\frac{\partial\, \mathrm{softmax}(x)}{\partial x} \approx 0$.

During backpropagation, these near-zero gradients effectively halt the updates to the weight matrices $W_Q$ and $W_K$, leading to the Vanishing Gradient problem and preventing the model from converging.
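
The collapse of the gradient can be made concrete with the softmax Jacobian, $\partial\,\mathrm{softmax}(x)/\partial x = \mathrm{diag}(p) - p p^{\top}$ with $p = \mathrm{softmax}(x)$. The sketch below uses hand-picked score values purely for illustration:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(x):
    # d softmax(x) / dx = diag(p) - p p^T, with p = softmax(x)
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)

unscaled = np.array([42.0, -10.0, 25.0, -31.0])  # high-variance scores (illustrative)
scaled = unscaled / np.sqrt(64)                  # the same scores divided by sqrt(d_k)

print(np.abs(softmax_jacobian(unscaled)).max())  # ~4e-8: gradients effectively vanish
print(np.abs(softmax_jacobian(scaled)).max())    # ~0.1: a usable gradient signal
```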


3. Stabilization via $\sqrt{d_k}$ Scaling

To maintain the sensitivity of the softmax function, we must ensure that the variance of the input to the softmax remains independent of the dimension $d_k$.

By applying a scaling factor of $c = \frac{1}{\sqrt{d_k}}$, we utilize the variance property $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$:

$$\mathrm{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \left(\frac{1}{\sqrt{d_k}}\right)^2 \mathrm{Var}(q \cdot k) = \frac{1}{d_k} \times d_k = 1$$

This normalization ensures that the input to the softmax has roughly unit variance regardless of how large the model’s internal hidden dimensions grow.
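
Putting the derivation into code, the following is a minimal single-head sketch of scaled dot-product attention in NumPy (no batching, masking, or learned projections; the function name and shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns an (n, d_v) matrix of weighted values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # unit-variance logits, per Section 3
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d_k = 4, 64
Q, K, V = rng.standard_normal((3, n, d_k))  # unpack three (n, d_k) matrices

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 64)
```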


4. Performance Comparison

| Attribute | Unscaled Dot-Product | Scaled Dot-Product ($1/\sqrt{d_k}$) |
| --- | --- | --- |
| Input Variance | $d_k$ (High) | $1.0$ (Controlled) |
| Softmax State | Saturated (Approaching One-Hot) | Smooth / Distributed |
| Gradient Flow | Vanishing (Near Zero) | Robust / Healthy |
| Training Stability | Poor / High Risk of Divergence | High / Faster Convergence |

> [!IMPORTANT]
> Key Research Insight: Scaling the dot product is the primary mechanism that allows Transformers to utilize extremely high-dimensional latent spaces. Without the $\sqrt{d_k}$ term, increasing the depth and width of a Transformer would lead to immediate failure in gradient propagation.

Want to integrate PomaiDB into your project?

Read the Engineering Manual