Scaling in Transformer Architectures: The Mathematical Rationale behind $\sqrt{d_k}$

In the architecture of Transformers, the self-attention mechanism is defined by Scaled Dot-Product Attention. While the dot product measures the similarity between Query ($Q$) and Key ($K$) vectors, the division by $\sqrt{d_k}$ is not a heuristic convenience but a critical stabilization step required for numerical optimization.
This paper formalizes the relationship between vector dimensionality, variance explosion, and the subsequent saturation of the softmax function.
1. The Variance of High-Dimensional Dot Products
Consider two vectors, a query $q$ and a key $k$, both of dimension $d_k$. The dot product is defined as:

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$

Under standard weight initialization schemes (e.g., Xavier or Kaiming initialization), we assume that the components $q_i$ and $k_i$ are independent random variables with a mean of zero and a variance of one:

$$\mathbb{E}[q_i] = \mathbb{E}[k_i] = 0, \qquad \mathrm{Var}(q_i) = \mathrm{Var}(k_i) = 1$$
1.1 Mathematical Derivation
The variance of the product of two independent random variables $q_i$ and $k_i$ with zero mean is:

$$\mathrm{Var}(q_i k_i) = \mathbb{E}[q_i^2 k_i^2] - \left(\mathbb{E}[q_i k_i]\right)^2 = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = \mathrm{Var}(q_i)\,\mathrm{Var}(k_i) = 1$$

(The cross term vanishes because $\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0$.) Since the components of the vectors are independent, the variance of their sum (the dot product) is the sum of their variances:

$$\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k$$
Conclusion: As the dimensionality $d_k$ increases, the variance of the dot product grows linearly with $d_k$. For modern models where $d_k = 64$ or $d_k = 128$, unscaled scores have a standard deviation of $\sqrt{d_k}$ and routinely reach extreme magnitudes (e.g., on the order of $\pm 3\sqrt{d_k} \approx \pm 34$ for $d_k = 128$).
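This linear growth is easy to verify empirically. The following is a minimal NumPy sketch (the sample counts and dimensions are illustrative): it draws zero-mean, unit-variance query/key pairs and measures the variance of their dot products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirically verify Var(q . k) = d_k for zero-mean, unit-variance components.
for d_k in (16, 64, 256, 1024):
    q = rng.standard_normal((20_000, d_k))  # 20,000 independent query vectors
    k = rng.standard_normal((20_000, d_k))  # 20,000 independent key vectors
    scores = np.einsum("nd,nd->n", q, k)    # one dot product per (q, k) pair
    print(f"d_k={d_k:5d}  Var(q.k)={scores.var():8.1f}  (theory: {d_k})")
```

The measured variance tracks $d_k$ closely, matching the derivation above.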
2. The Softmax Saturation Problem
The output of the dot product is passed into the softmax function to translate similarity scores into a probability distribution:

$$p_i = \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

The softmax function's behavior is dictated by the relative differences between its inputs. When the variance of the input is large, the exponentials $e^{x_i}$ diverge, and the largest score dominates the normalization.
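The saturation is visible even in a toy example. A minimal sketch (the score values are arbitrary, chosen only to contrast low and high variance):

```python
import numpy as np

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the output.
    e = np.exp(x - x.max())
    return e / e.sum()

low_var = np.array([1.0, 2.0, 3.0])      # typical of scaled scores
high_var = np.array([10.0, 20.0, 30.0])  # same pattern, 100x the variance

print(softmax(low_var))   # [0.090 0.245 0.665]   -- smooth distribution
print(softmax(high_var))  # [2.1e-09 4.5e-05 ~1.0] -- effectively one-hot
```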
2.1 Gradient Vanishing
If the scores are large, the softmax function "saturates," producing a distribution that approaches a one-hot vector. In this saturated state, the local gradients of the softmax, given by the Jacobian

$$\frac{\partial p_i}{\partial x_j} = p_i \left( \delta_{ij} - p_j \right),$$

become extremely small:

- When the gap between the largest score and the rest is large, $p_{\max} \to 1$ and all other $p_j \to 0$, so every entry $p_i(\delta_{ij} - p_j) \to 0$.

During backpropagation, these near-zero gradients effectively halt the updates to the projection matrices $W_Q$ and $W_K$, leading to the Vanishing Gradient problem and preventing the model from converging.
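The collapse of the Jacobian can be checked numerically. A minimal sketch, assuming the Jacobian formula above (the scale factors are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(p):
    # J[i, j] = p_i * (delta_ij - p_j)
    return np.diag(p) - np.outer(p, p)

for scale in (1.0, 10.0):
    p = softmax(scale * np.array([1.0, 2.0, 3.0]))
    J = softmax_jacobian(p)
    print(f"scale={scale:4.1f}  max |dp_i/dx_j| = {np.abs(J).max():.1e}")
    # scale= 1.0 -> ~2.2e-01 (healthy gradients)
    # scale=10.0 -> ~4.5e-05 (vanishing gradients)
```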
3. Stabilization via Scaling
To maintain the sensitivity of the softmax function, we must ensure that the variance of its input remains independent of the dimension $d_k$.

By applying a scaling factor of $\frac{1}{\sqrt{d_k}}$, we utilize the property of variance $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$:

$$\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{1}{d_k}\,\mathrm{Var}(q \cdot k) = \frac{d_k}{d_k} = 1$$

This normalization ensures that the input to the softmax has roughly unit variance regardless of how large the model's internal hidden dimensions grow.
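Combining the pieces, here is a minimal NumPy sketch of scaled dot-product attention (single head, no masking; the shapes and function name are illustrative, not a reference implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled scores: variance ~1 instead of ~d_k
    # Row-wise softmax with max-subtraction for numerical stability.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V               # (n_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 64)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (8, 64)
```

Dropping the `np.sqrt(d_k)` divisor in this sketch reproduces the saturated, near-one-hot attention weights described in Section 2.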
4. Performance Comparison
| Attribute | Unscaled Dot-Product | Scaled Dot-Product ($\div\,\sqrt{d_k}$) |
|---|---|---|
| Input Variance | $d_k$ (High) | $\approx 1$ (Controlled) |
| Softmax State | Saturated (Approaching One-Hot) | Smooth / Distributed |
| Gradient Flow | Vanishing (Near Zero) | Robust / Healthy |
| Training Stability | Poor / High Risk of Divergence | High / Faster Convergence |
> [!IMPORTANT]
> Key Research Insight: Scaling the dot product is the primary mechanism that allows Transformers to utilize extremely high-dimensional latent spaces. Without this term, increasing the depth and width of a Transformer would lead to immediate failure in gradient propagation.