Engineering Blog
Scientific Research | 2026-04-08

Automata as Memory: Decoding LSTM State Persistence in Terminal Sequences

Scientific Research Team | Industrial Case Study

LSTM Terminal State Mechanics

In the architectural study of Recurrent Neural Networks (RNNs), specifically the Long Short-Term Memory (LSTM) variant, a critical point of confusion often arises regarding the terminal behavior of the cell. At the final time-step $L$ of a sequence, the LSTM unit produces two distinct state vectors: the Cell State ($C_L$) and the Hidden State ($h_L$).

This article formalizes the functional divergence of these states and clarifies their utilization in downstream computational layers.


1. Functional Duality: $C_t$ vs. $h_t$

The LSTM architecture is defined by its ability to modulate information flow through a system of gating mechanisms. This leads to a fundamental duality in its internal representation:

  1. Cell State ($C_t$): Representing the "Long-Term Memory" or the internal conveyor of information. It acts as a high-capacity buffer that persists across time-steps with minimal linear interactions, mitigating the vanishing gradient problem.
  2. Hidden State ($h_t$): Representing the "Short-Term Memory" or the active output of the cell. It is a filtered, non-linear projection of the current Cell State, designed to be consumed by the next layer or the next time-step.

At the terminal step $t = L$, both vectors are computed, but their subsequent roles differ based on the network topology.
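
To make this duality concrete, the following minimal sketch (assuming PyTorch's `nn.LSTM`; the tensor sizes are purely illustrative) runs a batch of sequences through a single-layer LSTM and inspects the two terminal vectors it returns.

```python
# Minimal sketch (assuming PyTorch): inspecting the two terminal state
# vectors h_L and C_L returned by a single-layer LSTM.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 20, 8)        # batch of 4 sequences, L = 20 steps, 8 features
output, (h_L, c_L) = lstm(x)     # output: h_t for every t; (h_L, c_L): terminal states

print(output.shape)              # torch.Size([4, 20, 16]) -- h_1 ... h_L
print(h_L.shape, c_L.shape)      # torch.Size([1, 4, 16]) each -- one vector per layer
# For an unpadded, unidirectional LSTM, h_L equals the last time-step of `output`:
assert torch.allclose(h_L[0], output[:, -1, :])
```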


2. Mathematical Formalism of the Terminal Transition

The computation at the terminal step follows the standard LSTM transition equations. There is no structural deviation at $t = L$. The state updates are governed by the Forget ($f_L$), Input ($i_L$), and Output ($o_L$) gates:

$$f_L = \sigma(W_f \cdot [h_{L-1}, x_L] + b_f)$$
$$i_L = \sigma(W_i \cdot [h_{L-1}, x_L] + b_i)$$
$$o_L = \sigma(W_o \cdot [h_{L-1}, x_L] + b_o)$$
$$\tilde{C}_L = \tanh(W_C \cdot [h_{L-1}, x_L] + b_C)$$

The final states are then derived as:
$$C_L = f_L \odot C_{L-1} + i_L \odot \tilde{C}_L$$
$$h_L = o_L \odot \tanh(C_L)$$

where $\odot$ denotes the Hadamard product. Once the sequence $X = \{x_1, x_2, \dots, x_L\}$ is exhausted, the recursion terminates, and the states $\{C_L, h_L\}$ are passed to the terminal interface.
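
The sketch below restates these transition equations directly in NumPy. The weight names mirror the formulas above; all shapes and parameters are illustrative rather than taken from any particular model.

```python
# Minimal sketch of the terminal transition, written directly in NumPy.
# Weight names mirror the equations above; shapes are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_C, b_f, b_i, b_o, b_C):
    """One LSTM transition; at t = L this yields the terminal states (C_L, h_L)."""
    z = np.concatenate([h_prev, x_t])      # concatenated [h_{L-1}, x_L]
    f = sigmoid(W_f @ z + b_f)             # forget gate f_L
    i = sigmoid(W_i @ z + b_i)             # input gate  i_L
    o = sigmoid(W_o @ z + b_o)             # output gate o_L
    c_tilde = np.tanh(W_C @ z + b_C)       # candidate state
    c = f * c_prev + i * c_tilde           # C_L = f_L (*) C_{L-1} + i_L (*) candidate
    h = o * np.tanh(c)                     # h_L = o_L (*) tanh(C_L)
    return c, h
```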


3. Downstream Topology and State Utilization

The decision to utilize $h_L$, $C_L$, or both is strictly task-dependent. We categorize these into two primary architectural patterns:

Pattern A: Many-to-One (Sequence Classification)

In tasks such as sentiment analysis or document categorization, the objective is to map a sequence to a single categorical distribution.

  • Mechanism: The network consumes the entire sequence. At $t = L$, only the Hidden State ($h_L$) is extracted.
  • Transformation: $y = \text{softmax}(W_y h_L + b_y)$.
  • State Persistence: The Cell State $C_L$ is discarded, as its role in maintaining long-term gradients is complete (see the sketch after this list).
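
A minimal Pattern A sketch, assuming PyTorch (the class and layer names are illustrative): only $h_L$ reaches the classification head, and $C_L$ is dropped.

```python
# Minimal sketch of Pattern A (assuming PyTorch): a many-to-one classifier
# that keeps only h_L and discards C_L. Names are illustrative.
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)   # W_y, b_y

    def forward(self, x):                    # x: (batch, L, input_size)
        _, (h_L, _c_L) = self.lstm(x)        # C_L is ignored after the last step
        logits = self.head(h_L[-1])          # softmax is typically folded into the loss
        return logits
```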

Pattern B: Many-to-Many (Sequence-to-Sequence / Encoder-Decoder)

In neural machine translation or generative tasks, the LSTM acts as an Encoder that must "hand over" the compressed context to a Decoder.

  • Mechanism: To preserve the maximum information density, the Encoder transmits both $h_L$ and $C_L$ (sketched after this list).
  • Transformation: The Decoder is initialized such that $h_0^{dec} = h_L^{enc}$ and $C_0^{dec} = C_L^{enc}$.
  • State Persistence: Here, $C_L$ is essential, as it carries the "core" context that has not been filtered by the final output gate $o_L$, providing the Decoder with a richer initialization.
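
A minimal Pattern B sketch, again assuming PyTorch (sizes and variable names are illustrative): the encoder's terminal pair $(h_L, C_L)$ initializes the decoder's recurrence.

```python
# Minimal sketch of Pattern B (assuming PyTorch): the encoder's terminal pair
# (h_L, C_L) initializes the decoder. Sizes and names are illustrative.
import torch
import torch.nn as nn

hidden_size = 32
encoder = nn.LSTM(input_size=8,  hidden_size=hidden_size, batch_first=True)
decoder = nn.LSTM(input_size=10, hidden_size=hidden_size, batch_first=True)

src = torch.randn(4, 25, 8)        # source sequence, L = 25
tgt = torch.randn(4, 15, 10)       # target-side inputs (e.g. shifted embeddings)

_, (h_L, c_L) = encoder(src)       # compressed context of the whole source sequence
dec_out, _ = decoder(tgt, (h_L, c_L))   # h_0^dec = h_L^enc, C_0^dec = C_L^enc
```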

4. Architectural Conclusions

The distinction between the Hidden State and Cell State at the terminal step is not one of mathematical difference, but of functional utility. While $h_L$ serves as the summarized representation of the sequence for immediate inference, $C_L$ remains the primary vessel for long-distance context preservation. In multi-layered (stacked) LSTM architectures, each layer maintains its own $(h_t, C_t)$ pair: the hidden-state sequence propagates vertically as input to the subsequent layer, while the cell state persists horizontally within its layer, keeping that layer's "memory" intact across the full sequence.
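
As a quick illustration of the stacked case (assuming PyTorch; sizes illustrative), the terminal states contain one $(h_L, C_L)$ pair per layer, while the layer output exposes only the top layer's hidden states.

```python
# Minimal sketch (assuming PyTorch): a stacked LSTM keeps one (h, C) pair per
# layer; only the hidden states flow upward as the next layer's input.
import torch
import torch.nn as nn

stacked = nn.LSTM(input_size=8, hidden_size=16, num_layers=3, batch_first=True)
x = torch.randn(4, 20, 8)

output, (h_L, c_L) = stacked(x)
print(h_L.shape, c_L.shape)        # torch.Size([3, 4, 16]) each -- one pair per layer
print(output.shape)                # torch.Size([4, 20, 16]) -- top layer's h_t only
```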

[!NOTE] Research Insight: While $h_L$ is a filtered version of $C_L$, using $h_L$ for prediction is generally preferred as it incorporates the non-linear gating logic necessary to suppress noise and focus on task-relevant features.
