Engineering Blog
Scientific Research | 2026-04-08

Automata as Memory: Decoding LSTM State Persistence in Terminal Sequences

Scientific Research Team | Industrial Case Study

LSTM Terminal State Mechanics

In the architectural study of Recurrent Neural Networks (RNNs), specifically the Long Short-Term Memory (LSTM) variant, a critical point of confusion often arises regarding the terminal behavior of the cell. At the final time-step $L$ of a sequence, the LSTM unit produces two distinct state vectors: the Cell State ($C_L$) and the Hidden State ($h_L$).

This article formalizes the functional divergence of these states and clarifies their utilization in downstream computational layers.


1. Functional Duality: $C_t$ vs. $h_t$

The LSTM architecture is defined by its ability to modulate information flow through a system of gating mechanisms. This leads to a fundamental duality in its internal representation:

  1. Cell State ($C_t$): Representing the "Long-Term Memory" or the internal conveyor of information. It acts as a high-capacity buffer that persists across time-steps with minimal linear interactions, mitigating the vanishing gradient problem.
  2. Hidden State ($h_t$): Representing the "Short-Term Memory" or the active output of the cell. It is a filtered, non-linear projection of the current Cell State, designed to be consumed by the next layer or the next time-step.

At the terminal step $t = L$, both vectors are computed, but their subsequent roles differ based on the network topology.
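
To make this duality concrete, the following minimal sketch (assuming PyTorch's `nn.LSTM`; the tensor sizes are purely illustrative) runs a batch of sequences through a single-layer LSTM and inspects the two terminal vectors it returns.

```python
# Minimal sketch (assuming PyTorch): inspecting the two terminal state
# vectors h_L and C_L returned by a single-layer LSTM.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 20, 8)        # batch of 4 sequences, L = 20 steps, 8 features
output, (h_L, c_L) = lstm(x)     # output: h_t for every t; (h_L, c_L): terminal states

print(output.shape)              # torch.Size([4, 20, 16]) -- h_1 ... h_L
print(h_L.shape, c_L.shape)      # torch.Size([1, 4, 16]) each -- one vector per layer
# For an unpadded, unidirectional LSTM, h_L equals the last time-step of `output`:
assert torch.allclose(h_L[0], output[:, -1, :])
```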


2. Mathematical Formalism of the Terminal Transition

The computation at the terminal step follows the standard LSTM transition equations. There is no structural deviation at $t = L$. The state updates are governed by the Forget ($f_L$), Input ($i_L$), and Output ($o_L$) gates:

$$f_L = \sigma(W_f \cdot [h_{L-1}, x_L] + b_f)$$
$$i_L = \sigma(W_i \cdot [h_{L-1}, x_L] + b_i)$$
$$o_L = \sigma(W_o \cdot [h_{L-1}, x_L] + b_o)$$
$$\tilde{C}_L = \tanh(W_C \cdot [h_{L-1}, x_L] + b_C)$$

The final states are then derived as:
$$C_L = f_L \odot C_{L-1} + i_L \odot \tilde{C}_L$$
$$h_L = o_L \odot \tanh(C_L)$$

where $\odot$ denotes the Hadamard product. Once the sequence $X = \{x_1, x_2, \dots, x_L\}$ is exhausted, the recursion terminates, and the states $\{C_L, h_L\}$ are passed to the terminal interface.
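
The sketch below restates these transition equations directly in NumPy. The weight names mirror the formulas above; all shapes and parameters are illustrative rather than taken from any particular model.

```python
# Minimal sketch of the terminal transition, written directly in NumPy.
# Weight names mirror the equations above; shapes are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_C, b_f, b_i, b_o, b_C):
    """One LSTM transition; at t = L this yields the terminal states (C_L, h_L)."""
    z = np.concatenate([h_prev, x_t])      # concatenated [h_{L-1}, x_L]
    f = sigmoid(W_f @ z + b_f)             # forget gate f_L
    i = sigmoid(W_i @ z + b_i)             # input gate  i_L
    o = sigmoid(W_o @ z + b_o)             # output gate o_L
    c_tilde = np.tanh(W_C @ z + b_C)       # candidate state
    c = f * c_prev + i * c_tilde           # C_L = f_L (*) C_{L-1} + i_L (*) candidate
    h = o * np.tanh(c)                     # h_L = o_L (*) tanh(C_L)
    return c, h
```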


3. Downstream Topology and State Utilization

The decision to utilize $h_L$, $C_L$, or both is strictly task-dependent. We categorize these into two primary architectural patterns:

Pattern A: Many-to-One (Sequence Classification)

In tasks such as sentiment analysis or document categorization, the objective is to map a sequence to a single categorical distribution.

  • Mechanism: The network consumes the entire sequence. At $t = L$, only the Hidden State ($h_L$) is extracted.
  • Transformation: $y = \text{softmax}(W_y h_L + b_y)$.
  • State Persistence: The Cell State $C_L$ is discarded, as its role in maintaining long-term gradients is complete (see the sketch after this list).
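
A minimal Pattern A sketch, assuming PyTorch (the class and layer names are illustrative): only $h_L$ reaches the classification head, and $C_L$ is dropped.

```python
# Minimal sketch of Pattern A (assuming PyTorch): a many-to-one classifier
# that keeps only h_L and discards C_L. Names are illustrative.
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)   # W_y, b_y

    def forward(self, x):                    # x: (batch, L, input_size)
        _, (h_L, _c_L) = self.lstm(x)        # C_L is ignored after the last step
        logits = self.head(h_L[-1])          # softmax is typically folded into the loss
        return logits
```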

Pattern B: Many-to-Many (Sequence-to-Sequence / Encoder-Decoder)

In neural machine translation or generative tasks, the LSTM acts as an Encoder that must "hand over" the compressed context to a Decoder.

  • Mechanism: To preserve the maximum information density, the Encoder transmits both $h_L$ and $C_L$ (sketched after this list).
  • Transformation: The Decoder is initialized such that $h_0^{dec} = h_L^{enc}$ and $C_0^{dec} = C_L^{enc}$.
  • State Persistence: Here, $C_L$ is essential, as it carries the "core" context that has not been filtered by the final output gate $o_L$, providing the Decoder with a richer initialization.
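
A minimal Pattern B sketch, again assuming PyTorch (sizes and variable names are illustrative): the encoder's terminal pair $(h_L, C_L)$ initializes the decoder's recurrence.

```python
# Minimal sketch of Pattern B (assuming PyTorch): the encoder's terminal pair
# (h_L, C_L) initializes the decoder. Sizes and names are illustrative.
import torch
import torch.nn as nn

hidden_size = 32
encoder = nn.LSTM(input_size=8,  hidden_size=hidden_size, batch_first=True)
decoder = nn.LSTM(input_size=10, hidden_size=hidden_size, batch_first=True)

src = torch.randn(4, 25, 8)        # source sequence, L = 25
tgt = torch.randn(4, 15, 10)       # target-side inputs (e.g. shifted embeddings)

_, (h_L, c_L) = encoder(src)       # compressed context of the whole source sequence
dec_out, _ = decoder(tgt, (h_L, c_L))   # h_0^dec = h_L^enc, C_0^dec = C_L^enc
```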

4. Architectural Conclusions

The distinction between the Hidden State and Cell State at the terminal step is not one of mathematical difference, but of functional utility. While $h_L$ serves as the summarized representation of the sequence for immediate inference, $C_L$ remains the primary vessel for long-distance context preservation. In multi-layered (stacked) LSTM architectures, each layer maintains its own $(h_t, C_t)$ pair: the hidden-state sequence propagates vertically as input to the subsequent layer, while the cell state persists horizontally within its layer, keeping that layer's "memory" intact across the full sequence.
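
As a quick illustration of the stacked case (assuming PyTorch; sizes illustrative), the terminal states contain one $(h_L, C_L)$ pair per layer, while the layer output exposes only the top layer's hidden states.

```python
# Minimal sketch (assuming PyTorch): a stacked LSTM keeps one (h, C) pair per
# layer; only the hidden states flow upward as the next layer's input.
import torch
import torch.nn as nn

stacked = nn.LSTM(input_size=8, hidden_size=16, num_layers=3, batch_first=True)
x = torch.randn(4, 20, 8)

output, (h_L, c_L) = stacked(x)
print(h_L.shape, c_L.shape)        # torch.Size([3, 4, 16]) each -- one pair per layer
print(output.shape)                # torch.Size([4, 20, 16]) -- top layer's h_t only
```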

[!NOTE] Research Insight: While $h_L$ is a filtered version of $C_L$, using $h_L$ for prediction is generally preferred as it incorporates the non-linear gating logic necessary to suppress noise and focus on task-relevant features.
