Engineering Blog
Scientific Research | 2026-04-06

Mathematical Foundations of Convolutional Architectures: A Spatiotemporal Research Study

Scientific Research Team | Industrial Case Study

The fundamental operation of Convolutional Neural Networks (CNNs) is the extraction of local features through discrete convolution. This study formalizes the mapping of input tensors to feature spaces across varying dimensionalities, providing a unified mathematical framework for signals, spatial images, and volumetric tensors.

1. Formal Mathematical Framework

The convolution operation in discrete space is defined as the summed product of a sliding kernel and an input tensor. For an input tensor $X$ and a kernel $W$, the resulting feature map $Y$ is determined by the dot product at every valid spatial coordinate.

1D Temporal Convolution

In sequential processing, the 1D convolution captures temporal dependencies along a single axis.

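Mathematical Definition:

$$y[n] = (x * w)[n] = \sum_{k=-\infty}^{\infty} x[k] \cdot w[n-k]$$

Deep learning frameworks typically implement the unflipped variant (cross-correlation), $y[n] = \sum_{k} x[n+k] \cdot w[k]$, which is the form used in the 2D and 3D definitions and in the worked example in Section 4.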
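As a minimal sketch of the definition above, the following pure-Python functions compute the full discrete convolution of two finite sequences and, for comparison, the unflipped "valid" cross-correlation. The function names are illustrative and not taken from any library; the sequences are assumed to be zero outside their stored range.

```python
def convolve_1d(x, w):
    """Full discrete convolution: y[n] = sum_k x[k] * w[n - k]."""
    n_out = len(x) + len(w) - 1
    y = [0] * n_out
    for n in range(n_out):
        for k in range(len(x)):
            # w is treated as zero outside its stored indices
            if 0 <= n - k < len(w):
                y[n] += x[k] * w[n - k]
    return y


def cross_correlate_1d_valid(x, w):
    """'Valid' cross-correlation: y[n] = sum_k x[n + k] * w[k] (kernel not flipped)."""
    n_out = len(x) - len(w) + 1
    return [sum(x[n + k] * w[k] for k in range(len(w))) for n in range(n_out)]


x = [1, 2, 3]
w = [0, 1]
print(convolve_1d(x, w))               # [0, 1, 2, 3]
print(cross_correlate_1d_valid(x, w))  # [2, 3]
```

Convolving with the reversed kernel and keeping only the fully overlapping positions reproduces the cross-correlation, which is why the two conventions are interchangeable when the kernel is learned.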

Dimensional Mapping & Convolutional Algebra: To formalize the feature space reduction, consider an input sequence of length $L$ and a temporal kernel of size $K$. With padding $P$ and stride $S$, the output dimension $L'$ follows the algebraic mapping:

$$L' = \left\lfloor \frac{L + 2P - K}{S} \right\rfloor + 1$$

This mapping explicitly defines the dimensions of the feature tensor passed to subsequent computational layers, ensuring architectural compatibility across deep sequential hierarchies.
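The output-length mapping can be checked with a few lines of Python. This is a sketch under the assumption that padding is applied symmetrically on both ends; `conv_output_length` is a hypothetical helper name.

```python
def conv_output_length(length: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output length L' = floor((L + 2P - K) / S) + 1."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    if length + 2 * padding < kernel:
        raise ValueError("kernel is larger than the padded input")
    # Integer floor division implements the floor in the formula
    return (length + 2 * padding - kernel) // stride + 1


print(conv_output_length(6, 3))             # 4
print(conv_output_length(6, 3, padding=1))  # 6
print(conv_output_length(10, 3, stride=2))  # 4
```

Chaining this function layer by layer gives the sequence length at every depth before any tensor is allocated.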

Application Analysis: 1D CNNs are well suited to sequence modeling where local connectivity is defined by temporal proximity. This makes them a common choice for high-frequency signal analysis and real-time telemetry processing in resource-constrained environments.


2. Spatial Mapping in 2D CNNs

Standard image processing employs 2D convolutions to extract local spatial features. The kernel $K$ slides along the height ($H$) and width ($W$) axes.

Mathematical Definition:

$$O[i, j] = \sum_{m} \sum_{n} I[i+m, j+n] \cdot K[m, n]$$
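The double sum above translates directly into nested loops. The following is a naive single-channel sketch on nested Python lists, assuming no padding and unit stride; `correlate_2d_valid` is an illustrative name, and a production implementation would use an optimized tensor library instead.

```python
def correlate_2d_valid(image, kernel):
    """O[i][j] = sum_m sum_n I[i+m][j+n] * K[m][n] over all fully overlapping positions."""
    kh, kw = len(kernel), len(kernel[0])
    h_out = len(image) - kh + 1
    w_out = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(w_out)
        ]
        for i in range(h_out)
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]  # difference along the main diagonal
print(correlate_2d_valid(image, kernel))  # [[-4, -4], [-4, -4]]
```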

Spatial Dimensionality Mechanics: For an input tensor with height $H_{in}$ and width $W_{in}$, a kernel $K \in \mathbb{R}^{k_h \times k_w}$, stride $S=(s_h, s_w)$, and padding $P=(p_h, p_w)$, the output feature map has spatial dimensions:

$$H_{out} = \left\lfloor \frac{H_{in} + 2p_h - k_h}{s_h} \right\rfloor + 1$$

$$W_{out} = \left\lfloor \frac{W_{in} + 2p_w - k_w}{s_w} \right\rfloor + 1$$

Because each axis has its own kernel size, stride, and padding, the reduction is anisotropic whenever these parameters differ between the height and width axes.
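A short sketch of the per-axis calculation, with an example where unequal strides produce an anisotropic output. The helper name `conv2d_output_shape` and the 224 x 224 input size are assumptions chosen for illustration.

```python
from typing import Tuple


def conv2d_output_shape(
    in_shape: Tuple[int, int],
    kernel: Tuple[int, int],
    stride: Tuple[int, int] = (1, 1),
    padding: Tuple[int, int] = (0, 0),
) -> Tuple[int, ...]:
    """Apply floor((N + 2p - k) / s) + 1 independently to the height and width axes."""
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(in_shape, kernel, stride, padding)
    )


# Stride 2 along height, stride 1 along width: only the height is halved
print(conv2d_output_shape((224, 224), (3, 3), stride=(2, 1), padding=(1, 1)))  # (112, 224)
```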

Receptive Field Calculation

The receptive field $R_l$ of a layer $l$ is given recursively by:

$$R_l = R_{l-1} + (k_l - 1) \prod_{i=1}^{l-1} s_i$$

where $k_l$ is the kernel size of layer $l$, $s_i$ is the stride of layer $i$, and $R_0 = 1$ (a single input element). The recursion shows that the receptive field grows with depth, so deeper layers aggregate increasingly global information from the input space.
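The recursion can be evaluated layer by layer by carrying the running product of strides. This sketch assumes $R_0 = 1$ and one kernel size and stride per layer along a single axis; `receptive_fields` is a hypothetical helper name.

```python
def receptive_fields(layers):
    """Return [R_1, ..., R_L] for a list of (kernel_size, stride) pairs, with R_0 = 1."""
    r = 1     # receptive field of the input itself
    jump = 1  # product of the strides of all previous layers
    result = []
    for kernel_size, stride in layers:
        r += (kernel_size - 1) * jump
        result.append(r)
        jump *= stride
    return result


print(receptive_fields([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
print(receptive_fields([(3, 2), (3, 2), (3, 1)]))  # [3, 7, 15]
```

The second call shows how strided layers make the receptive field grow faster than stacking unit-stride layers of the same kernel size.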


3. Spatiotemporal Volumetric Convolution (3D CNNs)

3D convolutions extend the kernel dimensionality to capture information across height, width, and a third dimension, usually depth (volumetric data) or time (video).

Mathematical Definition:

$$V[x, y, z] = \sum_{i} \sum_{j} \sum_{k} I[x+i, y+j, z+k] \cdot K[i, j, k]$$

Volumetric Tensor Transformation: For a volumetric input $I \in \mathbb{R}^{D_{in} \times H_{in} \times W_{in}}$ with depth $D_{in}$, height $H_{in}$, and width $W_{in}$, the same reduction rule applies independently along each of the three axes. The depth-axis window size $k_d$ of the spatiotemporal kernel $(k_d, k_h, k_w)$, together with the depth stride and padding $(s_d, p_d)$, dictates the output depth:

$$D_{out} = \left\lfloor \frac{D_{in} + 2p_d - k_d}{s_d} \right\rfloor + 1$$

(with analogous expressions for $H_{out}$ and $W_{out}$).

These output dimensions determine the computational cost of the layer, $\mathcal{O}(D_{out} \cdot H_{out} \cdot W_{out} \cdot C_{in} \cdot C_{out} \cdot k_d \cdot k_h \cdot k_w)$, where $C_{in}$ and $C_{out}$ are the numbers of input and output channels, so the cost can be estimated prior to execution.
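As a rough sketch, the output volume and the corresponding multiply-accumulate count can be estimated before any tensor is allocated. The helper names, the 16 x 112 x 112 clip size, and the channel counts are assumptions for illustration; the count treats the complexity bound as an exact product and ignores bias terms.

```python
def conv3d_output_shape(in_shape, kernel, stride=(1, 1, 1), padding=(0, 0, 0)):
    """Apply floor((N + 2p - k) / s) + 1 to the depth, height, and width axes."""
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(in_shape, kernel, stride, padding)
    )


def conv3d_macs(out_shape, c_in, c_out, kernel):
    """Multiply-accumulates: D_out * H_out * W_out * C_in * C_out * k_d * k_h * k_w."""
    d_out, h_out, w_out = out_shape
    k_d, k_h, k_w = kernel
    return d_out * h_out * w_out * c_in * c_out * k_d * k_h * k_w


out_shape = conv3d_output_shape((16, 112, 112), (3, 3, 3),
                                stride=(2, 2, 2), padding=(1, 1, 1))
print(out_shape)                                  # (8, 56, 56)
print(conv3d_macs(out_shape, 3, 64, (3, 3, 3)))   # 130056192
```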

Complexity Comparison: A 3D kernel of size $k \times k \times k$ has $k^3$ parameters per input-output channel pair, compared with $k^2$ for a 2D kernel. For a video sequence of length $T$ with frames of size $H \times W$, the computational cost is proportional to $T \cdot H \cdot W \cdot k^3$. To mitigate this, practitioners often use Pseudo-3D Convolutions (P3D) or (2+1)D blocks, which factorize the 3D kernel into separate spatial ($1 \times k \times k$) and temporal ($k \times 1 \times 1$) components.
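The saving from factorization can be illustrated by counting weights. This sketch assumes a spatial $1 \times k \times k$ convolution followed by a temporal $k \times 1 \times 1$ convolution, with a hypothetical intermediate channel count `c_mid` that defaults to the output channel count; bias terms are ignored, and published architectures may choose `c_mid` differently.

```python
def conv3d_params(c_in, c_out, k):
    """Weights in a full k x k x k 3D convolution."""
    return c_in * c_out * k ** 3


def factorized_params(c_in, c_out, k, c_mid=None):
    """Weights in a spatial (1 x k x k) plus temporal (k x 1 x 1) pair of convolutions."""
    if c_mid is None:
        c_mid = c_out  # assumption: intermediate width equals output width
    spatial = c_in * c_mid * k * k
    temporal = c_mid * c_out * k
    return spatial + temporal


print(conv3d_params(1, 1, 3), factorized_params(1, 1, 3))      # 27 12
print(conv3d_params(64, 64, 3), factorized_params(64, 64, 3))  # 110592 49152
```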


4. Numerical Computation and Validation

To validate these definitions numerically, we consider a discrete 1D signal $x$ and a gradient-detecting kernel $w$.

Experiment Data:

  • Input Signal: $x = [3, 1, 4, 1, 5, 9]$
  • Kernel: $w = [-1, 0, 1]$

Stepwise Execution:

  1. Coordinate $n=1$: $(x[0] \cdot w[0]) + (x[1] \cdot w[1]) + (x[2] \cdot w[2]) = (3 \cdot (-1)) + (1 \cdot 0) + (4 \cdot 1) = 1$
  2. Coordinate $n=2$: $(x[1] \cdot w[0]) + (x[2] \cdot w[1]) + (x[3] \cdot w[2]) = (1 \cdot (-1)) + (4 \cdot 0) + (1 \cdot 1) = 0$
  3. Coordinate $n=3$: $(x[2] \cdot w[0]) + (x[3] \cdot w[1]) + (x[4] \cdot w[2]) = (4 \cdot (-1)) + (1 \cdot 0) + (5 \cdot 1) = 1$
  4. Coordinate $n=4$: $(x[3] \cdot w[0]) + (x[4] \cdot w[1]) + (x[5] \cdot w[2]) = (1 \cdot (-1)) + (5 \cdot 0) + (9 \cdot 1) = 8$

With $L = 6$, $K = 3$, $P = 0$, and $S = 1$, the output length is $\lfloor (6 - 3)/1 \rfloor + 1 = 4$, so the resulting gradient tensor is $[1, 0, 1, 8]$. The kernel is applied without flipping (cross-correlation), and each output is the difference between the samples on either side of position $n$, highlighting local variations within the input sequence.
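The stepwise arithmetic can be reproduced in code. The sketch below applies the kernel without flipping, as in the steps above, and additionally supports zero padding and stride so that the output length can be compared against the dimension formula from Section 1; `correlate_1d` is an illustrative name. Evaluating every valid window of the six-sample signal yields four outputs, the first three of which match steps 1 to 3.

```python
def correlate_1d(x, w, padding=0, stride=1):
    """Sliding dot product of w over x with symmetric zero padding and a given stride."""
    padded = [0] * padding + list(x) + [0] * padding
    n_out = (len(padded) - len(w)) // stride + 1
    return [
        sum(padded[n * stride + k] * w[k] for k in range(len(w)))
        for n in range(n_out)
    ]


x = [3, 1, 4, 1, 5, 9]
w = [-1, 0, 1]
print(correlate_1d(x, w))             # [1, 0, 1, 8]
print(correlate_1d(x, w, padding=1))  # [1, 1, 0, 1, 8, -5]
print(correlate_1d(x, w, stride=2))   # [1, 1]
```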

5. Memory Management in High-Dimensional Training

Processing high-dimensional tensors (especially 3D volumes) requires efficient memory allocation strategies. Data Generators and Lazy Loading (Yield) patterns are employed to stream data into the GPU/TPU memory space without loading the entire dataset into RAM.

Industrial Application: In medical imaging (CT/MRI) or 4K video analysis, where a single tensor can exceed available GPU memory, using yield-based generators for slice-wise or batch-wise ingestion is a common way to keep memory usage bounded and prevent Out-of-Memory (OOM) errors.
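A minimal sketch of the lazy-loading pattern using a Python generator. The `load_slice` callback and `fake_load_slice` stand-in are hypothetical: in practice the callback would read one slice from disk (for example a single CT slice or video frame), which is not shown here.

```python
from typing import Callable, Iterator, List

Slice = List[List[float]]  # one 2D slice as nested lists


def stream_batches(
    num_slices: int,
    batch_size: int,
    load_slice: Callable[[int], Slice],
) -> Iterator[List[Slice]]:
    """Yield batches of slices, loading each slice only when its batch is requested."""
    for start in range(0, num_slices, batch_size):
        stop = min(start + batch_size, num_slices)
        yield [load_slice(i) for i in range(start, stop)]


loaded = []  # records which slices have actually been loaded


def fake_load_slice(index: int) -> Slice:
    """Hypothetical stand-in for reading one 4 x 4 slice from storage."""
    loaded.append(index)
    return [[float(index)] * 4 for _ in range(4)]


for batch in stream_batches(5, 2, fake_load_slice):
    # Only the slices of the current batch are held in memory here
    print(len(batch), "slices in batch; loaded so far:", loaded)
```

Because the generator body runs only when the next batch is requested, peak memory is bounded by one batch rather than by the full volume.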
