PomaiDB | The predictable vector database for the edge

The fundamental operation of Convolutional Neural Networks (CNNs) is the extraction of local features through discrete convolution. This study formalizes the mapping of input tensors to feature spaces across varying dimensionalities, providing a unified mathematical framework for signals, spatial images, and volumetric tensors.

1. Formal Mathematical Framework

The convolution operation in discrete space is defined as the summed product of a sliding kernel and an input tensor. For an input tensor $X$ and a kernel $W$ , the resulting feature map $Y$ is determined by the dot product at every valid spatial coordinate.

1D Temporal Convolution

In sequential processing, the 1D convolution captures temporal dependencies along a single axis.

Mathematical Definition: $y[n] = (x * w)[n] = \sum_{k=-\infty}^{\infty} x[k] \cdot w[n-k]$

Dimensional Mapping & Convolutional Algebra: To formalize the feature space reduction, consider an input sequence of length $L$ and a temporal kernel of size $K$ . With padding $P$ and stride $S$ , the output dimension $L'$ follows the algebraic mapping: $L' = \left\lfloor \frac{L + 2P - K}{S} \right\rfloor + 1$ This mapping explicitly defines the dimensions of the feature tensor passed to subsequent computational layers, ensuring architectural compatibility across deep sequential hierarchies.

Application Analysis: 1D CNNs are optimized for sequence modeling where local connectivity is defined by temporal proximity. This is critical for high-frequency signal analysis and real-time telemetry processing in resource-constrained environments.

2. Spatial Mapping in 2D CNNs

Standard image processing employs 2D convolutions to extract local spatial features. The kernel $K$ slides along the height ( $H$ ) and width ( $W$ ) axes.

Mathematical Definition: $O[i, j] = \sum_{m} \sum_{n} I[i+m, j+n] \cdot K[m, n]$

Spatial Dimensionality Mechanics: For an input tensor parameterized exactly by height $H_{in}$ and width $W_{in}$ , adopting a kernel $K \in \mathbb{R}^{k_h \times k_w}$ , stride $S=(s_h, s_w)$ , and padding $P=(p_h, p_w)$ , the feature layer is spatially projected:

$H_{out} = \left\lfloor \frac{H_{in} + 2p_h - k_h}{s_h} \right\rfloor + 1$ $W_{out} = \left\lfloor \frac{W_{in} + 2p_w - k_w}{s_w} \right\rfloor + 1$

This mechanism mathematically governs the anisotropic reduction if strides or paddings are unequally distributed along the coordinate axes.

Receptive Field Calculation

The receptive field $R_l$ of a layer $l$ is defined as: $R_l = R_{l-1} + (k_l - 1) \prod_{i=1}^{l-1} s_i$ where $k_l$ is the kernel size and $s_i$ is the stride. This confirms that deeper layers aggregate increasingly global information from the input space.

3. Spatiotemporal Volumetric Inversion (3D CNNs)

3D convolutions extend the kernel dimensionality to capture information across height, width, and a third dimension—usually depth (volumetric data) or time (video).

Mathematical Definition: $V[x, y, z] = \sum_{i} \sum_{j} \sum_{k} I[x+i, y+j, z+k] \cdot K[i, j, k]$

Volumetric Tensor Transformation: Operating on volumetric spaces $V \in \mathbb{R}^{D \times H \times W}$ imposes a cubic reduction logic. Given depth $D_{in}$ , height $H_{in}$ , and width $W_{in}$ , the spatiotemporal window sizes $(k_d, k_h, k_w)$ and operation parameters $(s_d, p_d)$ dictate the output depth:

$D_{out} = \left\lfloor \frac{D_{in} + 2p_d - k_d}{s_d} \right\rfloor + 1$ (Alongside symmetrical $H_{out}$ and $W_{out}$ volume calculations).

This spatial-temporal tensor reduction is mathematically crucial for determining the computational complexity boundary $\mathcal{O}(D_{out} \cdot H_{out} \cdot W_{out} \cdot C_{in} \cdot C_{out} \cdot k_d \cdot k_h \cdot k_w)$ prior to execution.

Complexity Comparison: A 3D kernel of size $k \times k \times k$ has $k^3$ parameters. For a video sequence of length $T$ , the computational cost is proportional to $T \cdot H \cdot W \cdot k^3$ . To mitigate this, practitioners often utilize Pseudo-3D Convolutions (P3D) or (2+1)D blocks, factorizing the 3D kernel into separate spatial and temporal components.

4. Numerical Computation and Validation

To validate the implementation of these architectures, we consider a discrete 1D signal $x$ and a gradient-detecting kernel $w$ .

Experiment Data:

Input Signal: $x = [3, 1, 4, 1, 5, 9]$
Kernel: $w = [-1, 0, 1]$

Stepwise Execution:

Coordinate $n=1$ : $(x[0] \cdot w[0]) + (x[1] \cdot w[1]) + (x[2] \cdot w[2]) = (3 \cdot -1) + (1 \cdot 0) + (4 \cdot 1) = 1$
Coordinate $n=2$ : $(x[1] \cdot w[0]) + (x[2] \cdot w[1]) + (x[3] \cdot w[2]) = (1 \cdot -1) + (4 \cdot 0) + (1 \cdot 1) = 0$
Coordinate $n=3$ : $(x[2] \cdot w[0]) + (x[3] \cdot w[1]) + (x[4] \cdot w[2]) = (4 \cdot -1) + (1 \cdot 0) + (5 \cdot 1) = 1$

The resulting gradient tensor is $[1, 0, 1]$ , highlighting the spatial variations within the input sequence.

5. Memory Management in High-Dimensional Training

Processing high-dimensional tensors (especially 3D volumes) requires efficient memory allocation strategies. Data Generators and Lazy Loading (Yield) patterns are employed to stream data into the GPU/TPU memory space without loading the entire dataset into RAM.

Industrial Application: In medical imaging (CT/MRI) or 4K video analysis, where a single tensor can exceed available GPU memory, the use of yield for slice-wise or batch-wise ingestion is the standard for maintaining architectural stability and preventing Out-of-Memory (OOM) errors.

Mathematical Foundations of Convolutional Architectures: A Spatiotemporal Research Study