CNNs are a special kind of neural network designed to process data with a grid-like topology, such as images and time series. Unlike ANNs, which require the input to be flattened into a 1D vector, CNNs preserve the spatial structure. The issue with flattening an image is that the spatial relationships between pixels are lost, which leads to poor generalization, a large number of parameters, and an increased risk of overfitting. CNNs solve this by using learnable filters (known as kernels) and the convolution operation.
Before diving into CNNs, we need to understand what convolution is. If you've ever taken a class related to Fourier transforms in university, then you might have seen this formula:
$$ (f * g)(x) = \int_{-\infty}^{\infty} f(\tau) \, g(x - \tau) \; d\tau $$
This is the convolution of two continuous functions, $f$ and $g$. In the context of CNNs, we use the definition of convolution that deals with discrete data, but the core intuition of both is the same: convolution measures the overlap between one function and a reversed, shifted version of the other. Within the integral, $f(\tau)$ is the function and $g(x - \tau)$ is the reversed, shifted function; their product measures "how much they agree with each other" or "how much they overlap with each other" at each shift $x$.
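For discrete 1D signals, the integral becomes a sum:

$$ (f * g)[n] = \sum_{m=-\infty}^{\infty} f[m] \, g[n - m] $$

As a quick sanity check of this flip-and-slide behavior, here is a minimal NumPy sketch (the arrays are arbitrary example values):

```python
import numpy as np

f = np.array([1, 2, 3])
g = np.array([0, 1, 0.5])

# np.convolve reverses g and slides it across f,
# summing the element-wise products at each shift
print(np.convolve(f, g))  # [0.  1.  2.5 4.  1.5]
```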
Since images are mostly 2D in nature, convolution in CNNs is typically applied to matrices. Let's take a look at how it works. Consider two matrices: a 3x3 input and a 2x2 kernel.
Note: Here, $*$ denotes the convolution operator, not matrix multiplication.
In the continuous convolution formula, the second function is reversed before being applied. The matrix equivalent of this reversal is a 180-degree rotation.
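In NumPy, for instance, this rotation is just two 90-degree rotations, or equivalently a flip along both axes (the kernel values below are arbitrary examples):

```python
import numpy as np

kernel = np.array([[1, 2],
                   [3, 4]])

# A 180-degree rotation = two 90-degree rotations,
# or equivalently np.flip(kernel) along both axes
print(np.rot90(kernel, 2))
# [[4 3]
#  [2 1]]
```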
After the rotation, the second function is slid across the first, and the overlap is measured at each shift. The matrix equivalent is sliding the second matrix across the first and, at each position, computing an element-wise multiplication followed by a sum.
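To tie the steps together, here is a minimal sketch of a "valid" 2D convolution for the 3x3 and 2x2 shapes above (the matrix values are arbitrary examples), checked against SciPy's `convolve2d`:

```python
import numpy as np
from scipy.signal import convolve2d

def conv2d_valid(image, kernel):
    """'Valid' 2D convolution: rotate the kernel 180 degrees, slide it
    over the image, and sum the element-wise products at each position."""
    flipped = np.rot90(kernel, 2)                # the 180-degree rotation
    kh, kw = flipped.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]    # region under the kernel
            out[i, j] = np.sum(patch * flipped)  # measure the overlap
    return out

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
kernel = np.array([[1, 2],
                   [3, 4]])

print(conv2d_valid(image, kernel))
# [[23. 33.]
#  [53. 63.]]
print(convolve2d(image, kernel, mode="valid"))   # SciPy agrees
```

The nested loops make the flip-slide-multiply-sum structure explicit; real CNN libraries implement the same idea with heavily optimized routines.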