Hey there! In this blog post we are going to explore different CNNs and see how they evolved from the foundational LeNet to much more complex networks like Inception, ResNet, and ResNeXt.
AlexNet was introduced in the paper ImageNet Classification with Deep Convolutional Neural Networks, and won the 2012 ImageNet challenge by a huge margin. The first foundational CNN, LeNet, was introduced back in 1995, but CNNs weren't picked up that quickly by the computer vision community right after that.
LeNet achieved good results on small datasets, but the performance and feasibility of training CNNs on larger datasets had yet to be established. The major bottleneck was compute power: the CPUs available around 1995 simply weren't powerful enough to apply CNNs at scale, i.e. on high-resolution images.
AlexNet is sorta like a scaled-up successor to LeNet, but with quite a few optimizations and changes:
AlexNet used a non-saturating activation function, ReLU, instead of saturating activation functions like sigmoid and tanh, which sped up training.
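To see what "saturating" means in practice, here is a tiny sketch (PyTorch is just my choice here, not something from the paper) comparing the gradients of sigmoid and ReLU at large input magnitudes:

```python
import torch

# Sigmoid saturates: for large |x| its gradient is nearly zero,
# which slows down gradient-based training.
x = torch.tensor([-8.0, 0.5, 8.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)   # ~ tensor([3.4e-04, 2.4e-01, 3.4e-04])

# ReLU does not saturate for positive inputs: the gradient is 1
# whenever the unit is active, 0 otherwise.
x = torch.tensor([-8.0, 0.5, 8.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 1., 1.])
```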
AlexNet utilized cross-GPU parallelization. It was trained on two GTX 580 GPUs, with each GPU handling half of the model's kernels. That particular GPU had the advantage of being able to read from and write to the other GPU's memory directly, without going through host machine memory. To reduce the communication overhead, the two GPUs only communicated in certain layers of the network.
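Just to make the idea concrete, here is a rough, hypothetical PyTorch sketch of splitting one conv layer's kernels across two GPUs; the class name SplitConv is made up for illustration, and this is not AlexNet's actual implementation:

```python
import torch
import torch.nn as nn

class SplitConv(nn.Module):
    """Half of the 96 kernels live on cuda:0, the other half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.conv_a = nn.Conv2d(3, 48, kernel_size=11, stride=4).to("cuda:0")
        self.conv_b = nn.Conv2d(3, 48, kernel_size=11, stride=4).to("cuda:1")

    def forward(self, x):
        ya = self.conv_a(x.to("cuda:0"))
        yb = self.conv_b(x.to("cuda:1"))
        # Communication step: gather both halves onto one device and
        # concatenate along the channel dimension.
        return torch.cat([ya, yb.to("cuda:0")], dim=1)

if torch.cuda.device_count() >= 2:
    out = SplitConv()(torch.randn(1, 3, 227, 227))
    print(out.shape)   # torch.Size([1, 96, 55, 55])
```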
AlexNet introduced Local Response Normalization (LRN), which was applied after the ReLU activation in a few layers to improve the generalization of the network.
$$ b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left( k + \alpha \sum_{j=\max(0,\, i - n/2)}^{\min(N-1,\, i + n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta}} $$
The main idea of LRN is to boost neurons with strong activations while suppressing the nearby neurons with lower activations. It is very similar to the biological idea called **lateral inhibition**, which is the capacity of an excited neuron to reduce the activity of its neighboring neurons.
For each neuron in the output, LRN looks at the neurons at the same $(x, y)$ position across the $n$ adjacent feature maps, and normalizes so that strongly activated neurons are boosted while the others in that neighborhood are suppressed.
According to the AlexNet paper, the values of $n, \; k, \; \alpha, \; \beta$ that were used are $k = 2, \; n = 5, \; \alpha = 10^{-4}, \; \beta = 0.75$.
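Here is a minimal, unoptimized PyTorch sketch of that formula using the paper's hyperparameters; the function name `lrn` and the explicit loop over channels are just for readability (PyTorch also ships a built-in `torch.nn.LocalResponseNorm`, though its parameterization of `alpha` may differ slightly from the paper's):

```python
import torch

def lrn(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Local Response Normalization across channels, as in the formula above.

    a: activations of shape (batch, channels, height, width).
    """
    N = a.size(1)                      # total number of feature maps
    b = torch.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)        # clamp the window at the channel edges
        hi = min(N - 1, i + n // 2)
        squared_sum = (a[:, lo:hi + 1] ** 2).sum(dim=1)
        b[:, i] = a[:, i] / (k + alpha * squared_sum) ** beta
    return b

x = torch.relu(torch.randn(1, 96, 55, 55))   # e.g. activations after a conv + ReLU
print(lrn(x).shape)                          # torch.Size([1, 96, 55, 55])
```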
Overlapping pooling layers were used in AlexNet, i.e. the stride is smaller than the kernel size in the pooling layers.
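As a quick sketch (again in PyTorch, with the pooling parameters used in AlexNet), a 3×3 max-pool with stride 2 means neighboring pooling windows overlap by one row/column:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                 # e.g. feature maps from an earlier conv layer

pool = nn.MaxPool2d(kernel_size=3, stride=2)   # stride (2) < kernel size (3) -> windows overlap
print(pool(x).shape)                           # torch.Size([1, 96, 27, 27])
```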
The above image shows the complete end-to-end architecture of AlexNet, which contains 5 convolutional layers and 3 fully-connected layers. Let’s break it down into simple chunks.