Hey there! In this blog post, we will explore how to extract the art style from an image of a painting and then apply it to another image, generating a new image that retains the content of the original but uses the art style of the painting. If you are not familiar with CNNs, I would highly recommend checking out my other blogs on CNNs — CNNs Explained and Exploring different CNNs.
In the above image, a silly cat picture is combined with “The Starry Night” painting, resulting in a silly cat picture with a blue-ish tint everywhere and swirls, similar to those present in the Starry Night painting. The final output does look kinda scary tho.
Neural Style Transfer (NST) is a computer vision technique that utilizes deep neural networks to combine the content of one image with the style of another, resulting in a new image that resembles the first image painted in the style of the second image. In this blog post, we will go through the paper titled “Image Style Transfer Using Convolutional Neural Networks” which leverages the layer-wise representation of CNN to extract content and style from the image and then combine them.
CNNs consist of layers of small computational units (convolutional layers) that process visual information hierarchically in a feed-forward manner. Each layer consists of a collection of image filters, each of which extracts a certain feature from the input image, so the output of a layer is a “differently filtered” version of the input. The representation of the image at the deeper layers cares more about the “content” of the image than about its exact pixel values.
In neural style transfer, a pre-trained convolutional neural network (such as VGG19) is used to extract features from three images — the content image (the original photo), the style image (the artistic reference), and the generated image (which is randomly initialized at first). During each iteration, the network performs a forward pass on all three images, and features are extracted from specific layers to capture both content and style.
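To make this concrete, here is a minimal sketch of that feature-extraction step using PyTorch and torchvision (the paper does not prescribe a framework; the helper name `extract_features` and the loop-based approach are my own for illustration):

```python
import torch
import torchvision.models as models

# Load a pre-trained VGG19 and keep only the convolutional part (the "features" module).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # the network stays frozen; only the generated image gets optimized

def extract_features(image, layers):
    """Run one forward pass and collect activations at the requested layer indices."""
    feats = {}
    x = image
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if idx in layers:
            feats[idx] = x
    return feats
```

During optimization, this forward pass is run on the content, style, and generated images, and the collected activations are compared to build the content and style losses.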
For extracting the content, outputs from the deeper layers are used, as these layers capture high-level information while discarding the fine pixel-level details. For extracting the style, we look at how different filters in the network respond together. Instead of using the filters directly, we use the correlations between them across the whole image. These correlations help describe the texture of the image. By doing this across several layers of the network, we get a multi-level summary of the image’s texture, without focusing on the exact layout of objects.
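The correlations the paper uses are the Gram matrices of the filter responses at each style layer. A minimal sketch of that computation might look like this (the exact normalization varies across implementations; this one divides by the number of elements):

```python
import torch

def gram_matrix(feature_map):
    """Correlations between filter responses: a (channels x channels) matrix per image."""
    b, c, h, w = feature_map.shape          # batch, channels, height, width
    f = feature_map.view(b, c, h * w)       # flatten each filter's spatial response
    gram = torch.bmm(f, f.transpose(1, 2))  # inner products between every pair of filters
    return gram / (c * h * w)               # normalize so layer sizes are comparable
```

Because the spatial dimensions are summed out, the Gram matrix describes which filters fire together anywhere in the image, i.e. texture, not layout.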
A pre-trained VGG-19 network is used without the fully-connected layers at the end, as we only need the convolutional layers to extract features from the content and style images. First, the content image is fed into the network and the outputs of the conv4_2 layer are stored. Within the paper, conv4_2 is used as the content extractor: the layer whose output is a high-level representation of the image without focusing much on pixel-level details.
Similarly, the style image is also fed into the network, but for extracting style features, outputs from several convolutional layers are stored: conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1 are used to extract the style features.
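In torchvision's VGG19, those named layers correspond to specific indices in the `features` module, so, building on the earlier sketches (and assuming `content_image` and `style_image` are already preprocessed tensors), the targets could be collected roughly like this:

```python
# Indices into torchvision's vgg19().features corresponding to the named layers.
CONTENT_LAYERS = {21: "conv4_2"}
STYLE_LAYERS = {0: "conv1_1", 5: "conv2_1", 10: "conv3_1", 19: "conv4_1", 28: "conv5_1"}

content_feats = extract_features(content_image, CONTENT_LAYERS)        # content targets
style_feats = extract_features(style_image, STYLE_LAYERS)              # raw style activations
style_grams = {idx: gram_matrix(f) for idx, f in style_feats.items()}  # Gram-matrix style targets
```

These stored activations stay fixed; the generated image is then optimized so that its own conv4_2 output matches `content_feats` and its Gram matrices at the style layers match `style_grams`.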