Neural style transfer is a really cool computer vision technique that uses deep learning to combine the content of one image with the artistic style of another. Imagine taking a photo of your dog and repainting it in the style of Van Gogh's "The Starry Night." We give the model a content image (what we want to see) and a style image (how we want to see it), and the algorithm blends them together.
The key idea is that a CNN pre-trained for image classification (like VGG19) has already learned to create a rich hierarchy of feature representations of an image.
Neural style transfer exploits this hierarchy in a brilliant way:
- The deeper layers of the network capture the content of an image: the objects and their spatial arrangement.
- The statistics of the feature maps capture the style: the colors, textures, and brush strokes.
There are many algorithms used for neural style transfer. Gatys et al. came up with the original optimization-based, feature-correlation method. We will focus on Adaptive Instance Normalization (AdaIN) by Huang et al. AdaIN-based models take both the content and style images as input at the same time and are designed to separate and then recombine the content and style features in a single forward pass. They successfully combine the speed of feed-forward networks with the flexibility of the original method by Gatys et al.
The crucial insight here is that the style of an image is largely captured by the mean and variance of the feature maps at different layers of a trained CNN.
The AdaIN layer is surprisingly simple and has no learnable parameters itself. It does its job in two steps:
1. Normalize the content feature maps, stripping away their own style by subtracting the per-channel mean and dividing by the per-channel standard deviation.
2. Re-scale and re-shift the normalized features using the standard deviation and mean of the style feature maps.
The formula is elegantly simple:
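AdaIN(x, y) = σ(y) * ((x − μ(x)) / σ(x)) + μ(y)

Here x and y are the content and style feature maps, and μ and σ are the mean and standard deviation computed per channel over the spatial dimensions.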
It calculates the style statistics from the style image and applies them directly to the de-styled content image.
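In code, the whole operation is just a few lines. Here is a minimal PyTorch sketch; the function name `adain` and the `eps` stabilizer are my own choices, not taken from any reference implementation:

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Align the per-channel mean/std of the content features with those of the style features."""
    # Both inputs are encoder feature maps of shape (N, C, H, W).
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    # Step 1: strip the content image's own style (per-channel normalization).
    normalized = (content_feat - c_mean) / c_std
    # Step 2: re-style with the statistics of the style features.
    return normalized * s_std + s_mean
```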
The one issue you may have noticed by now is that the output of the AdaIN layer is not an image but a feature map. This feature map needs to be transformed back into pixel space.
Here's how the AdaIN layer fits into the full network:
The key is that only the decoder is trained. The VGG encoder is fixed, and the AdaIN layer has no weights. The decoder learns how to properly reconstruct an image from feature maps that have had the style-swap operation performed on them.
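Here is a rough sketch of that pipeline, assuming you already have an `encoder` (VGG truncated at relu4_1) and a `decoder` module, and reusing the `adain` function from above; the class name and the `alpha` content/style trade-off knob are my own additions:

```python
import torch.nn as nn

class StyleTransferNet(nn.Module):
    """Frozen VGG encoder -> parameter-free AdaIN -> trainable decoder."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        # Only the decoder will be optimized; the encoder stays frozen.
        for p in self.encoder.parameters():
            p.requires_grad = False

    def forward(self, content_img, style_img, alpha: float = 1.0):
        content_feat = self.encoder(content_img)    # relu4_1 features of the content image
        style_feat = self.encoder(style_img)        # relu4_1 features of the style image
        t = adain(content_feat, style_feat)         # style-swapped feature map (the target t)
        t = alpha * t + (1 - alpha) * content_feat  # optional content/style trade-off
        g = self.decoder(t)                         # decode back to pixel space
        return g, t
```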
The content loss ensures that the output image preserves the structure of the original content. The logic is a bit different here:
- The output features of the AdaIN layer, t, are our target.
- The decoder takes t and generates an output image, g.
- We pass g back through the encoder to get its features, feat(g).
- The content loss is the mean squared error between feat(g) and our target t: loss_content = MSE(feat(g), t)
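As a sketch, with the encoder assumed to return the relu4_1 feature map:

```python
import torch.nn.functional as F

def content_loss(encoder, g, t):
    # Note: the target is the AdaIN output t, not the raw content-image features.
    feat_g = encoder(g)
    return F.mse_loss(feat_g, t)
```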
The style loss ensures the output image matches the style of the style image. This is calculated by comparing the statistics (mean and variance) of the output image's features to the style image's features.
Here's the process:
- Pass the output image g through the encoder and get its feature maps from multiple layers (e.g., relu1_1, relu2_1, relu3_1, relu4_1).
- Compute the mean and variance of these feature maps for both g and the style image.
- The style loss is the sum, over these layers, of the MSE between the statistics of g and the style image.

The final loss that we use to train our decoder is a weighted sum of these two losses:
total_loss = loss_content + style_weight * loss_style
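A sketch of the style loss, assuming a helper `encoder_feats(x)` (my own naming) that returns the relu1_1 through relu4_1 activations as a list:

```python
import torch.nn.functional as F

def style_loss(feats_g, feats_s):
    # Compare per-channel means and standard deviations at each chosen encoder layer.
    loss = 0.0
    for fg, fs in zip(feats_g, feats_s):
        loss = loss + F.mse_loss(fg.mean(dim=(2, 3)), fs.mean(dim=(2, 3)))
        loss = loss + F.mse_loss(fg.std(dim=(2, 3)), fs.std(dim=(2, 3)))
    return loss
```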
The authors used VGG-19 as the encoder, with only the layers up to relu4_1 used for extracting the feature maps. The decoder mirrors the encoder in reverse. Training was performed with a batch size of 8 for both the content & style image datasets. I have done the same: I trained the decoder for 100,000 steps on an RTX 4080S GPU, which took about 4 hours.
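For completeness, here is roughly what the training loop looks like; the loader names, `style_weight`, and learning rate below are placeholders of mine rather than exact values from the paper or my runs:

```python
import itertools
from torch.optim import Adam

def train_decoder(net, encoder_feats, content_loader, style_loader,
                  steps=100_000, style_weight=10.0, lr=1e-4, device="cuda"):
    # Only the decoder has trainable parameters; the encoder is frozen and AdaIN has none.
    optimizer = Adam(net.decoder.parameters(), lr=lr)
    content_iter = itertools.cycle(content_loader)  # loaders are assumed to yield image tensors
    style_iter = itertools.cycle(style_loader)

    for step in range(steps):
        content_img = next(content_iter).to(device)  # batch of 8 content images
        style_img = next(style_iter).to(device)      # batch of 8 style images

        g, t = net(content_img, style_img)
        loss = content_loss(net.encoder, g, t) + style_weight * style_loss(
            encoder_feats(g), encoder_feats(style_img))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```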


The content image, followed by the style image & the model output.
All my experiments are in this Colab notebook. I can see artifacts along the edges, but it seems to be working well. A little more experimentation & I might convert this into an app!