Neural style transfer is a really cool computer vision technique that uses deep learning to combine the content of one image with the artistic style of another. Imagine taking a photo of your dog and repainting it in the style of Van Gogh's "The Starry Night." We give the model a content image (what we want to see) and a style image (how we want to see it), and the algorithm blends them together.
The key idea is that a CNN pre-trained for image classification (like VGG19) has already learned to create a rich hierarchy of feature representations of an image.
Neural style transfer exploits this hierarchy in a brilliant way:
- The deeper layers of the network capture the content of an image: the objects and their spatial arrangement.
- The statistics of the feature maps capture the style: the colors, textures, and brush strokes.
There are many algorithms used for neural style transfer. Gatys et al. came up with the original optimization-based, feature-correlation method. We will focus on Adaptive Instance Normalization (AdaIN) by Huang et al. AdaIN-based models take both the content and style images as input at the same time and are designed to separate and then recombine the content and style features in a single forward pass. They successfully combine the speed of feed-forward networks with the flexibility of the original method by Gatys et al.
The crucial insight here is that the style of an image is largely captured by the mean and variance of the feature maps at different layers of a trained CNN.
The AdaIN layer is surprisingly simple and has no learnable parameters itself. It does its job in two steps:
1. Normalize the content feature maps, stripping away their own style by subtracting the per-channel mean and dividing by the per-channel standard deviation.
2. Re-scale and re-shift the normalized features using the standard deviation and mean of the style feature maps.
The formula is elegantly simple:
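AdaIN(x, y) = σ(y) * ((x − μ(x)) / σ(x)) + μ(y)

Here x and y are the content and style feature maps, and μ and σ are the mean and standard deviation computed per channel over the spatial dimensions.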
It calculates the style statistics from the style image and applies them directly to the de-styled content image.
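In code, the whole operation is just a few lines. Here is a minimal PyTorch sketch; the function name `adain` and the `eps` stabilizer are my own choices, not taken from any reference implementation:

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Align the per-channel mean/std of the content features with those of the style features."""
    # Both inputs are encoder feature maps of shape (N, C, H, W).
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    # Step 1: strip the content image's own style (per-channel normalization).
    normalized = (content_feat - c_mean) / c_std
    # Step 2: re-style with the statistics of the style features.
    return normalized * s_std + s_mean
```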
The one issue you may have noticed by now is that the output of the AdaIN layer is not an image but a feature map. This feature map needs to be transformed back into pixel space.
Here's how the AdaIN layer fits into the full network:
The key is that only the decoder is trained. The VGG encoder is fixed, and the AdaIN layer has no weights. The decoder learns how to properly reconstruct an image from feature maps that have had the style-swap operation performed on them.
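Here is a rough sketch of that pipeline, assuming you already have an `encoder` (VGG truncated at relu4_1) and a `decoder` module, and reusing the `adain` function from above; the class name and the `alpha` content/style trade-off knob are my own additions:

```python
import torch.nn as nn

class StyleTransferNet(nn.Module):
    """Frozen VGG encoder -> parameter-free AdaIN -> trainable decoder."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        # Only the decoder will be optimized; the encoder stays frozen.
        for p in self.encoder.parameters():
            p.requires_grad = False

    def forward(self, content_img, style_img, alpha: float = 1.0):
        content_feat = self.encoder(content_img)    # relu4_1 features of the content image
        style_feat = self.encoder(style_img)        # relu4_1 features of the style image
        t = adain(content_feat, style_feat)         # style-swapped feature map (the target t)
        t = alpha * t + (1 - alpha) * content_feat  # optional content/style trade-off
        g = self.decoder(t)                         # decode back to pixel space
        return g, t
```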
The content loss ensures that the output image preserves the structure of the original content. The logic is a bit different here:
- The output features of the AdaIN layer, t, are our target.
- The decoder takes t and generates an output image, g.
- We pass g back through the encoder to get its features, feat(g).
- The content loss is the mean squared error between feat(g) and our target t: loss_content = MSE(feat(g), t)
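As a sketch, with the encoder assumed to return the relu4_1 feature map:

```python
import torch.nn.functional as F

def content_loss(encoder, g, t):
    # Note: the target is the AdaIN output t, not the raw content-image features.
    feat_g = encoder(g)
    return F.mse_loss(feat_g, t)
```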
The style loss ensures the output image matches the style of the style image. This is calculated by comparing the statistics (mean and variance) of the output image's features to the style image's features.
Here's the process:
- Pass the output image g through the encoder and get its feature maps from multiple layers (e.g., relu1_1, relu2_1, relu3_1, relu4_1).
- Compute the mean and variance of these feature maps for both g and the style image.
- The style loss is the sum, over these layers, of the MSE between the statistics of g and the style image.

The final loss that we use to train our decoder is a weighted sum of these two losses:
total_loss = loss_content + style_weight * loss_style
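A sketch of the style loss, assuming a helper `encoder_feats(x)` (my own naming) that returns the relu1_1 through relu4_1 activations as a list:

```python
import torch.nn.functional as F

def style_loss(feats_g, feats_s):
    # Compare per-channel means and standard deviations at each chosen encoder layer.
    loss = 0.0
    for fg, fs in zip(feats_g, feats_s):
        loss = loss + F.mse_loss(fg.mean(dim=(2, 3)), fs.mean(dim=(2, 3)))
        loss = loss + F.mse_loss(fg.std(dim=(2, 3)), fs.std(dim=(2, 3)))
    return loss
```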
The authors used VGG-19 as the encoder, with only the layers up to relu4_1 used for extracting the feature maps. The decoder mirrors the encoder in reverse. Training was performed with a batch size of 8 for both the content & style image datasets. I have done the same: I trained the decoder for 100,000 steps on an RTX 4080S GPU, which took about 4 hours.
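For completeness, here is roughly what the training loop looks like; the loader names, `style_weight`, and learning rate below are placeholders of mine rather than exact values from the paper or my runs:

```python
import itertools
from torch.optim import Adam

def train_decoder(net, encoder_feats, content_loader, style_loader,
                  steps=100_000, style_weight=10.0, lr=1e-4, device="cuda"):
    # Only the decoder has trainable parameters; the encoder is frozen and AdaIN has none.
    optimizer = Adam(net.decoder.parameters(), lr=lr)
    content_iter = itertools.cycle(content_loader)  # loaders are assumed to yield image tensors
    style_iter = itertools.cycle(style_loader)

    for step in range(steps):
        content_img = next(content_iter).to(device)  # batch of 8 content images
        style_img = next(style_iter).to(device)      # batch of 8 style images

        g, t = net(content_img, style_img)
        loss = content_loss(net.encoder, g, t) + style_weight * style_loss(
            encoder_feats(g), encoder_feats(style_img))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```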


The content image, followed by the style image & the model output.
All my experiments are in this Colab notebook. I can see artifacts along the edges, but it seems to be working well. A little more experimentation & I might convert this into an app!