A normalizing flow is a method for transforming a simple probability distribution (like a standard bell curve) into a much more complex and interesting one. Imagine you're a sculptor, but you start with only a perfect, simple sphere of clay. Your goal is to create a complex sculpture of a cat. You can squeeze, squish, and stretch (transform) the clay. At its heart, a normalizing flow is a sequence of invertible transformations. We start with a simple base probability distribution (the normal distribution, z ~ N(0, I)) and "flow" it through a series of transformations (or steps) to reach a complex distribution, x = f(z). Each transformation (or step) is designed to be invertible. Because every step is invertible, we can go in both directions: forward (z -> x) to generate new samples, and backward (x -> z) to compute the likelihood of real data.
How do you build a transformation that is complex enough to create images but simple enough to reverse? You use a Coupling Layer. Instead of transforming the whole image at once, we split it in half. We keep the first half (x_A) unchanged and use a neural network to look at it. That network then predicts a scale (s) and a shift (t) to apply to the second half: x_B' = x_B * exp(s) + t. We use exp(s) because the scale must always be positive to keep the math from breaking. Because x_A was never changed, we can always use it to re-calculate the exact same s and t to undo the operation later: x_B = (x_B' - t) * exp(-s). This is how we enforce invertibility in the coupling layer.
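To make this concrete, here is a minimal sketch of an affine coupling step on a plain vector. The names `coupling_forward`, `coupling_inverse`, and the tiny `net` are illustrative stand-ins, not the image model built later in this post:

```python
import torch

def coupling_forward(x, net):
    """Affine coupling: keep the first half, transform the second half."""
    x1, x2 = x.chunk(2, dim=-1)
    s, t = net(x1).chunk(2, dim=-1)   # scale (in log space) and shift, predicted from x1
    y2 = x2 * torch.exp(s) + t        # exp keeps the scale strictly positive
    return torch.cat([x1, y2], dim=-1)

def coupling_inverse(y, net):
    y1, y2 = y.chunk(2, dim=-1)
    s, t = net(y1).chunk(2, dim=-1)   # y1 == x1, so we recover the exact same s and t
    x2 = (y2 - t) * torch.exp(-s)     # undo shift, then undo scale
    return torch.cat([y1, x2], dim=-1)

# `net` is a stand-in for any network mapping d/2 inputs to d outputs
net = torch.nn.Linear(2, 4)
x = torch.randn(3, 4)
y = coupling_forward(x, net)
x_rec = coupling_inverse(y, net)
print(torch.allclose(x, x_rec, atol=1e-5))  # True: exact inversion up to float error
```

Note that the network itself never has to be invertible; invertibility comes entirely from the affine structure of the transformation.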
In probability, when you stretch or squash space, you change the probability "density" of the distribution. To keep our probabilities accurate, we need to track this change using the Jacobian Determinant of each transformation.
If we have a simple variable z with a known probability distribution p_Z(z), and we transform it using a function x = f(z), the new probability of x, p_X(x), is:

p_X(x) = p_Z(f^-1(x)) * |det J(f^-1)|

where J(f^-1) is the Jacobian of the inverse transformation, evaluated at x.
The determinant of the Jacobian essentially tells us how much our original distribution was squeezed (or stretched). This formula is the heart of every normalizing flow: it's how we can compute the exact likelihood of any data point (for training). Normally, calculating a determinant for an image-sized transformation is a computational nightmare. But because of our coupling layer's design, the Jacobian matrix is triangular. This means the log of the total "stretch factor" is just the sum of our scale values (the sum of the s's). It's a brilliant mathematical shortcut that turns a massive calculation into a simple addition.
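We can sanity-check the change-of-variables formula with a one-dimensional example. If x = 2z + 1 and z is standard normal, then x is exactly N(1, 2^2), and the density computed via the formula (with |det J| = 1/2 for the inverse map) must match the analytic one:

```python
import math

def normal_pdf(v, mu=0.0, sigma=1.0):
    """Density of a univariate normal distribution."""
    return math.exp(-((v - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# transform: x = f(z) = 2z + 1, so f^-1(x) = (x - 1) / 2 and |d f^-1 / dx| = 1/2
x = 0.7
z = (x - 1) / 2
p_x_via_flow = normal_pdf(z) * 0.5               # p_Z(f^-1(x)) * |det J|
p_x_analytic = normal_pdf(x, mu=1.0, sigma=2.0)  # x ~ N(1, 2^2) exactly
print(abs(p_x_via_flow - p_x_analytic) < 1e-12)  # True
```

The factor of 1/2 is exactly the "stretch" correction: doubling z spreads the same probability mass over twice the space, halving the density.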
It would be incredibly difficult to define one single, super-complex function that transforms samples from our simple normal distribution to pixels of a complex image.
Instead, normalizing flows use a clever strategy: they chain together many simple, reversible functions. Our original sample from the normal distribution 'flows' through a chain of coupling layers, each layer's output acting as the input for the next. The output of the final layer gives us our final pixels.
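The chaining idea can be sketched like this, using a hypothetical `ToyCoupling` layer on vectors (not the image model built later). Each layer feeds the next, and the per-layer log-determinants simply add up along the way:

```python
import torch
import torch.nn as nn

class ToyCoupling(nn.Module):
    """Illustrative 1-D affine coupling layer, used only to show chaining."""
    def __init__(self, dim, flip):
        super().__init__()
        self.flip = flip                     # alternate which half stays unchanged
        self.net = nn.Linear(dim // 2, dim)  # predicts scale and shift

    def forward(self, x, log_det):
        a, b = x.chunk(2, dim=-1)
        if self.flip:
            a, b = b, a
        s, t = self.net(a).chunk(2, dim=-1)
        s = torch.tanh(s)                    # keep the scale bounded for stability
        b = b * torch.exp(s) + t
        out = torch.cat([b, a] if self.flip else [a, b], dim=-1)
        return out, log_det + s.sum(dim=-1)  # log-dets add across layers

layers = nn.ModuleList(ToyCoupling(4, flip=(i % 2 == 1)) for i in range(6))
x = torch.randn(8, 4)
log_det = torch.zeros(8)
for layer in layers:                         # the sample 'flows' through the chain
    x, log_det = layer(x, log_det)
print(x.shape, log_det.shape)
```

Alternating which half is left unchanged (the `flip` flag) matters: without it, half of the dimensions would never be transformed at all.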
The goal of training a normalizing flow is to find the best settings for all our coupling layers so that the transformed base distribution looks exactly like our real data.
We use Maximum Likelihood Estimation. It sounds complex, but the idea is simple: we want to adjust the parameters of our model to maximize the total probability it assigns to the actual training data.
Here's how it works for a single data point (e.g., one image of a cat from our dataset):
1. Take a data point x sampled from our complex target distribution.
2. Run it backwards through the flow (x -> z). This gives us the corresponding point z in our simple base distribution (the sphere).
3. Compute the probability of z, which is easy because the base distribution is a known normal, p_Z(z).
4. Add the accumulated log-determinant correction to obtain the exact log-likelihood of x.

Let's move on to writing the code for our normalizing flow model.
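Putting those steps together, the per-batch training loss might be sketched as follows, assuming a standard-normal base distribution (`nll_loss` is an illustrative name, not part of the model code below):

```python
import math
import torch

def nll_loss(z, log_det_J):
    """Negative log-likelihood of a batch under a standard-normal base.
    z: output of the inverse pass x -> z, shape (batch, dim)
    log_det_J: accumulated log|det J| of that pass, shape (batch,)
    """
    dim = z.shape[1]
    log_p_z = -0.5 * (z ** 2).sum(dim=1) - 0.5 * dim * math.log(2 * math.pi)
    log_p_x = log_p_z + log_det_J  # change-of-variables correction
    return -log_p_x.mean()         # minimizing NLL = maximizing likelihood

z = torch.randn(16, 10)
loss = nll_loss(z, torch.zeros(16))
print(float(loss) > 0)  # True: NLL under this base is always positive
```

This scalar is what the optimizer minimizes; everything else in the model exists to make `z` and `log_det_J` cheap to compute.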
We first squeeze our odd-channelled 32x32x3 image into an even-channelled 16x16x12 tensor.
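The squeeze itself is a simple, invertible reshaping (space-to-depth): each 2x2 spatial block becomes four channels. A minimal sketch:

```python
import torch

def squeeze(x):
    """Space-to-depth: (B, C, H, W) -> (B, 4C, H/2, W/2), fully invertible."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // 2, 2, w // 2, 2)
    x = x.permute(0, 1, 3, 5, 2, 4)  # gather each 2x2 spatial block into channels
    return x.reshape(b, c * 4, h // 2, w // 2)

x = torch.randn(1, 3, 32, 32)
print(squeeze(x).shape)  # torch.Size([1, 12, 16, 16])
```

Because no information is discarded, the operation has an exact inverse (depth-to-space) and a log-determinant of zero.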
We will use the affine coupling layer for our invertible transformations. Here's how it works:
```python
class MultiChannelAffineCoupler(nn.Module):
    def __init__(self, input_size, mask_type):
        super().__init__()
        self.input_size = (input_size[0] // 2, input_size[1], input_size[2])  # c//2, h, w
        if mask_type not in [0, 1]:
            raise ValueError("Invalid mask type")
        self.mask_type = mask_type  # if 1 -> 'even' channels are unchanged, or vice versa
        self.net = nn.Sequential(
            nn.Conv2d(in_channels=self.input_size[0], out_channels=self.input_size[0] * 4,
                      kernel_size=5, padding='same'),
            nn.ReLU(),
            nn.Conv2d(in_channels=self.input_size[0] * 4, out_channels=self.input_size[0] * 32,
                      kernel_size=7, padding='same'),
            nn.ReLU(),
            # twice the input channels - one half for scale, one half for shift
            nn.Conv2d(in_channels=self.input_size[0] * 32, out_channels=self.input_size[0] * 2,
                      kernel_size=3, padding='same'),
        )

    def forward(self, x, log_det_J):  # inverse pass for training (x -> z)
        # split into even/odd channels depending on mask type; each half is b, c//2, h, w
        unchanged = x[:, self.mask_type::2, :, :]
        changed = x[:, (1 - self.mask_type)::2, :, :]
        st = self.net(unchanged)
        s, t = st.chunk(2, dim=1)
        s = torch.tanh(s)  # stabilize training by constraining the scale
        changed = changed * torch.exp(s) + t
        y = torch.zeros_like(x)  # zeros_like keeps the dtype/device of the input
        y[:, self.mask_type::2, :, :] = unchanged
        y[:, (1 - self.mask_type)::2, :, :] = changed
        # accumulate the log-determinant: the sum of the log-scales applied to the changed half
        log_det_J = log_det_J + torch.sum(s, dim=(1, 2, 3))
        return y, log_det_J

    def inverse(self, y):  # forward pass for generation (z -> x); reverses forward()
        unchanged = y[:, self.mask_type::2, :, :]
        changed = y[:, (1 - self.mask_type)::2, :, :]
        st = self.net(unchanged)
        s, t = st.chunk(2, dim=1)
        s = torch.tanh(s)  # must match forward() exactly, or the layer is no longer invertible
        changed = (changed - t) * torch.exp(-s)
        x = torch.zeros_like(y)
        x[:, self.mask_type::2, :, :] = unchanged
        x[:, (1 - self.mask_type)::2, :, :] = changed
        return x
```
Why is this so clever?
The Jacobian matrix, while complicated, is structured in a special way. Let's break the Jacobian into four blocks:

- Top-left (∂y_A/∂x_A): since the first half passes through unchanged (y_A = x_A), this block is the identity matrix (1s on the diagonal, 0s elsewhere).
- Top-right (∂y_A/∂x_B): y_A does not depend on x_B at all, so this block is all zeros.
- Bottom-left (∂y_B/∂x_A): a complicated, dense block, because the scale and shift for the second half depend on the first half. Happily, we never need to compute it.
- Bottom-right (∂y_B/∂x_B): a diagonal matrix containing the exp(s) scale values.

Because of the zero matrix in the top-right corner, this entire structure is a lower triangular matrix. Calculating its determinant is trivial: since the upper-right block is zero, it is just the product of the leading diagonal, i.e. the product of the exp(s) values.
The total "stretch factor" (determinant) of the entire flow is the product of the stretch factors from each individual layer. When we work with logarithms, this product becomes a sum, which is why we accumulate (sum up) the individual log-determinants for each layer. The sum of all the log-determinants gives us the total stretch (or squeeze) factor of all the transformations (coupling layers). We add this (like a correction factor) to the log-likelihood from our base distribution to calculate the total log-likelihood.
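This product-to-sum trick can be verified numerically on small triangular matrices standing in for per-layer Jacobians:

```python
import torch

torch.manual_seed(0)
# Two lower-triangular "layer Jacobians"; the eye offset keeps diagonals away from zero
J1 = torch.tril(torch.randn(4, 4)) + 2 * torch.eye(4)
J2 = torch.tril(torch.randn(4, 4)) + 2 * torch.eye(4)
# Each layer's log|det| is the sum of log|diagonal| entries,
# and the composed map's log|det| is the sum of the per-layer log|det|s.
log_det_sum = torch.diagonal(J1).abs().log().sum() + torch.diagonal(J2).abs().log().sum()
log_det_full = torch.linalg.det(J2 @ J1).abs().log()
print(torch.allclose(log_det_sum, log_det_full, atol=1e-4))
```

No full determinant ever needs to be computed during training; the sums of the tanh-bounded `s` tensors play exactly the role of the diagonal log-entries here.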
My grayscale (2D) image-generation model, trained on MNIST digits, had a total of 300k parameters. It worked OK. Sample output below -
For coloured (3D) image generation I used 12 affine coupling layers in my flow model, for a total of approximately 3 million parameters. The first iteration, trained for 2 epochs on the CelebFaces dataset, produced garbage. It is still being trained for a further 6 epochs, for a total of 10 epochs. Fingers crossed!