Image Generation with Normalizing Flow Models
11 Feb 2026

A normalizing flow is a method for transforming a simple probability distribution (like a standard bell curve) into a much more complex and interesting one. Imagine you're a sculptor who starts with a perfect, simple sphere of clay. Your goal is to create a complex sculpture of a cat, so you squeeze, squish, and stretch (transform) the clay. At its heart, a normalizing flow is a sequence of invertible transformations. We start with a simple base probability distribution (a standard Normal distribution, $z$) and "flow" it through a series of transformations (or steps) to reach a complex distribution ($x$). Because every step is invertible, we can go in both directions:

  • Generation ($z \to x$): Sample from our base distribution and run it through all transformations.
  • Training ($x \to z$): Take a real image and "normalize" it back to our base distribution to calculate its likelihood.

How do you build a transformation that is complex enough to create images but simple enough to reverse? You use a coupling layer. Instead of transforming the whole image at once, we split it in half. We keep the first half ($x_A$) unchanged and use a neural network to look at it. That network then predicts a scale ($s$) and a shift ($t$) to apply to the second half ($x_B$): $y_B = x_B \cdot \exp(s) + t$. We use $\exp(s)$ because the scale must always be positive (and non-zero), which guarantees the operation can be undone. Because $x_A$ was never changed, we can always use it to re-calculate the exact same $s$ and $t$ to undo the operation later! This is how we enforce invertibility in the coupling layer.
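The round trip is easy to check numerically. Here is a minimal sketch (the tensor shapes and the stand-in `predict_s_t` function are illustrative, not the model built later in this post):

```python
import torch

torch.manual_seed(0)

# Split a toy "image" into two halves along the channel dimension.
x = torch.randn(1, 4, 8, 8)
x_A, x_B = x.chunk(2, dim=1)

# Stand-in for the neural network: any function of x_A alone works,
# because x_A passes through unchanged.
def predict_s_t(x_A):
    return torch.tanh(x_A * 0.5), x_A * 0.1  # (log-scale s, shift t)

# Forward: y_B = x_B * exp(s) + t
s, t = predict_s_t(x_A)
y_B = x_B * torch.exp(s) + t

# Inverse: recompute the SAME s, t from the untouched x_A, then undo.
s2, t2 = predict_s_t(x_A)
x_B_recovered = (y_B - t2) * torch.exp(-s2)

print(torch.allclose(x_B, x_B_recovered, atol=1e-6))  # True
```

The key point: the inverse never needs to invert the neural network itself, only re-run it on the unchanged half.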

In probability, when you stretch or squash space, you change the probability "density" of the distribution. To keep our probabilities accurate, we need to track this change using the Jacobian Determinant of each transformation.

If we have a simple variable $z$ with a known probability distribution $p_z(z)$ and we transform it using a function $x = f(z)$, the new probability of $x$, $p_x(x)$, is:

$$p_x(x) = p_z(f^{-1}(x)) \left| \det \left( \frac{\partial f^{-1}(x)}{\partial x} \right) \right|$$

The determinant of the Jacobian tells us how much our original distribution was squeezed (or stretched). This formula is the heart of every normalizing flow: it is how we compute the exact likelihood of any data point (for training). Normally, calculating a determinant for an image-sized Jacobian is a computational nightmare. But because of our coupling layer's design, the Jacobian matrix is triangular, so the log of the total "stretch factor" is just the sum of our scale values ($s$). It's a brilliant mathematical shortcut that turns a massive calculation into a simple addition.
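To see the formula in action, here is a 1-D sanity check with an affine map $x = f(z) = az + b$ (the values $a = 2$, $b = 3$ are arbitrary choices): the density computed via the change-of-variables formula matches the known density of $\mathcal{N}(b, a^2)$.

```python
import torch
from torch.distributions import Normal

base = Normal(0.0, 1.0)          # p_z: standard normal
a, b = 2.0, 3.0                  # x = f(z) = a*z + b, so f^{-1}(x) = (x - b) / a

x = torch.tensor(4.2)
z = (x - b) / a                  # f^{-1}(x)

# Change of variables: p_x(x) = p_z(f^{-1}(x)) * |d f^{-1} / dx| = p_z(z) / |a|
p_x = base.log_prob(z).exp() / abs(a)

# Ground truth: x ~ Normal(b, a)
truth = Normal(b, a).log_prob(x).exp()
print(torch.allclose(p_x, truth))  # True
```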


It would be incredibly difficult to define one single, super-complex function that transforms samples from our simple normal distribution to pixels of a complex image.

Instead, normalizing flows use a clever strategy: they chain together many simple, reversible functions. Our original sample from the normal distribution "flows" through a chain of coupling layers, each layer's output acting as input for the next. The output of the final layer is our generated image.
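As a sketch (the layer functions here are toy stand-ins, not coupling layers), a flow is just a list of invertible functions applied in order, with the inverse walking the list backwards:

```python
import torch

# Each "layer" is a (forward, inverse) pair of invertible functions.
layers = [
    (lambda z: z * 2.0, lambda x: x / 2.0),
    (lambda z: z + 1.0, lambda x: x - 1.0),
    (lambda z: z * 0.5, lambda x: x / 0.5),
]

def generate(z):           # z -> x: apply layers in order
    for fwd, _ in layers:
        z = fwd(z)
    return z

def normalize(x):          # x -> z: apply inverses in reverse order
    for _, inv in reversed(layers):
        x = inv(x)
    return x

z = torch.randn(5)
assert torch.allclose(normalize(generate(z)), z, atol=1e-6)
```

Because each piece is invertible, the whole chain is invertible, and the chain as a whole can be far more expressive than any single piece.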


The goal of training a normalizing flow is to find the best settings for all our coupling layers so that the transformed base distribution looks exactly like our real data.

We use Maximum Likelihood Estimation. It sounds complex, but the idea is simple: We want to adjust the parameters of our model to maximize the total probability it assigns to the actual training data.

Here's how it works for a single data point (e.g., one image of a cat from our dataset):

  1. Start with the Data: Take the cat image, which is a point x sampled from our complex target distribution.
  2. Go Inverse: Push this image backward through the entire flow (x -> z). This gives us the corresponding point z in our simple base distribution (the sphere).
  3. Calculate Base Probability: Since we know the exact formula for our simple base distribution (e.g., a Gaussian), we can easily calculate the probability of z, which is $p_z(z)$.
  4. Calculate the "Stretch": We also calculate the Jacobian determinant for each coupling layer along the inverse path. This tells us how much space was stretched or squished at each step.
  5. Find the Final Likelihood: We use the transformation formula we learned earlier to combine the base probability with all the Jacobian terms. This gives us the final probability, or likelihood, of our original cat image, $p_x(x)$.
  6. Adjust: The model then uses standard deep learning techniques (like gradient descent) to slightly tweak the parameters of all the layers to make that final likelihood number just a little bit higher. The model repeats this process for every image in the dataset, thousands of times. Over time, the flow gets better and better at mapping the simple sphere onto the distribution of real cat images.

Let’s move on to writing the code for our normalizing flow model.

We first apply a squeeze operation, turning our odd-channelled 32×32×3 image into an even-channelled 16×16×12 tensor, so it can be split evenly along the channel dimension.
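A common way to implement this squeeze (assuming the standard RealNVP-style space-to-depth trick; the function names here are my own) is a reshape/permute, which is itself trivially invertible:

```python
import torch

def squeeze(x, factor=2):
    """Trade spatial resolution for channels: (b, c, h, w) -> (b, c*4, h/2, w/2)."""
    b, c, h, w = x.shape
    x = x.view(b, c, h // factor, factor, w // factor, factor)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(b, c * factor * factor, h // factor, w // factor)

def unsqueeze(x, factor=2):
    """Exact inverse of squeeze: (b, c*4, h/2, w/2) -> (b, c, h, w)."""
    b, c, h, w = x.shape
    x = x.view(b, c // (factor * factor), factor, factor, h, w)
    x = x.permute(0, 1, 4, 2, 5, 3).contiguous()
    return x.view(b, c // (factor * factor), h * factor, w * factor)

img = torch.randn(1, 3, 32, 32)
out = squeeze(img)
print(out.shape)                         # torch.Size([1, 12, 16, 16])
assert torch.equal(unsqueeze(out), img)  # no arithmetic, so perfectly invertible
```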

We will use the affine coupling layer for our invertible transformations.

Here's how it works:

  1. It splits the input data into two halves, A and B, across the channel dimension.
  2. It leaves half A completely unchanged.
  3. It then uses a small neural network that only looks at half A to generate a log-scale ($s$) and a shift ($t$) value for each pixel in half B.
  4. It transforms half B using those values: $B_{new} = B \cdot \exp(s) + t$.
  5. Finally, it joins the unchanged half A and the new half B back together.

import torch
import torch.nn as nn

class MultiChannelAffineCoupler(nn.Module):
  def __init__(self, input_size, mask_type):
    super().__init__()
    self.input_size = (input_size[0]//2, input_size[1], input_size[2])  # (c//2, h, w)

    if mask_type not in [0, 1]: raise ValueError("Invalid mask type")
    self.mask_type = mask_type  # if 1 -> 'even' channels are unchanged, or vice versa

    # Predicts the scale and shift for the changed half, given the unchanged half.
    self.net = nn.Sequential(
      nn.Conv2d(in_channels=self.input_size[0], out_channels=self.input_size[0]*4, kernel_size=5, padding='same'),
      nn.ReLU(),
      nn.Conv2d(in_channels=self.input_size[0]*4, out_channels=self.input_size[0]*32, kernel_size=7, padding='same'),
      nn.ReLU(),
      nn.Conv2d(in_channels=self.input_size[0]*32, out_channels=self.input_size[0]*2, kernel_size=3, padding='same')  # twice the input channels - one for scale, one for shift
    )

  def forward(self, x, log_det_J): # Inverse pass for training (x -> z)
    # Even or odd channels depending on mask type; each half is (b, c//2, h, w)
    unchanged = x[:,self.mask_type::2,:,:]
    changed = x[:,(1-self.mask_type)::2,:,:]
    st = self.net(unchanged)
    s, t = st.chunk(2, dim=1)

    s = torch.tanh(s) # Stabilize training by constraining the log-scale
    changed = changed * torch.exp(s) + t

    y = torch.zeros_like(x)  # zeros_like keeps x's device and dtype
    y[:,self.mask_type::2,:,:] = unchanged
    y[:,(1-self.mask_type)::2,:,:] = changed

    # Accumulate the log-determinant: the sum of the log-scales s
    log_det_J = log_det_J + torch.sum(s, dim=(1,2,3))
    return y, log_det_J

  def inverse(self, y): # Forward pass for generation (z -> x), reverses forward()
    unchanged = y[:,self.mask_type::2,:,:]
    changed = y[:,(1-self.mask_type)::2,:,:]
    st = self.net(unchanged)
    s, t = st.chunk(2, dim=1)

    s = torch.tanh(s) # Must match forward(), otherwise the inverse is wrong
    changed = (changed - t) * torch.exp(-s)

    x = torch.empty_like(y)  # empty_like keeps y's device and dtype
    x[:,self.mask_type::2,:,:] = unchanged
    x[:,(1-self.mask_type)::2,:,:] = changed
    return x

Why is this so clever?

  • It's easily invertible: To go backward, we use the unchanged half A to regenerate the exact same $s$ and $t$, then reverse the math: $B = (B_{new} - t) \cdot \exp(-s)$.
  • The Jacobian is trivial: Because half the inputs don't change, the Jacobian matrix becomes triangular, and its determinant is just the product of the diagonal elements, i.e. the scaling factors $\exp(s)$ we applied. This is super fast to compute!

The Jacobian matrix, while complicated, is structured in a special way. Let's break the Jacobian into four blocks:

  • The top-left block shows how the first half of the output changes with the first half of the input. Since this part is unchanged ($y_A = x_A$), this block is the identity matrix (1s on the diagonal, 0s elsewhere).
  • The top-right block shows how the first half of the output changes with the second half of the input. Since the first half doesn't depend on the second, this is a Zero matrix.
  • The bottom-left block shows how the second half of the output changes with the first half of the input. This is a complicated, non-zero matrix because the scale and shift for the second half depend on the first half.
  • The bottom-right block shows how the second half of the output changes with the second half of the input. This is a simple diagonal matrix containing the scale values $\exp(s)$. The full Jacobian looks like this:

$$J = \begin{pmatrix} \mathbf{I} & \mathbf{0} \\ \frac{\partial y_B}{\partial x_A} & \text{diag}(\exp(s)) \end{pmatrix}$$

Because of the zero matrix in the top-right corner, this structure is (block) lower triangular. Calculating its determinant is trivial: with a zero in the upper-right block, the determinant is just the product of the leading diagonal.

The total "stretch factor" (determinant) of the entire flow is the product of the stretch factors from each individual layer. When we work with logarithms, this product becomes a sum, which is why we accumulate (sum up) the individual log-determinants of each layer. The sum of all the log-determinants gives us the total stretch (or squeeze) factor of all transformations (coupling layers). We add this (like a correction factor) to the log-likelihood under our base distribution to get the total log-likelihood.


My grayscale (2D) image-generation model, trained on MNIST digits, had a total of 300k parameters. It worked OK. Sample output below.

[image: normalizing flow MNIST digits]

For coloured (3D) image generation I used 12 affine coupling layers in my flow model, for a total of approx. 3 million parameters. The first iteration, trained for 2 epochs on the CelebFaces dataset, produced garbage. It is still being trained for a further 6 epochs for a total of 10 epochs. Fingers crossed!