This post follows directly from my previous post on Vector Quantised Variational Auto Encoders. The VQ-VAE comprises an encoder that maps observations onto a sequence of discrete latent variables, and a decoder that reconstructs the observations from these discrete variables. Both encoder and decoder use a shared codebook. The encoder produces vectors, the quantiser replaces each encoder vector with its nearest codebook vector, & the decoder uses these quantised vectors to reconstruct the input image. I read about this method in the paper Neural Discrete Representation Learning by van den Oord et al.
The original paper talks of a VQ-VAE with a single quantiser: Encoder → Quantiser → Decoder.
The authors came up with another paper a year later, where they talk of Hierarchical VQ-VAEs. In their own words, they propose to use a hierarchy of vector quantized codes to model large images. The main motivation is to model local information, such as texture, separately from global information such as the shape and geometry of objects: the top level captures global features like shape & geometry, while the lower level captures local, finer details like texture. This "division of labor" is the main motivation behind the Hierarchical VQ-VAE. The model learns to encode the big picture and the tiny details separately, and that specialization leads to much higher-quality images.
The encoder's job is to take a high-resolution image and distill it down to a set of codes from our two specialized codebooks (top and bottom). Here’s how it does it, from the bottom up: the bottom-level encoder first compresses the image into a large feature map that carries the local detail, and a top-level encoder compresses that feature map further into a much smaller map that only has room for global structure. The top map is quantised first, and its quantised codes are fed back to the bottom level, so the bottom codes only have to store the detail the top codes could not capture. (A minimal sketch of this forward pass follows the decoder description below.)
Now the decoder gets to work, using both sets of codes to reconstruct the cat picture.
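To make the data flow concrete, here is a minimal sketch of the two-level forward pass in PyTorch-style code. All module and function names (`EncoderBottom`, `quant_top`, `dec_full`, and so on) are my own illustrative placeholders, not the paper's layers or my actual implementation.

```python
import torch
import torch.nn as nn

class HierarchicalVQVAE(nn.Module):
    """Two-level VQ-VAE sketch: top codes capture global structure,
    bottom codes capture local detail, conditioned on the top codes."""

    def __init__(self, enc_bottom, enc_top, quant_top, quant_bottom,
                 dec_top, upsample_top, dec_full):
        super().__init__()
        self.enc_bottom = enc_bottom      # image -> large feature map
        self.enc_top = enc_top            # large map -> small feature map
        self.quant_top = quant_top        # nearest-code lookup, top codebook
        self.quant_bottom = quant_bottom  # nearest-code lookup, bottom codebook
        self.dec_top = dec_top            # decodes top codes back to bottom resolution
        self.upsample_top = upsample_top  # upsamples top codes for the final decoder
        self.dec_full = dec_full          # reconstructs the image from both code maps

    def forward(self, x):
        # Bottom-up encoding: fine features first, then a coarser summary.
        h_bottom = self.enc_bottom(x)
        h_top = self.enc_top(h_bottom)

        # Quantise the top level first (global structure).
        q_top, loss_top = self.quant_top(h_top)

        # Condition the bottom level on the quantised top codes,
        # so it only has to encode what the top level missed.
        h_bottom = torch.cat([h_bottom, self.dec_top(q_top)], dim=1)
        q_bottom, loss_bottom = self.quant_bottom(h_bottom)

        # Decode the image from both code maps.
        x_recon = self.dec_full(torch.cat([self.upsample_top(q_top), q_bottom], dim=1))
        return x_recon, loss_top + loss_bottom
```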

The total training objective is a combination of a few key components, applied at each level of the hierarchy (i.e., for both the top and bottom codebooks). It is the same process used by the plain VQ-VAE, extended to each level of the hierarchy.
This is the most straightforward part. It measures how closely the final output image from the decoder matches the original input image. The model's primary goal is to make this difference as small as possible.
This is where the magic happens. Since choosing the nearest code from the codebook is like flicking a switch (a discrete choice), we can't backpropagate through it in the standard way. The model uses two special losses to learn the codebook and guide the encoder: a codebook loss, which pulls the chosen codebook vectors towards the encoder outputs (with a stop-gradient on the encoder), and a commitment loss, which pushes the encoder outputs towards their chosen codes (with a stop-gradient on the codebook), weighted by a coefficient β so the encoder doesn't drift away from the codebook.
The vector quantization step is non-differentiable. To get the reconstruction error signal back to the encoder, the model uses a straight-through estimator. During the backward pass of training, the gradient skips over the discrete quantization step and is copied directly from the decoder to the encoder.
In the hierarchical model, this entire process is simply done for both the top and bottom levels. The final loss is the reconstruction loss plus the VQ losses for the top codebook and the VQ losses for the bottom codebook, all added together.
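Putting those pieces together, here is a hedged sketch of the per-level quantisation step with the straight-through estimator and both VQ losses, assuming a PyTorch codebook stored as an `nn.Embedding`. The names and the β value are illustrative choices, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def quantise(z_e, codebook, beta=0.25):
    """z_e: encoder output of shape (B, C, H, W); codebook: nn.Embedding(K, C)."""
    B, C, H, W = z_e.shape
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, C)             # (B*H*W, C)

    # Nearest codebook entry for every encoder vector.
    dists = torch.cdist(flat, codebook.weight)                 # (B*H*W, K)
    indices = dists.argmin(dim=1)
    z_q = codebook(indices).view(B, H, W, C).permute(0, 3, 1, 2)

    # Codebook loss: move the codes towards the (frozen) encoder outputs.
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    # Commitment loss: keep the encoder outputs close to the (frozen) codes.
    commitment_loss = beta * F.mse_loss(z_e, z_q.detach())

    # Straight-through estimator: the forward pass uses z_q, but the backward
    # pass copies the decoder's gradient straight from z_q to z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices, codebook_loss + commitment_loss
```

The total loss is then just `recon_loss + vq_loss_top + vq_loss_bottom`, exactly as described above.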

After a couple of weeks of bashing my head against the wall, I was able to create a Hierarchical VQ-VAE that was good enough. I used 3 levels instead of 2, plus the inverted-bottleneck CNN block from the ConvNeXt paper. My earlier experiments had all ended in abject failure, with reconstructions looking like blurry blobs; the addition of the ConvNeXt block was the key. The final model had 12 million parameters. I trained it on 2 million 32x32 images of the Cifar100 dataset for 3 epochs. For the first time, I had to shell out to rent a decent GPU for training, all thanks to vast.ai.
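For reference, this is roughly what that inverted-bottleneck block looks like, sketched from my reading of the ConvNeXt paper with illustrative sizes. The norm here is a GroupNorm stand-in for ConvNeXt's channel-wise LayerNorm, and the layer-scale term is omitted for brevity.

```python
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Inverted bottleneck: depthwise 7x7 conv, 1x1 expansion to 4x the
    channels, GELU, 1x1 projection back down, and a residual connection."""

    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.GroupNorm(1, dim)   # stand-in for ConvNeXt's LayerNorm
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        return residual + x
```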

The left column shows the original images, while the right column shows the model's reconstructions.
After we’ve trained our hierarchical VQ VAE, its encoder can compress any image into a grid of discrete codes (e.g., a 32x32 grid of numbers from 0 to 63). The decoder can turn this grid of codes back into a realistic image. But what if we want to generate a completely new image? We need a way to create a new, valid grid of codes that looks “real”, one that would have come from a real image. We can't just pick random codes, as that would result in nonsense. This is where the prior model comes in. It's a separate model trained on the output of the VQ-VAE encoder. Its only job is to learn the structure and patterns in the compressed code feature maps.
The prior is typically an autoregressive model. This means it generates the grid of codes one by one, where the prediction for the next code depends on all the previous codes it has already generated. A powerful model such as PixelCNN++ or a Transformer is used. These models are excellent at capturing long-range dependencies in data. They use a technique called "masking" to ensure that when predicting a code at a certain position, the model can only see the codes that came before it (e.g., codes above it and to its left). The model is trained to predict the next code in the sequence. For example, given the first 100 codes in a 32x32 grid (1024 total codes), its goal is to predict the 101st code. It learns the conditional probability distribution $p(c_i \mid c_1, \dots, c_{i-1})$ over the codebook indices.
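In practice, training such a prior is just a per-position classification problem over the K codebook indices. A minimal sketch, assuming a PyTorch model `prior` whose masked architecture already guarantees that each output position only sees earlier codes:

```python
import torch.nn.functional as F

def prior_training_step(prior, codes):
    """codes: (B, H, W) integer codebook indices from the trained VQ-VAE encoder."""
    logits = prior(codes)                  # (B, K, H, W); masking ensures position (i, j)
                                           # only depends on codes above it / to its left
    loss = F.cross_entropy(logits, codes)  # next-code prediction at every position
    return loss
```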
A smaller autoregressive model is trained on the smaller, top-level code maps. This model learns the distribution of the high-level, global structures of the images. A second, much larger autoregressive model is trained on the bottom-level code maps. Crucially, this model is conditioned on the corresponding top-level codes. This means its prediction for a bottom-level detail code depends not only on the previous bottom-level codes but also on the overall structure provided by the entire top-level code map. To generate a new image from scratch, you would first use the top-level prior to generate a complete top-level code map. Then, you would feed that map into the bottom-level prior to generate the detailed bottom-level map. Finally, you would give both of these maps to the HVQ-VAE decoder to generate the final image.
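Sampling a new image then follows exactly that order. Here is a hedged sketch of the procedure; the function and argument names are illustrative, and the raster-order sampling loop is deliberately naive.

```python
import torch

@torch.no_grad()
def sample_autoregressively(prior, shape, condition=None):
    """Fill an (H, W) grid of code indices one position at a time, in raster order."""
    H, W = shape
    codes = torch.zeros(1, H, W, dtype=torch.long)
    for i in range(H):
        for j in range(W):
            logits = prior(codes) if condition is None else prior(codes, condition)
            probs = logits[0, :, i, j].softmax(dim=0)   # distribution over the K codes
            codes[0, i, j] = torch.multinomial(probs, 1)
    return codes

@torch.no_grad()
def generate(prior_top, prior_bottom, decoder, top_shape, bottom_shape):
    # 1. Sample a complete top-level code map (global structure).
    top_codes = sample_autoregressively(prior_top, top_shape)
    # 2. Sample the bottom-level map (local detail), conditioned on the full top map.
    bottom_codes = sample_autoregressively(prior_bottom, bottom_shape, condition=top_codes)
    # 3. Decode both code maps into an image with the HVQ-VAE decoder.
    return decoder(top_codes, bottom_codes)
```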
The authors of the original paper used PixelCNN for the priors, & so did I. I added a conditioning signal providing the coarse label of each image from the Cifar100 dataset. All the priors are CNN models with masked convolutions: the current pixel and every pixel to its right and below are zeroed out in the kernel, which prevents the model from seeing future pixels. This is characteristic of autoregressive models; we don't want our model to peek into the future. I also used the Gated Convolution blocks from the authors’ earlier paper. The conditioning signal (class labels) was fed into Residual Conditioning blocks interleaved with the gated convolutions.
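To illustrate the masking, here is a rough sketch of a PixelCNN-style masked convolution in PyTorch. The mask zeroes the kernel weights at the centre pixel and at everything to its right and below (the strictest, "type A" variant), which is what stops the model from peeking at future codes.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv layer whose kernel is zeroed at the centre position and at all
    'future' positions (to the right of the centre and in the rows below)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2:] = 0   # centre pixel and everything to its right
        mask[kH // 2 + 1:, :] = 0     # every row below the centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask   # re-apply the mask before every forward pass
        return super().forward(x)
```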
All in all, the 3 priors totalled about 1 million parameters. Too small, I know; & the results showed it. The conditional generation did not work well. I could not find any papers that condition on two signals, the class labels and the other level's feature maps. I might need more powerful & deeper priors to be able to conditionally generate images. More research needed. The colab notebook for my experiments/training is here.
Neural Discrete Representation Learning, van den Oord et al., 2017