Adam Kosiorek

Geometry in Text-to-Image Diffusion Models

Wed, 08 Mar 2023 16:23:00 +0000

Until recently, generative models were a bit like neural nets pre-2012 when AlexNet came out. People knew about them but kept asking what you could really use them for. Text-to-image models DALL-E and StableDiffusion, and the language model ChatGPT changed this–these models mark the AlexNet moment for generative modeling. The best part? These models are publicly available. So you can ask ChatGPT to tell a story while describing the scenery in detail every time it changes. Using a text-to-image model you can then translate that story into a movie¹, right? If this works, the movie will most likely contain changing cameras showing different parts of a 3D scene. As the camera moves, the scene might change according to the changing prompts used to generate the corresponding images. But will it work? Yes, kind of, but not out of the box, because these models do not have any mechanisms for generating 3D-consistent scenes. This blog will explore how we can use text-to-image models for generating 3D scenes–without retraining these models.

While no in-depth knowledge is required, it will be helpful to know what diffusion models and NeRF are. If you’d like to dig deeper, I recommend Sander Dieleman’s blog for an intro to diffusion models and a guide on how to make them conditional. For NeRF, check out the project website, and Frank Dellaert’s NeRF Explosion 2020 blog which provides a great overview of the history behind NeRF and its various extensions.

Coming back to stitching a movie from images: this is something you can use a text-to-image diffusion model for. Such image chaining is possible with diffusion models due to their ability to inpaint missing information (or to do image-to-image translation). We can just mask (think erase) a part of an image and ask a diffusion model to fill in the blank. The blank will generally be compatible with the unmasked parts of the image and the text prompt used to condition the model². See here for a cool demo of inpainting with StableDiffusion. This tutorial says a bit more about how StableDiffusion works and how inpainting is done. Note that if you can inpaint, you can also outpaint: by simply translating the image to the left, you can pretend that you masked the right side of the image (which is non-existent, but it doesn’t matter). The model will complete that right side of the image, effectively extending it.

So if you wanted to create an illusion of moving in a 3D scene represented by an image, you could just downscale that image (to move away) or upscale it (to move closer), and have the diffusion model fix any artifacts, right? The issue is that zooming out scales down everything the same way, but scaling as you move should depend on the distance from the camera (depth); Also you cannot walk forward, walk through doors, model occlusions or walk around and come back to the same place–the result would not be consistent with the previously-generated images. Here’s an example of what zooming out ad infinitum looks like.

To make the above work well, we would need to model not only the views of a given scene (images), but also the geometry (where things are, and where the camera that captured those views was). If we have the geometry, we can explicitly move the camera into a new position and capture the next image from there. If you can do this, you unlock a plethora of additional applications like generating whole scenes or 3D assets for virtual reality, computer games, or special effects in movies, for interior design, or any other artistic endeavor, really.

But building generative models of 3D scenes or objects is not easy. In my work, I focused on VAE-based generative models of NeRFs (NeRF-VAE and Laser-NV). In principle, these models offer very similar capabilities³. In practice, the quality of the generated 3D content is far far behind what text-to-image diffusion models generate these days. One reason is a different framework: GAUDI employs the diffusion modeling techniques used for image generation and applies them to 3D, which does result in better results than VAEs can provide. However, the model quality is still limited by the lack of high-quality 3D data.

While it is easy to scrape billions of images and associated text captions from the Internet, this isn’t the case for 3D. To do 3D modeling with NeRF (used in my work and in GAUDI above), you need several images and associated camera viewpoints for every scene, and, if you want to reach the scale of text-to-image models, you need millions if not billions of scenes in your dataset. This data does not exist on the Internet, because that’s not how people take (or post) pictures. Considering the scale, manually capturing such datasets is out of the question. The only respite is video, where different frames are captured from slightly different viewpoints, but video modeling opens up another can of worms: since the scene isn’t static, it is difficult to learn a scene representation that will be consistent across views (that preserves the geometry). The video diffusion models certainly do not offer multi-view consistency (Imagen Video, Make-a-Video). Nevertheless, video modeling with NeRF-based generative models is the most promising direction for future large-scale 3D models.

Text-to-Image Models Know About Geometry

But here’s the thing. We can play with the text-to-image models by manipulating the text prompt, which then shows that these models know about geometry. Perhaps the best example of this is DreamBooth.

Fig 4: DreamBooth allows one to associate a specific object with a text token and then place that token within different text prompts.

If text-to-image models really know about 3D geometry, maybe we don’t need all that 3D data. Maybe we can just use the image models and either extract their 3D knowledge or perhaps somehow nudge them to preserve geometry across multiple generated images. It turns out that both approaches are possible, do not require re-training of the text-to-image models, and correspond to extracting geometry from an image model (DreamFusion and Score Jacobian Chaining (SJC)), and injecting geometry into an image model (SceneScape), respectively.

Extracting Geometry from an Image Model

Given that text-to-image diffusion models⁴ can generate pretty pictures and know about geometry, it is natural to ask if we can extract that geometry from these models. That is, can we lift a generated 2D picture to a full 3D scene? The answer is, of course, yes. But why does it work? Because any 2D rendering of a 3D representation is an image, and if that representation contains a scene familiar to the image model (i.e. in the model distribution), that rendered image should have a high likelihood under the image model. Conversely, if the represented scene is not familiar to the image model, the rendered image will have a low likelihood. Therefore, if we start from a random scene, the rendered images will have a low likelihood under the image model. But if we then manage to compute the gradients of the image model likelihood with respect to the 3D representation, we’ll be able to nudge the 3D representation into something that has a bit higher likelihood under that image model. Although they differ in derivations, both DreamFusion and SJC come up with novel image-space losses that capture the score (the derivative of the log probability) of a NeRF-rendered image under a pre-trained large-scale text-to-image diffusion model that is then back-propagated onto the NeRF parameters.

In theory, you don’t even have to use a diffusion model: any image model that can score a rendered image will do, including a VAE or any energy-based model including a GAN discriminator, a contrastive model such as CLIP or even a classifier. Check out DreamFields which uses CLIP to generate images and the DreamFusion and RealFusion papers (described below), which compare diffusion score against CLIP for training a NeRF. As Ben Poole pointed out, this may not work well in practice, since modes do not usually look like samples (see Sander’s blog on typicality), and likelihood from a VAE or EBM may fail in high dimensions.

The next few subsections describe technical details and follow-ups that are self-contained and not necessary for understanding the remainder of the blog. Feel free to skip some of them (but do take a look at the figures to see the results).

DreamFusion/SJC Algorithm

Fig 5: Extracting geometry from a text-to-image model into a NeRF, taken from DreamFusion.

The simplified algorithm is as follows (the DreamFusion version):

Initialize a random NeRF and pick a text prompt for the diffusion model.
Pick a random camera pose.
Render an image at that camera pose using the NeRF.
Compute the score-matching loss under a pre-trained diffusion model.
Use the score-matching loss as a gradient with respect to the rendered image, and backpropagate it to NeRF’s parameters.
Go to step 2.

Of course, life is never that easy, and DreamFusion comes with several hacks, including changing the text prompt based on the sampled camera pose, clipping the scene represented by the NeRF to a small ball around the origin (any densities outside of the ball are set to zero), putting the rendered object on different backgrounds, additional losses that ensure e.g. that most of the space is unoccupied or that normals are well-behaved. Most of these tricks are designed to reveal bad learned geometry under the NeRF.e

Why Does Extracting Geometry Lead to Cartoonish Objects?

Fig 6: Images + depth maps generated by extracting geometry from StableDiffusion, taken from Score Jacobian Chaining.

As you can see in the above examples, extracting geometry from image models can produce nice but cartoon-ish looking 3D models of single objects which are rather poor quality. You can get higher quality with heavily engineered approaches like Magic3D, but the algorithm is not as pretty.

Why does the simple version not work that well? While no one really knows, I have some theories. First, the 3D representation is initialized with a random NeRF, which leads to rendered images that look like random noise. In this case, the diffusion model will denoise each of these images towards a different image as opposed to different views of the same scene. This makes it difficult to get the optimization off the ground, which may lead to training instabilities and lower final quality. Second, this approach relies on classifier-free guidance with a very high guidance weight, which decreases the variance of the distribution (and its multimodality, see the end of this blog for a further discussion).

Why Only Objects? What Happened to Full 3D Scenes?

Beyond just the low-ish quality, the “scenes” generated by extracting geometry into a NeRF show single objects as opposed to full open-ended outdoor or indoor scenes. This is at least partly associated with the distribution of the cameras. If you are trying to model a general 3D scene (a part of a city or an apartment), the distribution of viable cameras is tightly coupled to the layout of the scene. In an apartment, say, randomly sampling cameras will yield cameras that are within walls and other objects. This will result in an empty image, which is unlikely under the model. Optimization in such a case will lead to removing any objects that occlude the scene from the camera: in this case, it will remove everything, resulting in an empty scene. This is precisely why GAUDI models the joint distribution of indoor scenes and camera distributions (private correspondence with the authors).

View-Conditioned Follow-ups

Next, I’d like to describe RealFusion and NerfDiff: two different takes at extracting geometry from a diffusion model but in such a way that extracted geometry (NeRF) is consistent with a provided image. RealFusion is a view-conditioned version of DreamFusion. It does everything that DreamFusion does, but instead of a vanilla text-to-image diffusion model, the authors use DreamBooth to constrain the diffusion model to a specific object shown by a target image. In addition to forcing the NeRF to represent that object, it should result in lower-variance gradients for the NeRF and therefore better NeRF quality.

NerfDiff is similar, but instead of fitting a NeRF from scratch, the authors train a view-conditioned (amortized) NeRF. Another difference is that instead of using a pretrained text-to-image diffusion model, NerfDiff fits a custom view-conditioned (not text-conditioned) diffusion model jointly with the amortized NeRF on the target dataset of scenes. Why? Because diffusion models tend to achieve much better image quality than amortized NeRFs at the cost of not being consistent across different views. The amortized NeRF allows a fast NeRF initialization from a single image, which is then fine-tuned with distillation from the diffusion model. The authors also introduce a novel distillation algorithm that improves on DreamFusion/SJC quite a bit (but is quite a bit more expensive). NerfDiff can produce NeRFs only from images that are similar to the training images; RealFusion doesn’t have this issue because it uses a pretrained large-scale diffusion model.

Injecting Geometry into an Image Model

This idea is almost the polar opposite: instead of distilling geometry from the image model and putting it somewhere else, we will use our understanding of 3D geometry to guide the image model to generate images that look like they represent the same scene but are generated from different camera poses.

The main insight behind the SceneScape algorithm is that an image diffusion model can correct image imperfections with its superb inpainting abilities. Now imagine that we have an image captured from a given camera position, and we pretend to move to a different camera position. Can you imagine how that image would look from the new viewpoint? You will mostly see the same things, just from a different distance and angle; some things will now be missing, and you will see some parts of the scene that you were not able to see before. It turns out that you can do this operation analytically by warping the original image into the new viewpoint. Warping results in an imperfect image:

Specularities and other view-dependent lighting effects will be incorrect.
It will have holes because not everything was observed.

But mostly, the image will look ok. The diffusion model can fill in the holes, and possibly even fix the lighting artifacts: there you go, we just created a new image, taken from a different camera position, that is geometrically consistent (distances are respected) and semantically consistent (the things visible in the first image are still there and are the same). The best part? We used an off-the-shelf pretrained image model. It doesn’t even have to be a diffusion model: all we need is the inpainting ability.

Fig 7: SceneScape is a bit more advanced than the simplified algorithm described above, but the idea is the same.

Technical: SceneScape Algorithm

A naive version of the SceneScape algorithm requires:

a pretrained text-to-image diffusion model capable of inpainting missing values,
a pretrained depth-from-a-single-image predictor (required for warping (above) or mesh building (below)),
a text prompt,
and optionally an image to start from,
and a method to infer intrinsic camera parameters for an RGBD image.

We then do the following:

Generate an initial image (or use the one you want to start with). Initialize the camera position and orientation to an arbitrary value.
Predict the depth for that image.
Infer intrinsics for the RGBD image that you now have. You will only have to do this once as hopefully, the diffusion model will preserve the camera parameters when inpainting missing values.
Change the camera position and orientation.
Project the previously-generated RGBD images onto the new camera pose (this is where intrinsics come into play). It will contain holes.
Feed the projected RGB image into the diffusion model and fill in any missing values. Go to step 2.

In the paper, the authors start by generating an image from a text prompt. Camera intrinsics are necessary to render previously-generated RGBD images onto a new camera position. The paper assumes just an arbitrary fixed camera model, which introduces errors, but apparently the diffusion model is able to fix that, too. I augmented the algorithm a little to allow starting from a real image and to reduce the reprojection errors from incorrect camera intrinsics.

Fig 8: SceneScape is a bit more advanced than the simplified algorithm described above, but the idea is the same.

Only it turns out that there are rough edges that need to be smoothed out (as done in the paper):

Reprojection from previously captured RGBD images is not great and is much better done by building a mesh as a global scene representation.
The depth predicted from single images is inconsistent across the images (the differences between depth do not respect the changes in camera position), so the authors fine-tune the depth predictor: after projecting the mesh on a new camera they fine-tune the depth predictor to agree with the depth that came out from that projection. Once the depth predictor agrees with the mesh, we can predict the values for the holes in the depth map. This requires optimization of the depth predictor at every generated frame. The authors don’t mention how many gradient steps it takes.
The authors use StableDiffusion as their text-to-image model, which is a Latent Diffusion model operating on embeddings of a VAE trained with perceptual and adversarial losses. Since the VAE did not optmize reconstruction error, autoencoding results in somewhat low reconstruction quality. Therefore, to reconstruct an image that fits visually with previously-observed frame, the authors need to finetune the VAE decoder to improve its reconstruction quality. Similarly to the depth predictor, they first optimize it so that it agrees on these parts of the image that are reprojected from the mesh and then use the finetuned decoder to fill in any holes (RGB and depth will have the same holes).
Lastly, the inpainted part of the frame may not agree semantically with the text prompt very well; they generate multiple frames and then use cosine distance between the CLIP embeddings of the text and the generated frames to choose the frame that is best aligned with the prompt.

Limitations:

The mesh representation doesn’t work well for outdoor scenes (depth discontinuities between objects and the sky).
There is error accumulation in long generated sequences that sometimes lead to less-than-realistic results.

While not stated in the paper, Rafail mentioned that they finetune the depth predictor and the VAE decoder for 300 and 100 gradients steps, respectively. It takes about an hour to generate 50 frames on a Tesla V100.

Why is it important for the image model to be text conditioned?

I left this discussion until after describing the two approaches of extracting and injecting geometry because it requires understanding some technical details about how these methods work.

Generally speaking, modeling conditional probability distributions is easier than modeling unconditional ones. This may seem counter-intuitive at first, because to modal a conditional probability \(p(x \mid z)\) you have to learn the relationship between \(x\) and \(z\), which you don’t have to do if you are modeling just \(p(x)\). While that is true, \(p(x)\) is generally a much more complicated object than \(p(x \mid z)\). To see this, look at a Gaussian mixture with K components. In this case, to recover the true \(p(x)\) with a learned \(\widetilde{p}(x)\), we have to parametrize \(\widetilde{p}(x)\) with a family of distributions expressive enough to cover the 10 different modes. If, however, we model the conditional \(p(x \mid z)\) where \(z\) now is an index telling us which mode we care about, the learned \(\widetilde{p}(x \mid z)\) has to model just one mode at a time. In this example, it can be just a Gaussian. A larger-scale example is that of ImageNet with 1000 different classes. In that case, you can think of the data distribution as a mixture of 1000 components, but now the components are very high-dimensional (images of shape 224x224x3), and the individual components are highly non-Gaussian, so the problem is much more difficult. Modeling conditionals in this case is way simpler.

So what does this have to do with image models and geometry?

I did some experiments with a DreamFusion-like setup, where I played with an unconditional and a view-conditional image model trained from scratch on a smaller dataset. It turns out that if the image model is unconditional, the gradients that it produces to train the NeRF point in a multitude of different directions. What happens in practice is that the NeRF initially starts to represent a scene, but eventually that scene disappears and the NeRF represents just empty space. This changes when we introduce conditioning: either a text prompt describing an object (like in DreamFusion or SJC), or an image (like in RealFusion or NerfDiff). The bottom line: too many modes lead to too high a variance of the gradients used to train the NeRF. Decreasing the number of modes leads to better-behaved gradients and thus learning.

A very similar argument applies to injecting geometry into an image model. One of the limitations of SceneScape is the accumulation of errors. This is partly mitigated by generating more than just one inpainting of the image from a new camera position, and then choosing the one that best aligns with the text prompt under CLIP similarity. So if the distribution of the image model had many more modes (if it was unconditional), it would be much more likely to inpaint missing parts of the image in a way that is not very consistent with the presented image, leading to faster error accumulation. If the model wasn’t text-conditioned, the authors couldn’t have done the CLIP trick of choosing the most suitable image in the first place, which would have significantly exacerbated the error accumulation.

So we see that the ability to model insanely complex distributions (unconditional distributions of real images) is counter-productive. Perhaps that’s ok because whenever we want to generate an image, we would like to have some control over what we’re generating. However, this suggests a future failure case. As the generative models get bigger, more expressive, and trained on more data, they will represent distributions with more and more modes. This is true even for conditional models. Does it mean that, with the advances in generative modeling, the approaches of injecting and extracting geometry (and anything that requires constraining the variance of the distribution) will stop working? As with anything, there will be workarounds. But it’s an interesting failure case to keep in mind.

Conclusions

While I’m not sure what I said with this blog, what I wanted to say is this⁵. There is value in making generative models. Ideally, we would be able to train such models on large datasets of 3D assets, or from videos. But this is difficult because there isn’t enough 3D data, and modeling videos while also modeling the geometry of the underlying scenes is tricky. So if it suits your application, why not try a simpler approach? Maybe you can take an off-the-shelf text-to-image diffusion model, and then massage it a bit so that it gives you a 3D model instead of just a 2D image. There you go.

Acknowledgements

I would like to thank Heiko Strathmann and Danilo J. Rezende for numerous discussions about topics covered in this blog. I also thank Jimmy Shi, Hyunjik Kim, Leonard Hasenclever, Adam Goliński, and Heiko for feedback on an initial version of this post.

Also thanks to Rafail Fridman and Ben Poole who provided feedback on the SceneScape and DreamFusion coverage in this blog, respectively.

Footnotes

Getting a nice picture out of a text-to-image model may require tinkering with the prompt a bit. It’s not as easy as one might think. It’s called prompt engineering. The example above works in principle because it’s just an elaborate example of prompt engineering. ↩
You can also place a fragment of a different image in the masked part to seed the result. E.g. in the demo above the author erases a part of the foreground, puts a lamp in there, and lets the model do its magic. The result is a lamp that fits stylistically with the rest of the image. ↩
We never did text-conditional modeling, but it’s easy to add text-conditioning to the prior if you have paired text-3D data. ↩
It doesn’t even have to be DreamBooth; standard text-to-image models know just as much about geometry. Unlike in DreamBooth, though, diffusion models will render different scenes for different prompts, so it’s harder to verify that different prompts do, in fact, correspond to different views. ↩
This is a paraphrase of Neil Gaiman from one of his speeches, taken from his book “The View from the Cheap Seats: Selected Nonfiction”. ↩

Masking for Representation Learning in Vision

Mon, 04 Jul 2022 10:59:00 +0000

Masked-image modeling (MIM) is about inpainting; that is, covering parts of an image and then trying to recover what was hidden from what is left. Recently, it has led to state-of-the-art representation learning in images¹. In this blog, I will dive into why masked images deliver such a powerful learning signal, think about what may constitute a good mask, and discuss my recent paper (ADIOS) which attempts to learn good masks for representation learning. But let’s start with some motivation.

Masking and the Brain

Have you ever covered an object you saw with your hand and tried to imagine what the covered part looks like? If not, why not give it a try? You may be unable to draw or paint it since that requires considerable skill. You may not even be able to see it clearly in your mind’s eye. Yet, you know what it is or what it can be used for—in other words, you have a good representation of it. Getting such representations is, roughly, the goal behind masked-image modeling (MIM).

Reconstructing the hidden part from the visible parts is called image inpainting, or more generally, missing-data imputation.² While MIM models are usually trained via image inpainting, we will see later on that reconstruction is not always necessary for learning good represetations. But actually, this is what your brain is doing all the time!

Fig 1: Blind spot of the human eye. The illustration is thanks to George Retseck.

Each of your eyes has a visual blind spot; as shown in the figure above. It’s roughly in the middle vertically, and slightly off-center to the outside for each eye. You don’t see anything there because it’s the place where the optic nerve connects to the eye, leaving no place for photoreceptors. And yet, you are unaware that any information is missing: you seem to see what is hidden. See for yourself!

Fig 2: Test your blind spot: cover your left eye, and focus your right eye on the plus (or do the opposite for the left eye). Move closer to the screen, such that the distance to your face is roughly three times the distance between symbols. Move your head back and forth. At some point, the circle should disappear. That's your blind spot! Inspired by Wikipedia.

If you followed the test in in Fig. 2, you know that the blind spot exist—and that we should be experiencing its effects whenever we use our eyes to observe the world. How come this is not so? This is where the magic of unconscious perception comes in: our brain inpaints the “blinded” area for us, to the point where we don’t even know that blind spots exist! This may be based on what is around that area but also using the view from the other eye (novel-view synthesis³) and what the brain is expecting to see in a given context.

I expect that the ability to inpaint in the brain is not innate and that the brain has to learn how to do it. If this is the case, is this something that guides the brain in learning good visual representations? Given the very impressive representation learning results of recent MIM models, I wouldn’t be surprised if it was the case.

It is also interesting that even though the brain is really good at inpainting (people don’t usually know about their blind spots) or imagining (e.g., vivid dreams), this is not a capability we control consciously. Think about that object you covered: you know what it is, but you probably cannot project a pixel-perfect rendering in your mind. This is rarely problematic, because conscious reasoning relies on high-level abstractions, not pixel-level detail. Since the representations we try to learn are usually used in such higher-level reasoning tasks, perhaps reconstruction is not the right way to go?

We will come back to this question later. For now, we will look at a few methods that do involve reconstruction.

BERT or Why Inpaint for Representation Learning?

Because it works—as shown by BERT of Devlin et al. in 2018. BERT is a large transformer trained to fill in missing words in natural language sentences based on the available words. Why is this useful? Because words typically represent concrete objects or abstract entities, their properties, and relations between them. To predict which word makes sense in the presence of other words, is to analyze what objects and with what properties are represented in that sentence, and what the relations between them are. A model that learns to do that learns many truths about the world⁴.

So why not do this for vision? You can, but it is not as straightforward as pushing a masked image through a CNN. If that’s what you do, you get the Context Encoder (CE) by Pathak et al., which came out in 2016, two years before BERT. CE used a small CNN (AlexNet-based) in an encoder-decoder setup. The images are either masked by a single large-ish rectangle, multiple smaller rectangles, or the ground-truth segmentation mask from another image. While the learned representations are ok, their performance is far behind supervised models of the time, even when fine-tuned.

Fig 3: Context Encoder; from left: masked input, reconstruction, three examples of different masks used for CE.

Why? First, there is an architectural issue. CNNs are great at correlating pixels. But filling-in missing words is about reasoning about objects, parts, properties, and relations. This is what transformers are really good at, but at the time, there was no good way of using transformers for vision. Second, there is a representation issue. Words in natural language are fundamentally different from pixels in images. So masking just a few random rectangles is unlikely to bear similar results to masking words.

It was the Masked Autoencoder (MAE) by He et al. that finally proved that image inpainting can lead to state-of-the-art representations for images. Coming five years after CE, it brought in recent advances. The encoder is a large vision transformer (ViT, Dosovitskiy et al.). The image is split into a rectangular grid, as in ViT, and a number of grid elements are masked. This paper provides two insights:

The representation quality improves with the fraction of the image masked (up to a point).
Instead of feeding an image with masked parts to the encoder, it is better to just not use the masked parts as an input⁵. This is easy to do for an image-divided-into-patches and a transformer like in ViT, but next to impossible for a CNN.

Fig 4: MAE architecture; note that the masked patches are not fed into the encoder.

MAE masks consist of small, randomly-scattered rectangles corresponding to the ViT image patches. They cover 75% of the image, which is significantly more than in CE⁶.

Why is this important? Because masking a large proportion of the image makes it more likely to mask visual words.

What is a Visual Word?

A word typically represents an entity, its property, or a relation between entities. A pixel represents a color. A visual word is a group of pixels, but it is not a random group. Rather, it’s a group of pixels that represents something meaningful like an object, but also a property or a relation.⁷. Imagine a man wearing a red jacket.

To mask “red”, we need to occlude most of a jacket, perhaps leaving its outline.
To mask “jacket” without masking its color, we can mask its outline but leave a pixel here or there.
To mask the fact that someone is wearing the jacket, we need to mask out a person while leaving fragments of the jacket.

Such groupings are far from random and are extremely unlikely to occur with random masks. Masking a significant area of the image, like in MAE, makes it easier to occlude visual words (whole entities, say). As the paper shows, such masks are also better for representation learning. Still, masking properties or relations remains difficult under that scheme.

Finding Visual Words

Let’s assume that, for representation learning, masking single words in natural language sentences is the best thing to do. How do we get such visual-word masks for images?

We would need to identify image regions that are similar in meaning to words. Object bounding boxes or segmentation masks would be a good choice if not for two issues. First, they usually cover objects, with no masks or boxes describing relations between objects or parts thereof⁸. Second, they are human-generated, which defeats the purpose of unsupervised learning. Let’s explore alternatives.

Visual Words from Before Deep-Learning

The concept of a visual word has been studied before in the pre-deep-learning era. Inspired by bag-of-words classifiers for natural language (e.g., an SVM operating on word histograms, the so-called bags of words), people constructed visual bag-of-words classifiers.

Fig 5: The visual bag-of-words framework.

Dictionaries of visual words were built by running a SIFT or SURF keypoint detector on a dataset of images, describing these keypoints with relevant descriptors (SIFT, SURF, HOG), and then clustering them. The cluster centroids represented a new visual grammar. A new image could be classified by creating a histogram of such visual words and feeding it into an SVM, say. A visual word like that could correspond to an eye or a car wheel. While I haven’t tried it, it would be interesting to adapt this paradigm for MIM.

Learning Visual Words

The modern alternative is to learn what a visual word is. To understand how visual words can be learned, let’s think about what categories of masks we can expect. We can do it by looking at some air balloons.

Fig 6: Inpainting a part of an object or background is easy. Inpainting a whole object is difficult. Adapted from Klaus Greff's talk "What are Objects?".

If you occlude a piece of the background, in this case, an empty sky, you can easily fill that piece in. If you hide a random part of an object, you can easily imagine that hidden part—its contents are largely defined by the visible parts of the object. If you hide a semantically-meaningful piece of an object, e.g., the balloon part of an air balloon, you have a somewhat harder task. Now you know that there should be a balloon because you can see a basket. Based on the context, you know that it probably belongs under a balloon. But the balloon can have a range of sizes and can be painted in many different ways, which increases the difficulty of the task. Finally, you can hide the whole object. This is virtually indistinguishable from hiding a piece of the background. You will have a hard time figuring out what the object was or if there was an object at all. The only way to do this is to check if it would make sense for any particular object to be there, given the visible surroundings.

This gradation of difficulty in different masking scenarios stems from the fact that some pixels are predictable⁹ from each other, while others are not. For a longer discussion, see section 4.1.1. of Greff et al, “On the Binding Problem in Artificial Neural Networks”. For now:

Pixels belonging to an object are strongly correlated with each other.
Pixels belonging to different objects or an object and the background are not correlated or are correlated only very weakly¹⁰.

By now, this is a widely-accepted view. I would go a step further and say that pixels representing a relation⁷ (e.g., two objects that often appear together), or a property, are also strongly correlated; therefore, they are possible to infer from a partial observation.

The above intuition can be formalized as a training objective. This is exactly what we do in Shi et al., “Adversarial Masking for Self-Supervised Learning”, ICML 2022 (code).

Adversarial Inference-Occlusion Self-supervision (ADIOS)

ADIOS is a reconstruction-free MIM model that learns to mask in an adversarial fashion.

Imagine a setup where you try to inpaint an image with some parts occluded. To get the mask, we instantiate a masking model whose job is to make inpainting as difficult as possible, subject to some constraints (see below). The result? You get masks that seem to hide objects or their parts. You also get better representation learning results than with using MAE’s masks¹¹.

What constrains masking whole objects, or the entire image for that matter? First, we predict several masks while making sure that each pixel is masked only once. Second, we penalize the masks so that they cannot be all black or all white. These two constraints mean that none of the predicted masks can cover the whole image and that the image must be partitioned between all masks. Third, there are built-in inductive biases in the form of the masking net architecture (Convolutional UNet pays more attention to texture than semantics) and the encoder architecture (ViT seems to result in masks that look more semantically-meaningful than when a ResNet is used).

Recall that MIM models are trained by reconstructing occluded images, similar to how the brain inpaints the visual blind spot. But since we are not interested in pixel-perfect detail but rather high-level, conscious-like reasoning abilities, we may be able to get away without reconstruction. That’s why we resort to reconstruction-free representation learning (RFL)¹².

Fig 7: Masks generated by ADIOS on the STL-10 dataset. There are six color-coded masks for each image. While some parts appear random, some clearly cover object parts.

ADIOS applies to any siamese-style representation learning algorithm (contrastive or otherwise) where the training is done by minimizing some distance between representations. Here we compare a generic algorithm with its ADIOS-augmented version. The ADIOS-specific parts are highlighted in green.

Generic Reconstruction-Free Learning

Take an image x.
Create two views of that image, a and b.
skip
Encode the views with a neural net with parameters θ to get two representations z_a and z_b.
Compute a loss L(z_a, z_b).
Update the parameters of the neural net(s) by minimising that loss with respect to θ.

ADIOS

Take an image x.
Create two views of that image, a and b.
Predict a mask m = mask(b) with a neural net with parameters φ. Apply that mask to b.
Encode the views with a neural net with parameters θ to get two representations z_a and z_b.
Compute a loss L(z_a, z_b).
Update the parameters of the neural net(s) by minimising that loss with respect to θ and maximising with respect to φ.

In ADIOS, we want one of the image views, say b, to be masked. The mask m = mask(b) is conditioned on the image and is predicted by another neural net with parameters \(\phi\). We get a masked image \(b^m = b \circ m\) by applying the mask to the image (via element-wise multiplication \(\circ\)), and extract representation \(z_b^m\). At the end, in addition to updating the encoder’s parameters, we also update the parameters of the masking neural net by maximizing the loss L with respect to \(\phi\).

That’s it! It’s simple, isn’t it? A cool thing is that is works with many different RFL objectives (we tried BYOL, SimCLR, and SimSiam), and it improves representation learning performance on every dataset and task we tried. Additionally, ADIOS improves robustness to non-adversarial attacks (e.g., changing the background behind an object), presumably due to decreasing sensitivity to spurious correlations (these are often masked separately from the object due to the correlation structure discussed above).

How Does Masking Apply to Reconstruction-Free Learning (RFL)?

RFL minimizes the distance between representations extracted from two views of the same image. That distance is minimized when the encoders are invariant to the transformations applied to the source image. Here is a simple example: if we use a color image and a grayscale version of that same image, we will get a representation that encodes the content (e.g., objects) and even brightness, but not the hue. Hence, we say, the representation is invariant to hue variations. See Fabian Fuchs’ blog for a longer discussion of equivariance and invariance.

Using a masked image as one of the views means that we want a representation that is invariant to masking. There are two ways to do this:

Ignore any region that can be masked.
If a region is masked, try to predict what was there before masking.

Option 1. means encoding no information (representation collapse) and is usually incompatible with any good learning objective. That leaves option 2. and forces the model to reason about occluded parts. The masking model is trying to make 2. more difficult. Hence it learns to mask strongly-correlated groups of pixels, which often correspond to semantically-meaningful object parts, but do not necessarily correspond to objects—as discussed.

Summary

That’s it! If you got this far, you learned about visual blind spots and (hopefully) found your own, which gives you a pretty good idea how much inpainting our brains do. This is similar to masking and then inpainting images, which leads to some state-of-the-art representation learning. You also know that semantically-meaningful masks lead to even stronger results than random masks, and you’ve seen a couple of ways to get such masks.

So is pixel-level reconstruction the right way to go if you want to get good representations? While we do not have a definitive answer, we show through ADIOS that reconstructions are not always necessary, and that the motivation behind reconstruction-based MIM models does extend to the reconstruction-free setting.

If you’re interested in more details behind ADIOS, have a look at the paper, and play with the code! Here are a few things you could try:

Figure out how to learn masks for MAE without processing the whole image with the encoder, and perhaps with higher granularity than afforded by masking individual patches.
Experiment with stronger inductive biases for the masking model like slot-attention or GENESIS.

Acknowledgements

Huge thanks to Yuge Shi for doing most of the work behind the ADIOS paper. I would also like to thank Yuge Shi, Sandy Huang, Fabian Fuchs, Klaus Greff, and Sjoerd van Steenkiste for proofreading and providing helpful suggestions for this blog.

Footnotes

MAE, BEiT, SemMAE as well as our paper ADIOS, which is discussed further below. ↩
“Variational Autoencoder with Arbitrary Conditioning” by Ivanov et al. was the first paper that got me thinking about image inpainting. ↩
Sajjadi et al. show that novel-view synthesis helps to learn object segmentation in an unsupervised setting. A long shot and a topic for another blog, but I wonder if the blind-spot inpainting in the brain could help with object perception. ↩
I know this may sound unscientific and overly hyped. It is. I like this rhetoric, though. ↩
If the masked patches are used as input, the model has to learn to ignore them. Since MAE masks 75% of the image, there is probably no benefit to representing which areas of the image are masked (they are represented implicitly, since there is no contribution from the masked patches). By asking the model to learn-to-ignore, we are wasting model capacity while also risking falling into a local minimum where the masked patches are not totally ignored. Note that in transformers we can hardcode to ignore masked patches while feeding them as input, but this is more computationally-expensive and requires changing the implementation; for a convnet this may be impossible. ↩
It is unclear what the impact of such scattered masks is. It might force the model to reason about multiple things in every image. It may also reduce the variance of the gradients because total occlusion of a certain object is less likely with such scattered masks. ↩
Sjoerd van Steenkiste pointed out that there may be no such thing as a pixel representing a relation, e.g., no group of pixels may represent “heavier than” or even “bigger than”. While I agree, I’d like to note that masking pixels can obscure such relations. In case of “bigger than”, a mask can occlude a part of an object making its size difficult to determine. This may be useful for representation learning. ↩ ↩²
The latter could be perhaps circumvented by editing ground-truth masks, e.g., taking a union of two object masks, diluting or eroding masks, etc. ↩
I usually use “correlated” to describe such pixels, but as helpfully pointed out by Klaus Greff, this is wrong, because it relates to particular pixel values and not to random variables as such. Instead of “correlated”, it is more accurate to say that such pixels have high pointwise mutual information. “Predictable” here is a shorthand. ↩
See that, according to above, the background behaves just like a big object behind the objects in the foreground. ↩
The caveat is that using such learned masks requires feeding the whole image into the encoder. This results in a significantly increased computation cost for MAE and might not be practical. ↩
While I don’t like creating acronyms, I find that the currently available options are somewhat lacking. All representation learning algorithms we care about are unsupervised (self-supervised is unsupervised). The ones that require image reconstruction (inpainting, e.g., MAE) use one encoder and one decoder. The ones that do not require reconstruction (e.g., SimCLR) use two encoders and no decoder. The latter were called contrastive (but some methods do not use negative examples) and later self-supervised learning (SSL; but this is too broad since MAE is also SSL). Hence, I adopt “reconstruction-free learning (RFL)” to distinguish these two paradigms. An alternative that focuses on architecture would be “Siamese-Style Learning”—maybe this is better because it uses the same acronym? ↩

Machine Learning of Sets

Wed, 12 Aug 2020 10:15:00 +0000

In machine learning, we typically work with input pairs (x, y), and we try to figure out how x and y depend on each other. To do so, we gather many such pairs and hope that the dependence will reveal itself if a) we have enough data, b) our model is expressive enough to approximate this dependency, and c) we get the hyperparameters right. In the simplest case, both x and y are just scalar values (or vectors \(\mathbf{x}, \mathbf{y}\)); for example, given some measurements of a plant’s shape, we might want to predict its species. The measurements here are real vectors \(\mathbf{x} \in \mathcal{X}\), where the input space \(\mathcal{X} = \mathbb{R}^d\) is usually Euclidean, and the species is a label \(\mathbf{y} \in \mathcal{Y}\) (usually an integer or a one-hot vector), but it is common for \(\mathbf{x}\) and \(\mathbf{y}\) to have more structure.

One of the main assumptions we rely on is that the pairs of (x, y) points are independent and identically distributed (i.i.d.) random variables. Let us unpack this a bit, starting from the end,

random variable: there exists some stochastic generative process from which the variables were randomly sampled,
identically: all samples come from the same probability distribution,
independent: the generative process has no memory of generated samples, and hence any generated sample does not change the distribution over future generated samples.

Any structure in \(\mathbf{x}, \mathbf{y}\), or both introduces constraints, and a successful application of an algorithm to a particular problem does heavily depend on whether or not this algorithm takes the relevant constraints into account. A common constraint in image-related problems is translation equivariance¹—the output of the algorithm should shift with any shifts applied to the image (you can read more about equvariances in this excellent blog post). In natural language-related problems, a typical constraint is causality: a token at position t can depend on any previous tokens at position 1:t-1, but it cannot depend on any future tokens².

In the above examples, the dependencies between points (e.g., autoregressive dependence in NLP) are clear from the context. However, if a data point is not a vector, matrix, or a sequence of vectors, but it is a set of vectors, these dependencies become less clear. In particular, elements in an input set resemble elements in a dataset (i.e., lack of order), but the critical difference is that they are not independent, therefore breaking the i.i.d. assumption. Accounting for this specific structure in inputs or outputs of an ML model leads to a family of set learning problems, which have recently gained considerable attention in the machine learning community. I thought it would be useful to delve into the machine learning of sets. In the following, we will consider set-to-vector, vector-to-set, and set-to-set problems and provide implementations of simple algorithms in JAX and haiku.

First some imports:

import jax
import jax.numpy as jnp
import haiku as hk

Notation

Before we start, it is useful to introduce some notation. Let \(\mathbf{x} \in \mathbb{R}^d\) be an input vector, \(\mathbf{y} \in \mathbb{R}^k\) the output vector, and let \(X = \{\mathbf{x}_i\}_{i=1}^M\) and \(Y = \{\mathbf{y}_j\}_{j=1}^N\) be sets of \(M\) and \(N\) elements, respectively. Note that, until now, \(y\) or \(\mathbf{y}\) were simply labels. From now on, however, \(\mathbf{x}\) and \(\mathbf{y}\) can live in the same space, and simply be elements of different sets. I will also use \(\mathcal{L}(X, Y)\) as a loss function operating on two sets, and \(l(\mathbf{x}, \mathbf{y})\) will be a loss function for pairs of elements.

Set To Vector

This is perhaps the simplest set-learning problem since it only requires permutation invariance. A function \(f\) is invariant to permutations \(\pi\) if \(\forall \pi\): \(f(X) = f(\pi X)\). Permutation invariance has always been known in machine learning, as loss functions we use almost never³ depend on the ordering of elements in our datasets or minibatches. This is not for the lack of order: to create a minibatch, we stack multiple data elements in an array; this pairs every element in the minibatch with its minibatch index, therefore implicitly creating an order. Loss functions tend to discard information about the order, usually by taking the mean over data examples. We can create permutation-invariant functions by following a similar logic.

Examples in a minibatch are processed independently (which reflects their i.i.d. nature), but if each entry in the minibatch contains more than just a single data point (many pixels in an image, points in a point cloud, tokens in a language sentence), then flattening these points into a vector and feeding it into an MLP or a CNN results in different parameters being used for processing different data points, and hence order is used implicitly; feeding the points into an RNN reuses parameters, but introduces an explicit dependence on the order.

A straightforward solution to this issue is to treat points in a single example in the same way we treat examples in the minibatch: treat them independently. This approach, followed by a permutation-invariant pooling operation such as max or mean pooling, is explored in Zaheer et al., “Deep Sets”, NeurIPS 2017 and is proven to be a universal set-function approximator⁴.

class DeepSet(hk.Module):

  def __init__(self, encoder, decoder):
    super().__init__()
    self._encoder = encoder
    self._decoder = decoder
    
  def __call__(self, x):
    """Compute the DeepSet embedding.

    Args:
      x: Tensor of shape [batch_size, n_elems, n_dim].
    """
    return self._decoder(self._encoder(x).mean(1))

While newer approaches with better empirical performance exist, they all draw from the Deep Sets framework⁵. Another factor contributing to the fact that the set-to-vector problem is quite easy is that pooling operations naturally work with variable-sized sets–there is nothing extra we have to do to handle sets of variable cardinality. This is not the case in the following two problems, where we have to take the set size into account explicitly.

Vector To Set

In vector-to-set, the task is to generate a set of real vectors from some (usually vector-valued) conditioning.

The majority of approaches out there focus on generating ordered sequences instead of unordered sets, and usually of fixed or at least known size. This allows using MLPs⁶ and RNNs⁷ to predict fixed- and variable-length sets, respectively, but at the price of having to learn permutation-equivariance from data. Learning permutation-equivariance can be induced by data augmentation. It is easy to generate different permutations, but usually comes at a decreased performance and/or longer training times compared to truly permutation-equivariant methods⁸.

  def set_mlp(conditioning, decoder, n_elements):
    """Predicts a set.

    Args:
      conditioning: tensor of shape [batch_size, n_dim].
      decoder: callable, e.g. an MLP.
      n_elements: int.
    """
    z = decoder(conditioning)
    batch_size = conditioning.shape[0]
    # all we can do here is reshape!
    return z.reshape(batch_size, n_elements, -1)

  def set_rnn(conditioning, state, rnn, n_elements):
    """Predicts a set.

    Args:
      conditioning: tensor of shape [batch_size, n_dim].
      state: initial state for the rnn.
      rnn: rnn core.
      n_elements: int.
    """
    zs = []
    for _ in range(n_elements):
      z, state = rnn(conditioning, state)
      zs.append(z[:, None])  # add an axis

    return jnp.concatenate(zs, 1)

Permutation-Invariant Loss Functions

Learning to generate sets based on some conditioning typically requires scoring that set against the conditioning. If we have ground-truth sets at our disposal, we can compare the generated sets against the ground-truth ones for the same conditioning. This can take the form of supervised learning (think of detecting objects in an image, where we need to generate a set of bounding boxes) or unsupervised learning (autoencoding point-clouds, say). Since we generally have no guarantee that the generated sets will obey any ordering (why should they?), we have to apply losses invariant to that ordering. We have two options here:

We can find an optimal matching between two sets⁹, which comes down to finding a permutation \(\pi\) of one of the sets that minimizes the computed loss, that is: \(\pi^\star = \arg \min_\pi \mathcal{L}( \pi X, Y)\), with \(\mathcal{L}( \pi X, Y) = \sum_i l(\mathbf{x}_{\pi(i)}, \mathbf{y}_i)\). This can be done exactly using the cubic Hungarian matching algorithm, or approximately using e.g. optimal-transport- or message-passing-based algorithms.
Instead of finding a matching, we can find a lower bound on what the matched loss would be. A popular choice here is the Chamfer loss¹⁰, which computes \(\sum_{x \in X} \min_{y \in Y} l(x, y) + \sum_{y \in Y} \min_{x \in X} l(x, y)\). For every element in one set, it finds the element in the other set that results in the lowest pairwise loss. This loss does not work for multisets as elements can be repeated.

If we do not have ground-truth for each conditioning (we have just sets), or if we have many possible sets for each conditioning (e.g., a group of possible sets for one of a few labels), we can instead learn by matching distributions e.g., in the GAN setting. If we take this approach, we have two problems, really: that of vector-to-set for the generator and set-to-vector for the discriminator. Fortunately, we know how to solve the set-to-vector problem with a permutation-invariant neural net, and shortly I am going to describe some permutation-equivariant methods for generation. This is precisely what we recently explored in Stelzner et al., “Generative Adversarial Set Transformers”, ICML 2020 Object-Oriented Learning Workshop.

Coincidentally, sometimes we have to deal with a set of latent variables inside a model. For example in Attend-Infer-Repeat (AIR, paper, blog), a set of object-centered latent variables was used to render an image. We did not need to worry about permutations of these variables, though, since the rendering process was permutation-invariant, and any loss applied to the final image carried over to the latent variables in a permutation-invariant way, too!

Gradient Descent to the Rescue!

Until recently, there was no accepted method able to predict variable-sized sets in a permutation-equivariant manner. For completness, note that a function g is equivariant to permutations \(\pi\) if \(\forall \pi\): \(\pi g(X) = g(\pi X)\). Zhang et al., “Deep Set Prediction Networks”, NeurIPS 2019 used the well-known (but still pretty cool!) observation that the gradient of a permutation-invariant function (such as the DeepSet embedding) is permutation equivariant to the input set¹¹. Their introduced model, DSPN, uses a fixed initial set adapted via a nested loop of gradient-descent on a learned loss function. This loss function compares the currently-generated set and the conditioning, telling us how well the current set and the conditioning match. DSPN achieved quite good results on point-cloud generation (but only MNIST) and showed proof-of-concept results to object detection in images.

class DeepSetPredictionNetwork(hk.Module):

  def __init__(self, set_encoder, max_n_points, n_dim,
              n_updates=5, step_size=1., repr_loss_func):
    """Builds the module.

    Args:
      set_encoder: An encoder for sets, e.g. a DeepSet.
      max_n_points: an integer.
      n_dim: dimensionality of the set elements.
      n_updates: The number of gradient updates applied to the initial set.
      step_size: Learning rate for the inner gradient descent loop.
      repr_loss_func: A loss function used to compare the embedding of a
        generated set and an embedding of the conditioning, e.g. squared-error.
    """
    
    super().__init__()
    self._set_encoder = set_encoder
    self._max_n_points = max_n_points
    self._n_dim = n_dim
    self._n_updates = n_updates
    self._step_size = step_size

    self._clip_pres = lambda x: jnp.clip(x, 0., 1.)

    def repr_loss(inputs, target):
      h = self._set_encoder(*inputs)
      # We take a mean over the number of points.
      return repr_loss_func(h, target).mean(1).sum()

    self._repr_loss_grad = hk.grad(repr_loss)

  def __call__(self, z):     
    # create the initial set and presence variables
    current_set = hk.get_parameter('init_set',
                            shape=(self._max_n_points, self._n_dim),
                            init=hk.initializers.RandomUniform(0., 1.)
    )
    
    current_pres = self._clip_pres(hk.get_parameter('init_pres',
                                    shape=(self._max_n_points, 1),
                                    init=hk.initializers.Constant(.5),
    ))

    # DSPN returns the starting set/pres and apparently puts loss on it.
    all_sets, all_pres = [current_set], [current_pres]
    for _ in range(self._n_updates):
      set_grad, pres_grad = self._repr_loss_grad((current_set, current_pres), z)

      current_set = current_set - self._step_size * set_grad
      current_pres = current_pres - self._step_size * pres_grad
      # We need to make sure that the presence is valid after each update.
      current_pres = self._clip_pres(current_pres)

      all_sets.append(current_set)
      all_pres.append(current_pres)

    return all_sets, all_pres

Fig. 1: DSPN iteratively transforms an initial set (left) into the final prediction (2nd from the right) by gradient descent.

While a cool idea, the gradient iteration learned by DSPN is a flow field (see Fig. 1), and it necessarily requires many iterations to reach the final prediction. Instead, we can learn a permutation-equivariant operator that directly outputs the required set.

Attention is All You Need, Really

Not too long ago, Vaswani et al. showed that we could replace RNNs with attention, causal masking, and position embeddings. It turns out that discarding causal masking and position embeddings leads to self-attention that is permutation-equivariant, as explored in Lee et al., “Set Transformer”, ICML 2019. If this is the case, can we build a model similar to DSPN, but with a transformer instead of the inner gradient-descent inner loop? Of course, we can! There are several advantages:

The initial set can be higher-dimensional (in DSPN, it has to be the same dimensionality as the output set), leading to more degrees of freedom.
Transformer layers can operate on the set of different dimensionality, and they do not have to project it to the output dimensionality between layers. This might seem trivial, but it relaxes the flow-field constraint, and in practice, creates transformations that can hold on to some additional state, akin to RNNs.
DSPN captures dependencies between individual points only via a pooling operation in its DeepSet encoder. Transformers are all about relational reasoning, and can directly use interdependencies between points to generate the final set.

Fig. 2: TSPN uses a Transformer to directly transform a random point cloud.

We explored this idea in two recent papers; both published at the ICML 2020 Object-Oriented Learning workshop,

Kosiorek, Kim, and Rezende, “Conditional Set Generation with Transformers”, where we introduce the Transformer Set Prediction Network (TSPN). TSPN uses an MLP to predict the required number of points from a conditioning, samples the required number of points from a base distribution, and transforms them using a Transformer, see Fig. 2 for an overview.
Stelzner, Kersting, and Kosiorek, “Generative Adversarial Set Transformers” introduces GAST: a similar idea, where a number of points from a base distribution are conditionally-transformed (based on a global noise vector) using a Transormer. We then use a Set Transformer to discriminate between the generated and real sets.

The same idea was concurrently explored by at least two other groups¹². While details differ, the main finding is that an initial set (randomly-sampled or deterministic and learned) passed through several layers of attention leads to state-of-the-art set generation. The general architecture is as follows:

Some (big) neural net encoder for processing the conditioning, e.g., a ResNet for images.
The encoder produces some key-and-value vectors.
We take either a deterministic or randomly-sampled set of queries and attend over the key-and-value pairs.
The result might be post-processed by self-attention and/or point-wise MLPs.
We apply a permutation-invariant loss function, one of the described above. Hungarian matching seems to give the best results.

Fig. 3: Slot Attention induces competition between queries, leading to SOTA unsupervised object segmentation.

The results of Carion et al.’s DETR model are particularly impressive. While it still required quite a bit of engineering, this pure set-prediction approach achieves state-of-the-art on large-scale object detection on COCO! Locatello et al. show that the particular form of attention required might depend on the task; in their experiments, they normalize attention across the query axis (instead of the key axis), which leads to competition between queries, and provides superior results for unsupervised object segmentation (Fig. 3).

What about those Point Processes??!!

While the above approaches definitely work for generating sets, they make no use of the well-known area of statistics concerned with modeling sets: point processes! Point processes treat the set size \(k \in \mathbb{N}_+\) as a random variable and model it jointly with the set membership \(X \in \mathcal{X}^k\), thus modeling the joint density \(p(X, k)\). This is in contrast to some of the previously-described methods; e.g., DSPN uses heuristics to determine the set size, which does or does not work depending on which loss function it is used with (see our TSPN paper for details). Our TSPN is not much better in that regard, and casts determining the set size as a classification problem–this works quite well in practice, but it cannot generalize to set sizes not seen in training. While a detailed description of point process would take too much space to fit in this blog, I would like to highlight one notion, which I learned about from an excellent paper by Vu et al. called “Model-Based Multiple Instance Learning”.

Let \(f_k(X) = f_k(x_1, ..., x_i, ..., x_k)\) be a probability density function defined over sets of \(k\) elements, and let this density be invariant to ordering of the elements of the set, that is \(\forall \pi\): \(f(X) = f(\pi X)\). It turns out that we can use this density to compare sets of the same cardinality with each other in terms of how probable they are (i.e., how high their likelihood is), but, even if we have two such functions for sets of cardinality \(k\) and \(m\), we simply cannot use them to compare sets of those different cardinalities. Why is that? Well, comparing sets of two and sets of three elements is a bit like comparing square meters m\(^2\) and cubic meters m\(^3\), or like comparing apples and oranges. It is not that we cannot compare sets of different cardinality, but we have to first bring them into the same space, which in this case is dimension-less. To do that, we have to account for a) the number of possible permutations of each set, and b) the unit volume (in case of metric space and comparing m\(^2\) and m\(^3\), we need to figure out how big a meter m\(^1\) is). This leads to the following definition of the probability density function of a set of size \(k\),

\[p(\{x_1, ..., x_k\}) = p(X, k) = p_c(k)k!U^k f_k(x_1, ..., x_k)\,,\]

where \(p_c(k)\) is the probability mass function of the set size, \(k!\) accounts for all possible permutations of set elements \(\mathbf{x}_i\), \(U \in \mathbb{R}_+\) is the unit volume expressed as a scalar value, and \(f_k\) is the permutation-invariant density of a set of size k. Interestingly, none of the above set-generation papers take the point-process theory into account when defining their likelihoods over sets. I would be curious to see if it improves results, as Vu et al. suggest.

Set To Set

Given the knowledge of how to solve set-to-vector and vector-to-set problems, it should be quite clear how to solve a set-to-set problem: we can encode a set into a vector, and then decode that vector into a set using one of the above vector-to-set methods. While correct, this approach forces us to use a bottleneck in the shape of a single vector. Perhaps a better option is to encode a set to an intermediate set, possibly of smaller cardinality, and use that smaller set as conditioning when generating the output set. There are many methods of how this can be done, and I will only mention that we explored some such problems in Lee et al., “Set Transformer”, ICML 2019 and encourage curious readers to look at the paper.

Outlook and Conclusion

Thank you for reaching this far! We have covered some basics of set-oriented machine learning by taking a look at set-to-vector, vector-to-set, and set-to-set problems and some approaches to solving them. I find this area of ML incredibly interesting, for the variety of things that we consider in life as sets is endless. At the same time, the set-learning models tend to be both theoretically- and architecturally- interesting. Moving forward, I would like to see more models directly based on the point-process theory. Another area that I have not mentioned, and one that is extremely applicable, is that of normalizing flows. You can read about the basics of normalizing flows in my previous blog post, but in short, they are used to transform a simple probability distribution into a more complicated one. As such, there is nothing preventing us from using flows to transform a distribution over independent variables into a joint distribution over sets. While there are some papers that use this idea¹³ to define permutation-invariant likelihoods, none of them uses point-process theory. I will leave working out how to combine flows and point processes as an exercise to the reader, and I will be looking out for papers doing that :)

Stacked Capsule Autoencoders

Sun, 23 Jun 2019 16:15:00 +0000

Objects play a central role in computer vision and, increasingly, machine learning research. With many applications depending on object detection in images and videos, the demand for accurate and efficient algorithms is high. More generally, knowing about objects is essential for understanding and interacting with our environments. Usually, object detection is posed as a supervised learning problem, and modern approaches typically involve training a CNN to predict the likelihood of whether an object exists at a given image location (and maybe the corresponding class), see e.g. here.

While modern methods can achieve human-like performance in object recognition, they need to consume staggering amounts of data to do so. This is in stark contrast to mammals, who learn to recognize and localize objects with no supervision. It is difficult to say what exactly makes mammals so good at learning, but we can imagine that self-supervision¹ and inductive biases present in their sophisticated computing hardware (or rather ‘wetware’; that is brains) both play a huge role. These intuitions have led us to develop an unsupervised version of capsule networks, see Figure 1 for an overview, whose inductive biases give rise to object-centric latent representations, which are learned in a self-supervised way—simply by reconstructing input images. Clustering learned representations was enough to allow us to achieve unsupervised state-of-the-art classification performance on MNIST (98.5%) and SVHN (55%). In the remainder of this blog, I will try to explain what those inductive biases are, how they are implemented and what kind of things are possible with this new capsule architecture. I will also try to explain how this new version differs from previous versions of capsule networks.

Fig 1: The Stacked Capsule Autoencoder (SCAE) is composed of a Part Capsule Autoencoder (PCAE) followed by an Object Capsule Autoencoder (OCAE). It can decompose an image into its parts and group parts into objects.

Why do we care about equivariances?

I think it is fair to say that deep learning would not be so popular if not for CNNs and the 2012 AlexNet paper. CNNs learn faster than non-convolutional image models due to (1) local connectivity and (2) parameter sharing across spatial locations. The former restricts what can be learned, but is sufficient to learn correlations between nearby pixels, which turns out to be important for images. The latter makes learning easier since parameter updates benefit from more signal. It also results in translation equivariance, which means that, when the input to a CNN is shifted, the output is shifted by an equal amount, while remaining unchanged otherwise. Formally, a function \(f(\mathbf{x})\) is equivariant to any transformation \(T \in \mathcal{T}\) if \(\forall_{T \in \mathcal{T}} Tf(\mathbf{x}) = f(T\mathbf{x})\). That is, applying any transformation to the input of the function has the same effect as applying that transformation to the output of the function. Invariance is a related notion, and the function \(f\) is invariant if \(\forall_{T \in \mathcal{T}} f(\mathbf{x}) = f(T\mathbf{x})\)—applying transformations to the input does not change the output.

Being equivariant helps with learning and generalization—for example, a model does not have to see the object placed at every possible spatial location in order to learn how to classify it. For this reason, it would be great to have neural nets that are equivariant to other affine degrees of freedom like rotation, scale, and shear, but this is not very easy to achieve, see e.g. group equivariant conv nets.

Equivariance to different transformations can be learned approximately, but it requires vast data augmentation and a considerably higher training cost. Augmenting data with random crops or shifts helps even with training translation-equivariant CNNs since these are typically followed by fully-connected layers, which have to learn to handle different positions. Other affine transformations² are not easy to augment with, as they would require access to full three-dimensional scene models. Even if scene models were available, we would need to augment data with combinations of different transformations, which would result in an absolutely enormous dataset. The problem is exacerbated by the fact that objects are often composed of parts, and it would be best to capture all possible configurations of object parts. Spatial Transformer Networks and this followup work provide one way of learning affine equivariances but do not address the fact that objects can undergo local transformations.

Capsules learn equivariant object representations

Fig 2: Capsule networks work by inferring parts & their poses from an image, and then using parts and poses to reason about objects.

Instead of building models that are globally equivariant to affine transformations, we can rely on the fact that scenes often contain many complicated objects, which in turn are composed of simpler parts. By definition, parts exhibit less variety in their appearance and shape than full objects, and consequently, they should be easier to learn from raw pixels. Objects can then be recognized from parts and their poses, given that we can learn how parts come together to form different objects, as in Figure 2. The caveat here is that we still need to learn a part detector, and it needs to predict part poses (i.e. translation, rotation, and scale) too. We hypothesize that this should be much simpler than learning an end-to-end object detector with similar capabilities.

Since poses of any entities present in a scene change with the location of an observer (or rather the chosen coordinate system), then a detector that can correctly identify poses of parts produces a viewpoint-equivariant part representation. Since object-part relationships do not depend on the particular vantage point, they are viewpoint-invariant. These two properties, taken together, result in viewpoint-equivariant object representations.

The issue with the above is that the corresponding inference process, that is, using previously discovered parts to infer objects, is difficult since every part can belong to at most one object. Previous versions of capsules solved this by iteratively refining the assignment of parts to objects (also known as routing). This proved to be inefficient in terms of both computation and memory and made it impossible to scale to bigger images. See e.g. here for an overview of previous capsule networks.

Can an arbitrary neural net learn capsule-like representations?

Original capsules are a type of a feed-forward neural network with a specific structure and are trained for classification. Incidentally, we know that classification corresponds to inference, which is the inverse process of generation, and as such is more difficult. To see this, think about Bayes’ rule: this is why posterior distributions are often much more complicated than the prior or likelihood terms³.

Instead, we can use the principles introduced by capsules to build a generative model (decoder) and a corresponding inference network (encoder). Generation is simpler, since we can have any object generate arbitrarily many parts, and we do not have to deal with constraints encountered in inference. Complicated inference can then be left to your favorite neural network, which is going to learn appropriate latent representations. Since the decoder uses capsules machinery, it is viewpoint equivariant by design. It follows that the encoder has to learn representations that are also viewpoint equivariant, at least approximately.

A potential disadvantage is that, even though the latent code might be viewpoint equivariant, in the sense that it explicitly encodes object coordinates, the encoder itself need not be viewpoint equivariant. This means that, if the model sees an object from a very different point of view than it is used to, it may fail to recognize the object. Interestingly, this seems to be in line with human perception, as noted recently by Geoff Hinton in his Turing Award lecture, where he uses a thought experiment to illustrate this. If you are interested, you can watch the below video for about 2.5 minutes.

Here is a simplified version of the example in the video, see Figure 3: imagine a square, and tilt it by 45 degrees, look away, and look at it again. Can you see a square? Or does the shape resemble a diamond? Humans tend to impose coordinate frames on the objects they see, and the coordinate frame is one of the features that let us recognize the objects. If the coordinate frame is very different from the usual one, we may have problems recognizing the correct shape.

Fig 3: Is it a square, or is it a diamond?

Why bother?

I hope I have managed to convince you that learning capsule-like representations is possible. Why is it a good idea? While we still have to scale the method to complicated real-world imagery, the initial results are quite promising. It turns out that the object capsules can learn to specialize to different types of objects. When we clustered the presence probabilities of object capsules we found that, to our surprise, representations of objects from the same class are grouped tightly together. Simply looking up the label that examples in a given cluster correspond to resulted in state-of-the-art unsupervised classification accuracy on two datasets: MNIST (98.5%) and SVHN (55%). We also took a model trained on MNIST and simulated unseen viewpoints by performing affine transformations of the digits, and also achieved state-of-the-art unsupervised performance (92.2%), which shows that learned representations are in fact robust to viewpoint changes. Results on Cifar10 are not quite as good, but still promising. In future work, we are going to explore more expressive approaches to image reconstruction, instead of using fixed templates, and hopefully, scale up to more complicated data.

Technical bits

This is the end of high-level intuitions, and we now proceed to some technical descriptions, albeit also high-level ones. This might be a good place to stop reading if you are not into that sort of thing.

In the following, I am going to describe the decoding stack of our model, and you can think of it as a (deterministic) generative model. Next, I will describe an inference network, which provides the capsule-like representations.

How can we turn object prototypes into an image?

Let us start by defining what a capsule is.

We define a capsule as a specialized part of a model that describes an abstract entity, e.g. a part or an object. In the following, we will have object capsules, which recognize objects from parts, and part capsules, which extract parts and poses from an input image. Let capsule activation be a group of variables output by a single capsule. To describe an object, we would like to know (1) whether it exists, (2) what it looks like and (3) where it is located⁴. Therefore, for an object capsule \(k\), its activations consist of (1) a presence probability \(a_k\), (2) a feature vector \(\mathbf{c_k}\), and (3) a \(3\times 3\) pose matrix \(OV_k\), which represents the geometrical relationship between the object and the viewer (or some central coordinate system). Similar features can be used to describe parts, but we will simplify the setup slightly and assume that parts have a fixed appearance and only their pose can vary. Therefore, for the \(m^\mathrm{th}\) part capsule, its activations consist of the probability \(d_m\) that the part exists, and a \(6\)-dimensional⁵ pose \(\mathbf{x}_m\), which represents the position and orientation of the given part in the image. As indicated by notation, there are several object capsules and potentially many part capsules.

Since every object can have several parts, we need a mechanism to turn an object capsule activation into several part capsule activations. To this end, for every object capsule, we learn a set of \(3\times 3\) transformation matrices \(OP_{k,m}\), representing the geometrical relationship between an object and its parts. These matrices are encouraged to be constant, although we do allow a weak dependence on the object features \(\mathbf{c}_k\) to account for small deformations. Since any part can belong to only one object, we gather predictions from all object capsules corresponding to the same part capsule and arrange them into a mixture. If the model is confident that a particular object should be responsible for a given part, then this will be reflected in the mixing probabilities of the mixture. In this case, sampling from the mixture will be similar to just taking argmax over the mixing proportions while also accounting for uncertainty in the assignment. Finally, we explain parts by independent Gaussian mixtures; this is a simplifying assumption saying that a choice of a parent for one part should not influence the choice of parents for other parts.

Fig 4: Left: learned parts, or templates. Center: a few affine-transformed parts; they do not comprise full objects. Right: MNIST digits assembled from the transformed parts.

Having generated part poses, we can take parts, apply affine transformations parametrized by the corresponding poses as in Spatial Transformer, and assemble the transformed parts into an image (Figure 4). But wait—we need to get the parts first! Since we assumed that parts have fixed appearance, we are going to learn a bank of fixed parts by gradient descent. Each part is like an image, just smaller, and can be seen as a “template”. To give an example, a good template for MNIST digits would be a stroke, like in the famous paper from Lake et. al. or the left-hand side of Figure 4.

Where do we get capsule parameters from?

Above, we define a generative process that can transform object and part capsule activations into images. But to obtain capsule activations describing a particular image, we need to run some sort of inference. In this case, we will just use neural networks to amortize inference, like in a VAE, (see this post for more details on VAEs). In other words, neural nets will predict capsule activations directly from the image. We will do this in two stages. Firstly, given the image, we will have a neural net predict pose parameters and presence probabilities for every part from our learnable bank of parts. Secondly, a separate neural net will look at the part parameters and will try to directly predict object capsule activations. These two stages correspond to two stages of the generative process we outlined; we can now pair each of the stages with the corresponding generative stage and arrive at two autoencoders. The first one, Part Capsule Autoencoder (PCAE), detects parts and recombines them into an image. The second one, Object Capsule Autoencoder (OCAE), organizes parts into objects. Below, we describe their architecture and some of our design choices.

Inferring parts and poses

Fig 5: The Part Capsule Autoencoder (PCAE) detects parts and their poses from the image and reconstructs the image by directly assembling it from affine-transformed parts.

The PCAE uses a CNN-based encoder, but with tweaks. Firstly, notice that for \(M\) parts, we need \(M \times (6 + 1)\) predicted parameters. That is, for every part we need \(6\) parameters of an affine transformation \(\mathbf{x}_m\) (we work in two dimensions) and a probability \(d_m\) of the part being present; we could also predict some additional parameters; in fact we do, but we omit the details here for clarity—see the paper for details. It turns out that using a fully-connected layer after a CNN does not work well here; see the paper for details. Instead, we project the outputs of the CNN to \(M \times (6 + 1 + 1)\) feature maps using \(1\times 1\) convolutions, where we added an extra feature map for each part capsule. This extra feature map will serve as an attention mask: we normalize it spatially via a softmax, multiply with the remaining 7 feature maps, and sum each dimension independently across spatial locations. This is similar to global-average pooling, but allows the model to focus on a specific location; we call it attention-based pooling.

We can then use part presence probabilities and part poses to select and affine-transform learned parts, and assemble them into an image. Every transformed part is treated as a spatial Gaussian mixture component, and we train PCAE by maximizing the log-likelihood under this mixture.

Organizing parts into objects

Fig 6: The Object Capsule Autoencoder (OCAE) tries to explain part poses as a sparse set of objects, where every present object predicts several parts. It automatically discovers structure in the data, whereby different object capsules specialise to different objects.

Knowing what parts there are in an image and where they are might be useful, but in the end, we care about the objects that they belong to. Previous capsules used an EM-based inference procedure to make parts vote for objects. This way, each part could start by initially disagreeing and voting on different objects, but eventually, the votes would converge to a set of only a few objects. We can also see inference as compression, where a potentially large set of parts is explained by a potentially very sparse set of objects. Therefore, we try to predict object capsule activations directly from the poses and presence probabilities of parts. The EM-based inference tried to cluster part votes around objects. We follow this intuition and use the Set Transformer with \(K\) outputs to encode part activations. Set Transformer has been shown to work well for amortized-clustering-type problems, and it is permutation invariant. Part capsule activations describe parts, not pixels, which can have arbitrary positions in the image, and in that sense have no order. Therefore, set-input neural networks seem to be a better choice than MLPs—a hypothesis corroborated by an ablation study we have in the paper.

Each output of the Set Transformer is fed into a separate MLP, which then outputs all activations for the corresponding object capsule. We also use a number of sparsity losses applied to the object presence probabilities; these are necessary to make object capsules specialize to different types of objects, please see the paper for details. The OCAE is trained by maximizing the likelihood of part capsule activations under a Gaussian mixture of predictions from object capsules, subject to sparsity constraints.

Summary

Fig 7: The Stacked Capsule Autoencoder (SCAE) is composed of a PCAE followed by an OCAE. It can decompose image into its parts and group parts into objects.

In summary, a Stacked Capsule Autoencoder is composed of:

the PCAE encoder: a CNN with attention-based pooling,
the OCAE encoder: a Set Transformer,
the OCAE decoder:
- \(K\) MLPs, one for every object capsule, which predicts capsule parameters from Set Transformer’s outputs,
- \(K \times M\) constant \(3 \times 3\) matrices representing constant object-part relationships,
and the PCAE decoder, which is just \(M\) constant part templates, one for each part capsule.

SCAE defines a new method for representation learning, where an arbitrary encoder learns viewpoint-equivariant representations by inferring parts and their poses and groups them into objects. This post provides motivation as well as high-level intuitions behind this idea, and an overview of the method. The major drawback of the method, as of now, is that the part decoder uses fixed templates, which are insufficient to model complicated real-world images. This is also an exciting avenue for future work, together with deeper hierarchies of capsules and extending capsule decoders to three-dimensional geometry. If you are interested in the details, I would encourage you to read the original paper: A. R. Kosiorek, S. Sabour, Y.W. Teh and G. E. Hinton, “Stacked Capsule Autoencoders”, arXiv 2019.

Forge, or how do you manage your machine learning experiments?

Wed, 28 Nov 2018 10:15:00 +0000

Every time I begin a machine learning (ML) project, I go through more or less the same steps. I start by quickly hacking a model prototype and a training script. After a few days, the codebase grows unruly and any modification is starting to take unreasonably long time due to badly-handled dependencies and the general lack of structure. At this point, I decide that some refactoring is needed: parts of the model are wrapped into separate, meaningful objects, and the training script gets somewhat general structure, with clearly delineated sections. Further down the line, I am often faced with the need of supporting multiple datasets and a variety of models, where the differences between model variants are much more than just hyperparameters - they often differ structurally and have different inputs or outputs. At this point, I start copying training scripts to support model variants. It is straightforward to set up, but maintenance becomes a nightmare: with copies of the code living in separate files, any modification has to be applied to all the files.

For me, it is often unclear how to handle this last bit cleanly. It can be project-dependent. It is often easy to come up with simple hacks, but they do not generalise and can make code very messy very quickly. Given that most experiments look similar among the projects I have worked on, there should exist a general solution. Let’s have a look at the structure of a typical experiment:

You specify the data and corresponding hyperparameters.
You define the model and its hyperparameters.
You run the training script and (hopefully) save model checkpoints and logs during training.
Once the training has converged, you might want to load a model checkpoint in another script or a notebook for thorough evaluation or deploy the model.

In most projects I have seen, 1. and 2. were split between the training script (dataset and model classes or functions) and external configs (hyperparameters as command-line arguments or config files). Logging and saving checkpoints should be a part of every training script, and yet it can be time-consuming to set up correctly. As far as I know, there is no general mechanism to do 4., and it is typically handled by retrieving hyperparameters used in a specific experiment and using the dataset/model classes/functions directly to instantiate them in a script or a notebook.

If this indeed is a general structure of an experiment, then there should exist tools to facilitate it. I am not familiar with any, however. Please let me know if such tools exist, or if the structure outlined above does not generally hold. Sacred and artemis are great for managing configuration files and experimental results; you can retrieve configuration of an experiment, but if you want to load a saved model in a notebook, for example, you need to know how to instantiate the model using the config. I prefer to automate this, too. When it comes to tensorflow, there is keras and the estimator api that simplify model building, fitting and evaluation. While generally useful, they are rather heavy and make access to low-level model features difficult. Their lack of flexibility is a no-go for me since I often work on non-standard models and require access to the most private of their parts.

All this suggests that we could benefit from a lightweight experimental framework for managing ML experiments. For me, it would be ideal if it satisfied the following requirements.

It should require minimal setup.
It has to be compatible with tensorflow (my primary tool for ML these days).
Ideally, it should be usable with non-tensorflow models - software evolves quickly, and my next project might be in pytorch. Who knows?
Datasets and models should be specified and configured separately so that they can be mixed and matched later on.
Hyerparameters and config files should be stored for every experiment, and it would be great if we could browse them quickly, without using non-standard apps to do so (so no databases).
Loading a trained model should be possible with minimum overhead, ideally without touching the original model-building code. Pointing at a specific experiment should be enough.

As far as I know, such a framework does not exist. So how do I go about it? Since I started my master thesis at BRML, I have been developing tools, including parts of an experimental framework, that meet some of the above requirements. However, for every new project I started, I would copy parts of the code responsible for running experiments from the previous project. After doing that for five different projects (from HART to SQAIR), I’ve had enough. When I was about to start a new project last week, I’ve taken all the experiment-running code, made it project-agnostic, and put it into a separate repo, wrote some docs, and gave it a name. Lo and behold: Forge.

Forge

While it is very much work in progress, I would like to show you how to set up your project using forge. Who knows, maybe it can simplify your workflow, too?

Configs

Configs are perhaps the most useful component in forge. The idea is that we can specify an arbitrarily complicated config file as a python function, and then we can load it using forge.load(config_file, *args, **kwargs), where config_file is a path on your filesystem. The convention is that the config file should define load function with the following signature: load(config, *args, **kwargs). The arguments and kw-args passed to forge.load are automatically forwarded to the load function in the config file. Why would you load a config by giving its file path? To make code maintenance easier! Once you write config-loading code in your training/experimentation scripts, it is best not to touch it anymore. But how do you swap config files? Without touching the training script: If we specify file paths as command-line arguments, then we can do it easily. Here’s an example. Suppose that our data config file data_config.py is the following:

from tensorflow.examples.tutorials.mnist import input_data

def load(config):

  # The `config` argument is here unused, but you can treat it
  # as a dict of keys and values accessible as attributes - it acts
  # like an AttrDict

  dataset = input_data.read_data_sets('.')  # download MNIST
  # to the current working dir and load it
  return dataset

Our model file defines a simple one-layer fully-connected neural net, classification loss and some metrics in model_config.py. It can read as follows.

import sonnet as snt
import tensorflow as tf

from forge import flags

flags.DEFINE_integer('n_hidden', 128, 'Number of hidden units.')

def process_dataset(x):
  pass
  # this function should return a minibatch, somehow

def load(config, dataset):

    imgs, labels = process_dataset(dataset)

    imgs = snt.BatchFlatten()(imgs)
    mlp = snt.nets.MLP([config.n_hidden, 10])
    logits = mlp(imgs)
    labels = tf.cast(labels, tf.int32)

    # softmax cross-entropy
    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels))

    # predicted class and accuracy
    pred_class = tf.argmax(logits, -1)
    acc = tf.reduce_mean(tf.to_float(tf.equal(tf.to_int32(pred_class), labels)))

    # put here everything that you might want to use later
    # for example when you load the model in a jupyter notebook
    artefacts = {
        'mlp': mlp,
        'logits': logits,
        'loss': loss,
        'pred_class': pred_class,
        'accuracy': acc
    }

    # put here everything that you'd like to be reported every N training iterations
    # as tensorboard logs AND on the command line
    stats = {'crossentropy': loss, 'accuracy': acc}

    # loss will be minimised with respect to the model parameters
    return loss, stats, artefacts

Now we can write a simple script called experiment.py that loads some data and model config files and does useful things with them.

from os import path as osp

import tensorflow as tf

import forge
from forge import flags


# job config
flags.DEFINE_string('data_config', 'data_config.py', 'Path to a data config file.')
flags.DEFINE_string('model_config', 'model_config.py', 'Path to a model config file.')
flags.DEFINE_integer('batch_size', 32, 'Minibatch size used for training.')


config = forge.config()  # parse command-line flags
dataset = forge.load(config.data_config, config)
loss, stats, stuff = forge.load(config.model_config, config, dataset)

# ...
# do useful stuff

Here’s the best part. You can just run python experiment.py to run the script with the config files given above. But if you would like to run a different config, you can execute python experiment.py --data_config some/config/file/path.py without touching experimental code. All this is very lightweight, as config files can return anything and take any arguments you find necessary.

Smart checkpoints

Given that we have very general and flexible config files, it should be possible to abstract away model loading. It would be great, for instance, if we could load a trained model snapshot without pointing to the config files (or model-building code, generally speaking) used to train the model. We can do it by storing config files with model snapshots. It can significantly simplify model evaluation and deployment and increase reproducibility of our experiments. How do we do it? This feature requires a bit more setup than just using config files, but bear with me - it might be even more useful.

The smart checkpoint framework depends on the following folder structure.

results_dir
  |run_name
    |1
    |2
    |...
    |<integer>  # number of the current run

results_dir is the top-level directory containing potentially many experiment-specific folders, where every experiment has a separate folder denoted by run_name. We might want to re-run a specific experiment, and for this reason, every time we run it, forge creates a folder, whose name is an integral number - the number of this run. It starts at one and gets incremented every time we start a new run of the same experiment. Instead of starting a new run, we can also resume the last one by passing a flag. In this case, we do not create a new folder for it, but use the highest-numbered folder and load the latest model snapshot.

First, we need to import forge.experiment_tools and define the following flags.

from os import path as osp
from forge import experiment_tools as fet

flags.DEFINE_string('results_dir', '../checkpoints', 'Top directory for all experimental results.')
flags.DEFINE_string('run_name', 'test_run', 'Name of this job. Results will be stored in a corresponding folder.')
flags.DEFINE_boolean('resume', False, 'Tries to resume a job if True.')

We can then parse the flags and initialise our checkpoint.

config = forge.config()  # parse flags

# initialize smart checkpoint
logdir = osp.join(config.results_dir, config.run_name)
logdir, resume_checkpoint = fet.init_checkpoint(logdir, config.data_config, config.model_config, config.resume)

fet.init_checkpoint does a few useful things:

Creates the directory structure mentioned above.
Copies the data and model config files to the checkpoint folder.
Stores all configuration flags and the hash of the current git commit (if we’re in a git repo, very useful for reproducibility) in flags.json, or restores flags if restore was True.
Figures out whether there exists a model snapshot file that should be loaded.

logdir is the path to our checkpoint folder and evaluates to results_dir/run_name/<integer>. resume_checkpoint is a path to a checkpoint if resume was True, typically results_dir/run_name/<integer>/model.ckpt-<maximum global step>, or None otherwise.

Now we need to use logdir and resume_checkpoint to store any logs and model snapshots. For example:

...  # load data/model and do other setup

# Try to restore the model from a checkpoint
saver = tf.train.Saver(max_to_keep=10000)
if resume_checkpoint is not None:
    print "Restoring checkpoint from '{}'".format(resume_checkpoint)
    saver.restore(sess, resume_checkpoint)

...
  # somewhere inside the train loop
  saver.save(sess, checkpoint_name, global_step=train_itr)
...

If we want to load our model snapshot in another script, eval.py, say, we can do so in a very straightforward manner.

import tensorflow as tf
from forge import load_from_checkpoint

checkpoint_dir = '../checkpoints/mnist/1'  
checkpoint_iter = int(1e4)

# `data` contains any outputs of the data config file
# `model` contains any outputs of the model config file
data, model, restore_func = load_from_checkpoint(checkpoint_dir, checkpoint_iter)

# Calling `restore_func` restores all model parameters
sess = tf.Session()
restore_func(sess)

...  # do exciting stuff with the model

Working Example

Code for forge is available at github.com/akosiorek/forge and a working example is described in the README.

Closing thoughts

Even though experimental code exhibits very similar structure among experiments, there seem to be no tools to streamline the experimentation process. This requires ML practitioners to write thousands of lines of boilerplate code, contributes to many errors and generally slows down research progress. forge is my attempt at introducing some good practices as well as simplifying the process. Hope you can take something from it for your own purposes.

Acknowledgements

I would like to thank Alex Bewley for inspiration, Adam Goliński for discussions about software engineering in ML and Martin Engelcke for his feedback on forge.

Normalizing Flows

Tue, 03 Apr 2018 09:43:00 +0000

Machine learning is all about probability. To train a model, we typically tune its parameters to maximise the probability of the training dataset under the model. To do so, we have to assume some probability distribution as the output of our model. The two distributions most commonly used are Categorical for classification and Gaussian for regression. The latter case can be problematic, as the true probability density function (pdf) of real data is often far from Gaussian. If we use the Gaussian as likelihood for image-generation models, we end up with blurry reconstructions. We can circumvent this issue by adversarial training, which is an example of likelihood-free inference, but this approach has its own issues.

Gaussians are also used, and often prove too simple, as the pdf for latent variables in Variational Autoencoders (VAEs), which I describe in my previous post. Fortunately, we can often take a simple probability distribution, take a sample from it and then transform the sample. This is equivalent to change of variables in probability distributions and, if the transformation meets some mild conditions, can result in a very complex pdf of the transformed variable. Danilo Rezende formalised this in his paper on Normalizing Flows (NF), which I describe below. NFs are usually used to parametrise the approximate posterior \(q\) in VAEs but can also be applied for the likelihood function.

Change of Variables in Probability Distributions

We can transform a probability distribution using an invertible mapping (i.e. bijection). Let \(\mathbf{z} \in \mathbb{R}^d\) be a random variable and \(f: \mathbb{R}^d \mapsto \mathbb{R}^d\) an invertible smooth mapping. We can use \(f\) to transform \(\mathbf{z} \sim q(\mathbf{z})\). The resulting random variable \(\mathbf{y} = f(\mathbf{z})\) has the following probability distribution:

\[q_y(\mathbf{y}) = q(\mathbf{z}) \left| \mathrm{det} \frac{ \partial f^{-1} }{ \partial \mathbf{z}\ } \right| = q(\mathbf{z}) \left| \mathrm{det} \frac{ \partial f }{ \partial \mathbf{z}\ } \right| ^{-1}. \tag{1}\]

We can apply a series of mappings \(f_k\), \(k \in {1, \dots, K}\), with \(K \in \mathbb{N}_+\) and obtain a normalizing flow, first introduced in Variational Inference with Normalizing Flows,

\[\mathbf{z}_K = f_K \circ \dots \circ f_1 (\mathbf{z}_0), \quad \mathbf{z}_0 \sim q_0(\mathbf{z}_0), \tag{2}\] \[\mathbf{z}_K \sim q_K(\mathbf{z}_K) = q_0(\mathbf{z}_0) \prod_{k=1}^K \left| \mathrm{det} \frac{ \partial f_k }{ \partial \mathbf{z}_{k-1}\ } \right| ^{-1}. \tag{3}\]

This series of transformations can transform a simple probability distribution (e.g. Gaussian) into a complicated multi-modal one. To be of practical use, however, we can consider only transformations whose determinants of Jacobians are easy to compute. The original paper considered two simple family of transformations, named planar and radial flows.

Simple Flows

Planar Flow

\(f(\mathbf{z}) = \mathbf{z} + \mathbf{u} h(\mathbf{w}^T \mathbf{z} + b), \tag{4}\)

with \(\mathbf{u}, \mathbf{w} \in \mathbb{R}^d\) and \(b \in \mathbb{R}\) and \(h\) an element-wise non-linearity. Let \(\psi (\mathbf{z}) = h' (\mathbf{w}^T \mathbf{z} + b) \mathbf{w}\). The determinant can be easily computed as

\[\left| \mathrm{det} \frac{\partial f}{\partial \mathbf{z}} \right| = \left| 1 + \mathbf{u}^T \psi( \mathbf{z} ) \right|. \tag{5}\]

We can think of it as slicing the \(\mathbf{z}\)-space with straight lines (or hyperplanes), where each line contracts or expands the space around it, see figure 1.

Radial Flow

\[f(\mathbf{z}) = \mathbf{z} + \beta h(\alpha, r)(\mathbf{z} - \mathbf{z}_0), \tag{6}\]

with \(r = \Vert\mathbf{z} - \mathbf{z}_0\Vert_2\), \(h(\alpha, r) = \frac{1}{\alpha + r}\) and parameters \(\mathbf{z}_0 \in \mathbb{R}^d, \alpha \in \mathbb{R}_+\) and \(\beta \in \mathbb{R}\).

Similarly to planar flows, radial flows introduce spheres in the \(\mathbf{z}\)-space, which either contract or expand the space inside the sphere, see figure 1.

Discussion

These simple flows are useful only for low dimensional spaces, since each transformation affects only a small volume in the original space. As the volume of the space grows exponentially with the number of dimensions \(d\), we need a lot of layers in a high-dimensional space.

Another way to understand the need for many layers is to look at the form of the mappings. Each mapping behaves as a hidden layer of a neural network with one hidden unit and a skip connection. Since a single hidden unit is not very expressive, we need a lot of transformations. Recently introduced Sylvester Normalising Flows overcome the single-hidden-unit issue of these simple flows; for more details please read the paper.

Simple flows are useful for sampling, e.g. as parametrisation of \(q(\mathbf{z})\) in VAEs, but it is very difficult to evaluate probability of a data point that was not sampled from it. This is because the functions \(h\) in planar and radial flow are invertible only in some regions of the \(\mathbf{z}\)-space, and the functional form of their inverse is generally unknown. Please drop a comment if you have an idea how to fix that.

Fig 1. The effect of planar and radial flows on the Gaussian and uniform distributions. The figure comes from the original paper.

Autoregressive Flows

Enhancing expressivity of normalising flows is not easy, since we are constrained by functions, whose Jacobians are easy to compute. It turns out, though, that we can introduce dependencies between different dimensions of the latent variable, and still end up with a tractable Jacobian. Namely, if after a transformation, the dimension \(i\) of the resulting variable depends only on dimensions \(1:i\) of the input variable, then the Jacobian of this transformation is triangular. As we know, a determinant of a triangular matrix is equal to the product of the terms on the diagonal. More formally, let \(J \in \mathcal{R}^{d \times d}\) be the Jacobian of the mapping \(f\), then

\[y_i = f(\mathbf{z}_{1:i}), \qquad J = \frac{\partial \mathbf{y}}{\partial \mathbf{z}}, \tag{7}\] \[\det{J} = \prod_{i=1}^d J_{ii}. \tag{8}\]

I would like to draw your attention to three interesting flows that use the above observation, albeit in different ways, and arrive at mappings with very different properties.

Real Non-Volume Preserving Flows (R-NVP)

R-NVPs are arguably the least expressive but the most generally applicable of the three. Let \(1 < k < d\), \(\circ\) element-wise multiplication and \(\mu\) and \(\sigma\) two mappings \(\mathcal{R}^k \mapsto \mathcal{R}^{d-k}\) (Note that \(\sigma\) is not the sigmoid function). R-NVPs are defined as:

\[\mathbf{y}_{1:k} = \mathbf{z}_{1:k},\\ \mathbf{y}_{k+1:d} = \mathbf{z}_{k+1:d} \circ \sigma(\mathbf{z}_{1:k}) + \mu(\mathbf{z}_{1:k}). \tag{9}\]

It is an autoregressive transformation, although not as general as equation (7) allows. It copies the first \(k\) dimensions, while shifting and scaling all the remaining ones. The first part of the Jacobian (up to dimension \(k\)) is just an identity matrix, while the second part is lower-triangular with \(\sigma(\mathbf{z}_{1:k})\) on the diagonal. Hence, the determinant of the Jacobian is

\[\frac{\partial \mathbf{y}}{\partial \mathbf{z}} = \prod_{i=1}^{d-k} \sigma_i(\mathbf{z}_{1:k}). \tag{10}\]

R-NVPs are particularly attractive, because both sampling and evaluating probability of some external sample are very efficient. Computational complexity of both operations is, in fact, exactly the same. This allows to use R-NVPs as a parametrisation of an approximate posterior \(q\) in VAEs, but also as the output likelihood (in VAEs or general regression models). To see this, first note that we can compute all elements of \(\mu\) and \(\sigma\) in parallel, since all inputs (\(\mathbf{z}\)) are available. We can therefore compute \(\mathbf{y}\) in a single forward pass. Next, note that the inverse transformation has the following form, with all divisions done element-wise,

\[\mathbf{z}_{1:k} = \mathbf{y}_{1:k},\\ \mathbf{z}_{k+1:d} = (\mathbf{y}_{k+1:d} - \mu(\mathbf{y}_{1:k}))~/~\sigma(\mathbf{y}_{1:k}). \tag{11}\]

Note that \(\mu\) and \(\sigma\) are usually implemented as neural networks, which are generally not invertible. Thanks to equation (11), however, they do not have to be invertible for the whole R-NVP transformation to be invertible. The original paper applies several layers of this mapping. The authors also reverse the ordering of dimensions after every step. This way, variables that are just copied in one step, are transformed in the following step.

Autoregressive Transformation

We can be even more expressive than R-NVPs, but we pay a price. Here’s why.

Now, let \(\mathbf{\mu} \in \mathbb{R}^d\) and \(\mathbf{\sigma} \in \mathbb{R}^d_+\). We can introduce complex dependencies between dimensions of the random variable \(\mathbf{y} \in \mathbb{R}^d\) by specifying it in the following way.

\[y_1 = \mu_1 + \sigma_1 z_1 \tag{12}\] \[y_i = \mu (\mathbf{y}_{1:i-1}) + \sigma (\mathbf{y}_{1:i-1}) z_i \tag{13}\]

Since each dimension depends only on the previous dimensions, the Jacobian of this transformation is a lower-triangular matrix with \(\sigma (\mathbf{z}_{1:i-1})\) on the diagonal; the determinant is just a product of the terms on the diagonal. We might be able to sample \(\mathbf{z} \sim q(\mathbf{z})\) in parallel (if different dimensions are i.i.d.), but the transformation is inherently sequential. We need to compute all \(\mathbf{y}_{1:i-1}\) before computing \(\mathbf{y}_i\), which can be time consuming, and is therefore expensive to use as a parametrisation for the approximate posterior in VAEs.

This is an invertible transformation, and the inverse has the following form.

\[z_i = \frac{ y_i - \mu (\mathbf{y}_{1:i-1}) }{ \sigma (\mathbf{y}_{1:i-1}) } \tag{14}\]

Given vectors \(\mathbf{\mu}\) and \(\mathbf{\sigma}\), we can vectorise the inverse transformation, similar to equation (11), as

\[\mathbf{z} = \frac{ \mathbf{y} - \mathbf{\mu} (\mathbf{y}) }{ \mathbf{\sigma} (\mathbf{y}) }. \tag{15}\]

The Jacobian is again lower-triangular, with \(\frac{1}{\mathbf{\sigma}}\) on the diagonal and we can compute probability in a single pass.

The difference between the forward and the inverse transofrmations is that in the forward transformation, statistics used to transform every dimension depend on all the previously transformed dimensions. In the inverse transformation, the statistics used to invert \(\mathbf{y}\) (which is the input), depend only on that input, and not on any result of the inversion.

Masked Autoregressive Flow (MAF)

MAF directly uses equations (12) and (13) to transform as random variable. Since this transformation is inherently sequential, MAF is terribly slow when it comes to sampling. To evaluate the probability of a sample, however, we need the inverse mapping. MAF, which was designed for density estimation, can do that efficiently by using equation (15).

In principle, we could use it to parametrise the likelihood function (a.k.a. the decoder) in VAEs. Training would be fast, but, if the data dimensionality is high (e.g. images), generating new data would take very long. For a colour image of size \(300 \times 200\), we would need to perform \(300 \cdot 200 \cdot 3 = 1.8 \cdot 10^5\) sequential iterations of equation (13). This cannot be parallelised, and hence, we abandon the all powerful GPUs we otherwise use.

We could also use MAF as a prior \(p(\mathbf{z})\) in VAEs. Training requires only evaluation of a sample \(\mathbf{z} \sim q(\mathbf{z})\) under the prior \(p(\mathbf{z})\). The dimensionality \(d\) of the latent variable \(\mathbf{z}\) is typically much smaller than that of the output; often below \(1000\). Sampling can still be expensive, but at least doable.

What other applications would you use MAF in? Please write a comment if anything comes to mind.

Inverse Autoregressive Flow (IAF)

IAF defines a pdf by using a reparametrised version of equations (14) and (15), which we derive later. In this case, the transformed variable is defined as an inverse autoregressive mapping of the following form.

\[y_i = z_i \sigma (\mathbf{z}_{1:i-1}) + \mu (\mathbf{z}_{1:i-1}) \tag{16}\]

Since all \(\mu\) and \(\sigma\) depend only on \(\mathbf{z}\) but not on \(\mathbf{y}\), they can be all computed in parallel, in a single forward pass.

\[\mathbf{y} = \mathbf{z} \circ \sigma (\mathbf{z}) + \mu (\mathbf{z}). \tag{17}\]

To understand how IAF affects the pdf of \(\mathbf{z}\), we can compute the resulting probability density function. Other types of flows admit similar derivations. Here, we assume that \(\mathbf{z}\) follows a unit Gaussian,

\[\log q( \mathbf{z} ) = \log \mathcal{N} (\mathbf{z} \mid \mathbf{0}, \mathbf{I}) = - \sum_{i=1}^d \left( \log z_i + \frac{1}{2} \log 2 \pi \right) = - \frac{d}{2} \log 2 \pi - \sum_{i=1}^d \log z_i. \tag{18}\]

The final pdf can be comprised of \(K \in \mathcal{N}_+\) IAFs. To take this into account, we now set \(\mathbf{z}_k = \mathbf{z}\) and \(\mathbf{z}_{k+1} = \mathbf{y}\); i.e. \(\mathbf{z}_{k+1}\) is the result of transforming \(\mathbf{z}_k\). To factor in subsequent transformations, we need to compute all the Jacobians:

\[\frac{\partial \mathbf{z}_k}{\partial \mathbf{z}_{k-1}} = \underbrace{ \frac{\partial \mu_k}{\partial \mathbf{z}_{k-1}} + \frac{\partial \sigma_k}{\partial \mathbf{z}_{k-1}} \mathrm{diag} ( \mathbf{z}_{k-1} ) }_\text{lower triangular with zeros on the diagonal} + \mathrm{diag}( \sigma_k ) \underbrace{ \frac{\partial \mathbf{z}_{k-1}}{\partial \mathbf{z}_{k-1}} }_{= \mathbf{I}} \tag{19}\]

If \(\mu_k = \mu_k ( \mathbf{z}_{k-1})\) and \(\sigma_k = \sigma_k ( \mathbf{z}_{k-1})\) are implemented as autoregressive transformations (with respect to \(\mathbf{z}_{k-1}\)), then the first two terms in the Jacobian above are lower triangular matrices with zeros on the diagonal. The last term is a diagonal matrix, with \(\sigma_k\) on the diagonal. Thus, the determinant of the Jacobian is just

\[\mathrm{det} \left( \frac{\partial \mathbf{z}_k}{\partial \mathbf{z}_{k-1}} \right) = \prod_{i=1}^d \sigma_{k, i}. \tag{20}\]

Therefore, the final log-probability can be written as

\[\log q_K (\mathbf{z}_K) = \log q(\mathbf{z}) - \sum_{k=0}^K \sum_{i=1}^d \log \sigma_{k, i}. \tag{21}\]

Sampling from an IAF is easy, since we just sample \(\mathbf{z} \sim q(\mathbf{z})\) and then forward-transform it into \(\mathbf{z}_K\). Each of the transformations gives us the vector \(\sigma_k\), so that we can readily evaluate the probability of the sample \(q_K(\mathbf{z}_K)\).

To evaluate the density of a sample not taken from \(q_K\), we need to compute the chain of inverse transformations \(f^{-1}_k\), \(k = K, \dots, 0\). To do so, we have to sequentially compute

\[\mathbf{z}_{k-1, 1} = \frac{\mathbf{z}_{k, 1} - \mu_{k, 1}}{\sigma_{k, 1}},\\ \mathbf{z}_{k-1, i} = \frac{\mathbf{z}_{k, i} - \mu_{k, i} (\mathbf{z}_{k-1, 1:i-1})}{\sigma_{k, i} (\mathbf{z}_{k-1, 1:i-1})}. \tag{22}\]

This can be expensive, but as long as \(\mu\) and \(\sigma\) are implemented as autoregressive transformations, it is possible.

MAF vs IAF

Both MAF and IAF use autoregressive transformations, but in a different way. To see that IAF really is the inverse of MAF and that the equation (16) is in fact a reparametrised version of equation (14), set \(\tilde{z}_i = y_i\), \(\tilde{y}_i = z_i\), \(\tilde{\mu} = -\frac{\mu}{\sigma}\) and \(\tilde{\sigma} = \frac{1}{\sigma}\).

\[(16) \implies \tilde{z}_i = -\frac{\tilde{\mu} (\tilde{\mathbf{y}}_{1:i-1})}{ \tilde{\sigma} (\tilde{\mathbf{y}}_{1:i-1})} + \frac{1}{\tilde{\sigma} (\tilde{\mathbf{y}}_{1:i-1})}\tilde{y}_i = \frac{\tilde{y}_i - \tilde{\mu} (\tilde{\mathbf{y}}_{1:i-1})}{ \tilde{\sigma}(\tilde{\mathbf{y}}_{1:i-1})} = (14).\]

This reparametrisation is useful, because it avoids divisions, which can be numerically unstable. To allow the vectorised form of equations (15) and (17), \(\mu\) and \(\sigma\) have to be implemented as autoregressive functions; and one efficient way to do so is to use MADE-type neural networks (nicely explained in this blog post by Ferenc). In fact, both original papers use MADE as a building block.

To understand the trade-offs between MAF and IAF, it is instructive to study equations (15) and (17) in detail. You will notice, that although the equations look very similar, the position of inputs \(\mathbf{z}\) and outputs \(\mathbf{y}\) is swapped. This is why for IAF, sampling is efficient but density estimation is not, while for MAF, sampling is inefficient while density estimation is very fast.

Parallel WaveNet introduced the notion of Distribution Distillation, which combines the advantages of both types of flows. It trains one model, which closely resembles MAF, for density estimation. Its role is just to evaluate probability of a data point, given that data point. Once this model is trained, the authors instantiate a second model parametrised by IAF. Now, we can draw samples from IAF and evaluate their probability under the MAF. This allows us to compute Monte-Carlo approximation of the KL-divergence between the two probability distributions, which we can use as a training objective for IAF. This way, MAF acts as a teacher and IAF as a student. This clever application of both types of flows allowed to improve efficiency of the original WaveNet by the factor of 300.

What is wrong with VAEs?

Wed, 14 Mar 2018 15:15:00 +0000

Latent Variable Models

Suppose you would like to model the world in terms of the probability distribution over its possible states \(p(\mathbf{x})\) with \(\mathbf{x} \in \mathcal{R}^D\). The world may be complicated and we do not know what form \(p(\mathbf{x})\) should have. To account for it, we introduce another variable \(\mathbf{z} \in \mathcal{R}^d\), which describes, or explains the content of \(\mathbf{x}\). If \(\mathbf{x}\) is an image, \(\mathbf{z}\) can contain information about the number, type and appearance of objects visible in the scene as well as the background and lighting conditions. This new variable allows us to express \(p(\mathbf{x})\) as an infinite mixture model,

\[p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z})~d \mathbf{z}. \tag{1}\]

It is a mixture model, because for every possible value of \(\mathbf{z}\), we add another conditional distribution to \(p(\mathbf{x})\), weighted by its probability.

Having a setup like that, it is interesting to ask what the latent variables \(\mathbf{z}\) are, given an observation \(\mathbf{x}\). Namely, we would like to know the posterior distribution \(p(\mathbf{z} \mid \mathbf{x})\). However, the relationship between \(\mathbf{z}\) and \(\mathbf{x}\) can be highly non-linear (e.g. implemented by a multi-layer neural network) and both \(D\), the dimensionality of our observations, and \(d\), the dimensionality of the latent variable, can be quite large. Since both marginal and posterior probability distributions require evaluation of the integral in eq. (1), they are intractable.

We could try to approximate eq. (1) by Monte-Carlo sampling as \(p(\mathbf{x}) \approx \frac{1}{M} \sum_{m=1}^M p(\mathbf{x} \mid \mathbf{z}^{(m)})\), \(\mathbf{z}^{(m)} \sim p(\mathbf{z})\), but since the volume of \(\mathbf{z}\)-space is potentially large, we would need millions of samples of \(\mathbf{z}\) to get a reliable estimate.

To train a probabilistic model, we can use a parametric distribution - parametrised by a neural network with parameters \(\theta \in \Theta\). We can now learn the parameters by maximum likelihood estimation,

\[\theta^\star = \arg \max_{\theta \in \Theta} p_\theta(\mathbf{x}). \tag{2}\]

The problem is, we cannot maximise an expression (eq. (1)), which we can’t even evaluate. To improve things, we can resort to importance sampling (IS). When we need to evaluate an expectation with respect to the original (nominal) probability density function (pdf), IS allows us to sample from a different probability distribution (proposal) and then weigh those samples with respect to the nominal pdf. Let \(q_\phi ( \mathbf{z} \mid \mathbf{x})\) be our proposal - a probability distribution parametrised by a neural network with parameters \(\phi \in \Phi\). We can write

\[p_\theta(\mathbf{x}) = \int p(\mathbf{z}) p_\theta (\mathbf{x} \mid \mathbf{z})~d \mathbf{z} =\\ \mathbb{E}_{p(\mathbf{z})} \left[ p_\theta (\mathbf{x} \mid \mathbf{z} )\right] = \mathbb{E}_{p(\mathbf{z})} \left[ \frac{q_\phi ( \mathbf{z} \mid \mathbf{x})}{q_\phi ( \mathbf{z} \mid \mathbf{x})} p_\theta (\mathbf{x} \mid \mathbf{z} )\right] = \mathbb{E}_{q_\phi ( \mathbf{z} \mid \mathbf{x})} \left[ \frac{p_\theta (\mathbf{x} \mid \mathbf{z} ) p(\mathbf{z})}{q_\phi ( \mathbf{z} \mid \mathbf{x})} )\right]. \tag{3}\]

From importance sampling literature we know that the optimal proposal is proportional to the nominal pdf times the function, whose expectation we are trying to approximate. In our setting, that function is just \(p_\theta (\mathbf{x} \mid \mathbf{z} )\). From Bayes’ theorem, \(p(z \mid x) = \frac{p(x \mid z) p (z)}{p(x)}\), we see that the optimal proposal is proportional to the posterior distribution, which is of course intractable.

Rise of a Variational Autoencoder

Fortunately, it turns out, we can kill two birds with one stone: by trying to approximate the posterior with a learned proposal, we can efficiently approximate the marginal probability \(p_\theta(\mathbf{x})\). A bit by accident, we have just arrived at an autoencoding setup. To learn our model, we need

\(p_\theta ( \mathbf{x}, \mathbf{z})\) - the generative model, which consists of
- \(p_\theta ( \mathbf{x} \mid \mathbf{z})\) - a probabilistic decoder, and
- \(p ( \mathbf{z})\) - a prior over the latent variables,
\(q_\phi ( \mathbf{z} \mid \mathbf{x})\) - a probabilistic encoder.

To approximate the posterior, we can use the KL-divergence (think of it as a distance between probability distributions) between the proposal and the posterior itself; and we can minimise it.

\[KL \left( q_\phi (\mathbf{z} \mid \mathbf{x}) || p_\theta(\mathbf{z} \mid \mathbf{x}) \right) = \mathbb{E}_{q_\phi (\mathbf{z} \mid \mathbf{x})} \left[ \log \frac{q_\phi (\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z} \mid \mathbf{x})} \right] \tag{4}\]

Our new problem is, of course, that to evaluate the KL we need to know the posterior distribution. Not all is lost, for doing a little algebra can give us an objective function that is possible to compute.

\[\begin{align} KL &\left( q_\phi (\mathbf{z} \mid \mathbf{x}) || p_\theta(\mathbf{z} \mid \mathbf{x}) \right)\\ &=\mathbb{E}_{q_\phi (\mathbf{z} \mid \mathbf{x})} \left[ \log q_\phi (\mathbf{z} \mid \mathbf{x}) - \log p_\theta(\mathbf{z} \mid \mathbf{x}) \right]\\ &=\mathbb{E}_{q_\phi (\mathbf{z} \mid \mathbf{x})} \left[ \log q_\phi (\mathbf{z} \mid \mathbf{x}) - \log p_\theta(\mathbf{z}, \mathbf{x}) \right] + \log p_\theta(\mathbf{x})\\ &= -\mathcal{L} (\mathbf{x}; \theta, \phi) + \log p_\theta(\mathbf{x}) \tag{5} \end{align}\]

Where on the second line I expanded the logarithm, on the third line I used the Bayes’ theorem and the fact that \(p_\theta (\mathbf{x})\) is independent of \(\mathbf{z}\). \(\mathcal{L} (\mathbf{x}; \theta, \phi)\) in the last line is a lower bound on the log probability of data \(p_\theta (\mathbf{x})\) - the so-called evidence-lower bound (ELBO). We can rewrite it as

\[\log p_\theta(\mathbf{x}) = \mathcal{L} (\mathbf{x}; \theta, \phi) + KL \left( q_\phi (\mathbf{z} \mid \mathbf{x}) || p_\theta(\mathbf{z} \mid \mathbf{x}) \right), \tag{6}\] \[\mathcal{L} (\mathbf{x}; \theta, \phi) = \mathbb{E}_{q_\phi (\mathbf{z} \mid \mathbf{x})} \left[ \log \frac{ p_\theta (\mathbf{x}, \mathbf{z}) }{ q_\phi (\mathbf{z} \mid \mathbf{x}) } \right]. \tag{7}\]

We can approximate it using a single sample from the proposal distribution as

\[\mathcal{L} (\mathbf{x}; \theta, \phi) \approx \log \frac{ p_\theta (\mathbf{x}, \mathbf{z}) }{ q_\phi (\mathbf{z} \mid \mathbf{x}) }, \qquad \mathbf{z} \sim q_\phi (\mathbf{z} \mid \mathbf{x}). \tag{8}\]

We train the model by finding \(\phi\) and \(\theta\) (usually by stochastic gradient descent) that maximise the ELBO:

\[\phi^\star,~\theta^\star = \arg \max_{\phi \in \Phi,~\theta \in \Theta} \mathcal{L} (\mathbf{x}; \theta, \phi). \tag{9}\]

By maximising the ELBO, we (1) maximise the marginal probability or (2) minimise the KL-divergence, or both. It is worth noting that the approximation of ELBO has the form of the log of importance-sampled expectation of \(f(\mathbf{x}) = 1\), with importance weights \(w(\mathbf{x}) = \frac{ p_\theta (\mathbf{x}, \mathbf{z}) }{ q_\phi (\mathbf{z} \mid \mathbf{x})}\).

What is wrong with this estimate?

If you look long enough at importance sampling, it becomes apparent that the support of the proposal distribution should be wider than that of the nominal pdf - both to avoid infinite variance of the estimator and numerical instabilities. In this case, it would be better to optimise the reverse \(KL(p \mid\mid q)\), which has mode-averaging behaviour, as opposed to \(KL(q \mid\mid p)\), which tries to match the mode of \(q\) to one of the modes of \(p\). This would typically require taking samples from the true posterior, which is hard. Instead, we can use IS estimate of the ELBO, introduced as Importance Weighted Autoencoder (IWAE). The idea is simple: we take \(K\) samples from the proposal and we use an average of probability ratios evaluated at those samples. We call each of the samples a particle.

\[\mathcal{L}_K (\mathbf{x}; \theta, \phi) \approx \log \frac{1}{K} \sum_{k=1}^{K} \frac{ p_\theta (\mathbf{x},~\mathbf{z^{(k)}}) }{ q_\phi (\mathbf{z^{(k)}} \mid \mathbf{x}) }, \qquad \mathbf{z}^{(k)} \sim q_\phi (\mathbf{z} \mid \mathbf{x}). \tag{10}\]

This estimator has been shown to optimise the modified KL-divergence \(KL(q^{IS} \mid \mid p^{IS})\), with \(q^{IS}\) and \(p^{IS}\) defined as \(q^{IS} = q^{IS}_\phi (\mathbf{z} \mid \mathbf{x}) = \frac{1}{K} \prod_{k=1}^K q_\phi ( \mathbf{z}^{(k)} \mid \mathbf{x} ), \tag{11}\)

\[p^{IS} = p^{IS}_\theta (\mathbf{z} \mid \mathbf{x}) = \frac{1}{K} \sum_{k=1}^K \frac{ q^{IS}_\phi (\mathbf{z} \mid \mathbf{x}) }{ q_\phi (\mathbf{z^{(k)}} \mid \mathbf{x}) } p_\theta (\mathbf{z}^{(k)} \mid \mathbf{x}). \tag{12}\]

While similar to the original distributions, \(q^{IS}\) and \(p^{IS}\) allow small variations in \(q\) and \(p\) that we would not have expected. Optimising this lower bound leads to better generative models, as shown in the original paper. It also leads to higher-entropy (wider, more scattered) estimates of the approximate posterior \(q\), effectively breaking the mode-matching behaviour of the original KL-divergence. As a curious consequence, if we increase the number of particles \(K\) to infinity, we no longer need the inference model \(q\).

Posterior distribution of z for the IWAE (top row) and VAE (bottom row). Figure reproduced from the IWAE paper.

What is wrong with IWAE?

The importance-weighted ELBO, or the IWAE, generalises the original ELBO: for \(K=1\), we have \(\mathcal{L}_K = \mathcal{L}_1 = \mathcal{L}\). It is also true that \(\log p(\mathbf{x}) \geq \mathcal{L}_{n+1} \geq \mathcal{L}_n \geq \mathcal{L}_1\). In other words, the more particles we use to estimate \(\mathcal{L}_K\), the closer it gets in value to the true log probability of data - we say that the bound becomes tighter. This means that the gradient estimator, derived by differentiating the IWAE, points us in a better direction than the gradient of the original ELBO would. Additionally, as we increase \(K\), the variance of that gradient estimator shrinks.

It is great for the generative model, but, as we shown in our recent paper Tighter Variational Bounds are Not Necessarily Better, it turns out to be problematic for the proposal. The magnitude of the gradient with respect to proposal parameters goes to zero with increasing \(K\), and it does so much faster than its variance.

Let \(\Delta (\phi)\) be a minibatch estimate of the gradient of an objective function we’re optimising (e.g. ELBO) with respect to \(\phi\). If we define signal-to-noise ratio (SNR) of the parameter update as

\[SNR(\phi) = \frac{ \left| \mathbb{E} \left[ \Delta (\phi ) \right] \right| }{ \mathbb{V} \left[ \Delta (\phi ) \right]^{\frac{1}{2}} }, \tag{13}\]

where \(\mathbb{E}\) and \(\mathbb{V}\) are expectation and variance, respectively, it turns out that SNR increases with \(K\) for \(p_\theta\), but it decreases for \(q_\phi\). The conclusion here is simple: the more particles we use, the worse the inference model becomes. If we care about representation learning, we have a problem.

Better estimators

We can do better than the IWAE, as we’ve shown in our paper. The idea is to use separate objectives for the inference and the generative models. By doing so, we can ensure that both get non-zero low-variance gradients, which lead to better models.

Signal-to-Noise ratio for the proposal across training epochs for different training objectives.

In the above plot, we compare SNR of the updates of parameters \(\phi\) of the proposal \(q_\phi\) acorss training epochs. VAE, which shows the highest SNR, is trained by optimising \(\mathcal{L}_1\). IWAE, trained with \(\mathcal{L}_{64}\), has the lowest SNR. The three curves in between use different combinations of \(\mathcal{L}_{64}\) for the generative model and \(\mathcal{L}_8\) or \(\mathcal{L}_1\) for the inference model. While not as good as the VAE under this metric, they all lead to training better proposals and generative models than either VAE or IWAE.

As a, perhaps surprising, side effect, models trained with our new estimators achieve higher \(\mathcal{L}_{64}\) bounds than the IWAE itself trained with this objective. Why? By looking at the effective sample-size (ESS) and the marginal log probability of data, it looks like optimising \(\mathcal{L}_1\) leads to producing the best quality proposals, but the worst generative models. If we combine a good proposal with an objective that leads to good generative models, we should be able to provide lower-variance estimate of this objective and thus learn even better models. Please see our paper for details.

Attention in Neural Networks and How to Use It

Sat, 14 Oct 2017 11:00:00 +0000

Attention mechanisms in neural networks, otherwise known as neural attention or just attention, have recently attracted a lot of attention (pun intended). In this post, I will try to find a common denominator for different mechanisms and use-cases and I will describe (and implement!) two mechanisms of soft visual attention.

What is Attention?

Informally, a neural attention mechanism equips a neural network with the ability to focus on a subset of its inputs (or features): it selects specific inputs. Let \(\mathbf{x} \in \mathcal{R}^d\) be an input vector, \(\mathbf{z} \in \mathcal{R}^k\) a feature vector, \(\mathbf{a} \in [0, 1]^k\) an attention vector, \(\mathbf{g} \in \mathcal{R}^k\) an attention glimpse and \(f_\mathbb{\phi}(\mathbf{x})\) an attention network with parameters \(\mathbb{\phi}\). Typically, attention is implemented as

\[\begin{align} \mathbf{a} &= f_\phi(\mathbf{x}), \tag{1} \label{att}\\ \mathbf{g} &= \mathbf{a} \odot \mathbf{z}, \end{align}\]

where \(\odot\) is element-wise multiplication, while \(\mathbf{z}\) is an output of another neural network \(f_\mathbf{\theta} (\mathbf{x})\) with parameters \(\mathbf{\theta}\). We can talk about soft attention, which multiplies features with a (soft) mask of values between zero and one, or hard attention, when those values are constrained to be exactly zero or one, namely \(\mathbf{a} \in \{0, 1\}^k\). In the latter case, we can use the hard attention mask to directly index the feature vector: \(\tilde{\mathbf{g}} = \mathbf{z}[\mathbf{a}]\) (in Matlab notation), which changes its dimensionality and now \(\tilde{\mathbf{g}} \in \mathcal{R}^m\) with \(m \leq k\).

To understand why attention is important, we have to think about what a neural network really is: a function approximator. Its ability to approximate different classes of functions depends on its architecture. A typical neural net is implemented as a chain of matrix multiplications and element-wise non-linearities, where elements of the input or feature vectors interact with each other only by addition.

Attention mechanisms compute a mask which is used to multiply features. This seemingly innocent extension has profound implications: suddenly, the space of functions that can be well approximated by a neural net is vastly expanded, making entirely new use-cases possible. Why? While I have no proof, the intuition is following: the theory says that neural networks are universal function approximators and can approximate an arbitrary function to arbitrary precision, but only in the limit of an infinite number of hidden units. In any practical setting, that is not the case: we are limited by the number of hidden units we can use. Consider the following example: we would like to approximate the product of \(N >> 0\) inputs. A feed-forward neural network can do it only by simulating multiplications with (many) additions (plus non-linearities), and thus it requires a lot of neural-network real estate. If we introduce multiplicative interactions, it becomes simple and compact.

The above definition of attention as multiplicative interactions allow us to consider a broader class of models if we relax the constrains on the values of the attention mask and let \(\mathbf{a} \in \mathcal{R}^k\). For example, Dynamic Filter Networks (DFN) use a filter-generating network, which computes filters (or weights of arbitrary magnitudes) based on inputs, and applies them to features, which effectively is a multiplicative interaction. The only difference with soft-attention mechanisms is that the attention weights are not constrained to lie between zero and one. Going further in that direction, it would be very interesting to learn which interactions should be additive and which multiplicative, a concept explored in A Differentiable Transition Between Additive and Multiplicative Neurons. The excellent distill blog provides a great overview of soft-attention mechanisms.

Visual Attention

Attention can be applied to any kind of inputs, regardless of their shape. In the case of matrix-valued inputs, such as images, we can talk about visual attention. Let \(\mathbf{I} \in \mathcal{R}^{H \times W}\) be an image and \(\mathbf{g} \in \mathcal{R}^{h \times w}\) an attention glimpse i.e. the result of applying an attention mechanism to the image \(\mathbf{I}\).

Hard Attention

Hard attention for images has been known for a very long time: image cropping. It is very easy conceptually, as it only requires indexing. Let \(y \in [0, H - h]\) and \(x \in [0, W - w]\) be coordinates in the image space; hard-attention can be implemented in Python (or Tensorflow) as

g = I[y:y+h, x:x+w]

The only problem with the above is that it is non-differentiable; to learn the parameters of the model, one must resort to e.g. the score-function estimator (REINFORCE), briefly mentioned in my previous post.

Soft Attention

Soft attention, in its simplest variant, is no different for images than for vector-valued features and is implemented exactly as in equation \ref{att}. One of the early uses of this types of attention comes from the paper called Show, Attend and Tell: The model learns to attend to specific parts of the image while generating the word describing that part.

This type of soft attention is computationally wasteful, however. The blacked-out parts of the input do not contribute to the results but still need to be processed. It is also over-parametrised: sigmoid activations that implement the attention are independent of each other. It can select multiple objects at once, but in practice we often want to be selective and focus only on a single element of the scene. The two following mechanisms, introduced by DRAW and Spatial Transformer Networks, respectively, solve this issue. They can also resize the input, leading to further potential gains in performance.

Gaussian Attention

Gaussian attention works by exploiting parametrised one-dimensional Gaussian filters to create an image-sized attention map. Let \(\mathbf{a}_y \in \mathcal{R}^H\) and \(\mathbf{a}_x \in \mathcal{R}^W\) be attention vectors, which specify which part of the image should be attended to in \(y\) and \(x\) axis, respectively. The attention masks can be created as \(\mathbf{a} = \mathbf{a}_y \mathbf{a}_x^T\).

In the above figure, the top row shows \(\mathbf{a}_x\), the column on the right shows \(\mathbf{a}_y\) and the middle rectangle shows the resulting \(\mathbf{a}\). Here, for the visualisation purposes, the vectors contain only zeros and ones. In practice, they can be implemented as vectors of one-dimensional Gaussians. Typically, the number of Gaussians is equal to the spatial dimension and each vector is parametrised by three parameters: centre of the first Gaussian \(\mu\), distance between centres of consecutive Gaussians \(d\) and the standard deviation of the Gaussians \(\sigma\). With this parametrisation, both attention and the glimpse are differentiable with respect to attention parameters, and thus easily learnable.

Attention in the above form is still wasteful, as it selects only a part of the image while blacking-out all the remaining parts. Instead of using the vectors directly, we can cast them into matrices \(A_y \in \mathcal{R}^{h \times H}\) and \(A_x \in \mathcal{R}^{w \times W}\), respectively. Now, each matrix has one Gaussian per row and the parameter \(d\) specifies distance (in column units) between centres of Gaussians in consecutive rows. The glimpse is now implemented as

\[\mathbf{g} = A_y \mathbf{I} A_x^T.\]

I used this mechanism in HART, my recent paper on biologically-inspired object tracking with RNNs with attention. Here is an example with the input image on the left hand side and the attention glimpse on the right hand side; the glimpse shows the box marked in the main image in green:

The code below lets you create one of the above matrix-valued masks for a mini-batch of samples in Tensorflow. If you want to create \(A_y\), you would call it as Ay = gaussian_mask(u, s, d, h, H), where u, s, d are \(\mu, \sigma\) and \(d\), in that order and specified in pixels.

def gaussian_mask(u, s, d, R, C):
    """
    :param u: tf.Tensor, centre of the first Gaussian.
    :param s: tf.Tensor, standard deviation of Gaussians.
    :param d: tf.Tensor, shift between Gaussian centres.
    :param R: int, number of rows in the mask, there is one Gaussian per row.
    :param C: int, number of columns in the mask.
    """
    # indices to create centres
    R = tf.to_float(tf.reshape(tf.range(R), (1, 1, R)))
    C = tf.to_float(tf.reshape(tf.range(C), (1, C, 1)))
    centres = u[np.newaxis, :, np.newaxis] + R * d
    column_centres = C - centres
    mask = tf.exp(-.5 * tf.square(column_centres / s))
    # we add eps for numerical stability
    normalised_mask = mask / (tf.reduce_sum(mask, 1, keep_dims=True) + 1e-8)
    return normalised_mask

We can also write a function to directly extract a glimpse from the image:

def gaussian_glimpse(img_tensor, transform_params, crop_size):
    """
    :param img_tensor: tf.Tensor of size (batch_size, Height, Width, channels)
    :param transform_params: tf.Tensor of size (batch_size, 6), where params are  (mean_y, std_y, d_y, mean_x, std_x, d_x) specified in pixels.
    :param crop_size): tuple of 2 ints, size of the resulting crop
    """
    # parse arguments
    h, w = crop_size
    H, W = img_tensor.shape.as_list()[1:3]
    split_ax = transform_params.shape.ndims -1
    uy, sy, dy, ux, sx, dx = tf.split(transform_params, 6, split_ax)
    # create Gaussian masks, one for each axis
    Ay = gaussian_mask(uy, sy, dy, h, H)
    Ax = gaussian_mask(ux, sx, dx, w, W)
    # extract glimpse
    glimpse = tf.matmul(tf.matmul(Ay, img_tensor, adjoint_a=True), Ax)
    return glimpse

Spatial Transformer

Spatial Transformer (STN) allows for much more general transformation that just differentiable image-cropping, but image cropping is one of the possible use cases. It is made of two components: a grid generator and a sampler. The grid generator specifies a grid of points to be sampled from, while the sampler, well, samples. The Tensorflow implementation is particularly easy in Sonnet, a recent neural network library from DeepMind.

def spatial_transformer(img_tensor, transform_params, crop_size):
    """
    :param img_tensor: tf.Tensor of size (batch_size, Height, Width, channels)
    :param transform_params: tf.Tensor of size (batch_size, 4), where params are  (scale_y, shift_y, scale_x, shift_x)
    :param crop_size): tuple of 2 ints, size of the resulting crop
    """
    constraints = snt.AffineWarpConstraints.no_shear_2d()
    img_size = img_tensor.shape.as_list()[1:]
    warper = snt.AffineGridWarper(img_size, crop_size, constraints)
    grid_coords = warper(transform_params)
    glimpse = snt.resampler(img_tensor[..., tf.newaxis], grid_coords)
    return glimpse

Gaussian Attention vs. Spatial Transformer

Both Gaussian attention and Spatial Transformer can implement a very similar behaviour. How do we choose which to use? There are several nuances:

Gaussian attention is an over-parametrised cropping mechanism: it requires six parameters, but there are only four degrees of freedom (y, x, height width). STN needs only four parameters.
I haven’t run any tests yet, but STN should be faster. It relies on linear interpolation at sampling points, while the Gaussian attention has to perform two huge matrix multiplications. STN could be an order of magnitude faster (in terms of pixels in the input image).
Gaussian attention should be (no tests run) easier to train. This is because every pixel in the resulting glimpse can be a convex combination of a relatively big patch of pixels of the source image, which (informally) makes it easier to find the cause of any errors. STN, on the other hand, relies on linear interpolation, which means that gradient at every sampling point is non-zero only with respect to the two nearest pixels in each axis.

A Minimum Working Example

Let’s create a minimum working example of Gaussian Attention and STN. First, we need to import a few libraries, define sizes and create an input image and a crop.

import tensorflow as tf
import sonnet as snt
import numpy as np
import matplotlib.pyplot as plt

img_size = 10, 10
glimpse_size = 5, 5

# Create a random image with a square
x = abs(np.random.randn(1, *img_size)) * .3
x[0, 3:6, 3:6] = 1
crop = x[0, 2:7, 2:7] # contains the square

Now, we need placeholders for Tensorflow variables.

tf.reset_default_graph()

# placeholders
tx = tf.placeholder(tf.float32, x.shape, 'image')
tu = tf.placeholder(tf.float32, [1], 'u')
ts = tf.placeholder(tf.float32, [1], 's')
td = tf.placeholder(tf.float32, [1], 'd')
stn_params = tf.placeholder(tf.float32, [1, 4], 'stn_params')

We can now define the Tensorflow expression for Gaussian Attention and STN glimpses.

# Gaussian Attention
gaussian_att_params = tf.concat([tu, ts, td, tu, ts, td], -1)
gaussian_glimpse_expr = gaussian_glimpse(tx, gaussian_att_params, glimpse_size)

# Spatial Transformer
stn_glimpse_expr = spatial_transformer(tx, stn_params, glimpse_size)

Let’s run those expressions and plot them:

sess = tf.Session()

# extract a Gaussian glimpse
u = 2
s = .5
d = 1
u, s, d = (np.asarray([i]) for i in (u, s, d))
gaussian_crop = sess.run(gaussian_glimpse_expr, feed_dict={tx: x, tu: u, ts: s, td: d})

# extract STN glimpse
transform = [.4, -.1, .4, -.1]
transform = np.asarray(transform).reshape((1, 4))
stn_crop = sess.run(stn_glimpse_expr, {tx: x, stn_params: transform})


# plots
fig, axes = plt.subplots(1, 4, figsize=(12, 3))

titles = ['Input Image', 'Crop', 'Gaussian Att', 'STN']
imgs = [x, crop, gaussian_crop, stn_crop]
for ax, title, img in zip(axes, titles, imgs):
    ax.imshow(img.squeeze(), cmap='gray', vmin=0., vmax=1.)
    ax.set_title(title)
    ax.xaxis.set_visible(False)
    ax.yaxis.set_visible(False)

You can find a Jupyter Notebook with the code used to create the above here

Closing Thoughts

Attention mechanisms expand capabilities of neural networks: they allow approximating more complicated functions, or in more intuitive terms, they enable focusing on specific parts of the input. They have led to performance improvements in natural language benchmarks, as well as to entirely new capabilities such as image captioning, addressing in memory networks and neural programmers.

I believe that the most important cases in which attention is useful have not been discovered yet. For example, we know that objects in videos are consistent and coherent, e.g. they do not disappear into thin air between frames. Attention mechanisms can be used to express this consistency prior. How? Stay tuned.

Conditional KL-divergence in Hierarchical VAEs

Sun, 10 Sep 2017 13:57:00 +0000

Inference is hard and often computationally expensive. Variational Autoencoders (VAE) lead to an efficient amortised inference scheme, where amortised means that once the model is trained (which can take a long time), the inference has constant computational complexity. Variational Autoencoders (VAE) learn the approximate posterior distribution \(q(z\mid x)\) over some latent variables \(z\) by maximising a lower bound on the true data likelihood \(p(x)\). This is useful, because the latent variables explain what we see (\(x\)), and often in a concise form.

One problem with VAEs is that we have to assume some functional form for \(q\). While the majority of papers take the Gaussian distribution with a diagonal covariance matrix, it has been shown that more complex (e.g. multi-modal) approximate posterior distributions can improve the quality of the model, with a good example being the normalizing flows paper.

Normalizing flows take a simple probability distribution (here: a Gaussian) and apply a series of invertible transformations to get a more complicated distribution. While useful, the resulting distribution is limited by the form of the transforming functions, which in this case have to be invertible. Another way of achieving the same goal is to split the latent variables into two groups \(z = \{u, v\}\), say, and express the joint distribution as \(q(z) = q(u, v) = q(u \mid v) q(v)\) by using the product rule of probability. The conditional distribution \(q(u \mid v)\) can depend on \(v\) in a highly non-linear fashion (it can be implemented as a neural net). Even though both the marginal \(q(v)\) and the conditional \(q(u \mid v)\) can be Gaussians, their joint might be highly non-Gaussian. Consider the below example and the resulting density plot (in the plot \(x=v\) and \(y=u\)).

\[\begin{align} q(v) &= \mathcal{N} (v \mid 0, I)\\ q(u \mid v) &= \mathcal{N} (u \mid Fv, Fvv^TF^T + \beta I), \tag{1} \label{hierarchical} \end{align}\]

The above density plot shows a highly non-Gaussian probability distribution. \(x \sim q(v)\) is, in fact, a Gaussian random variable, but \(y \sim q(u \mid v)\) is not, since its variance is not constant and depends on its mean: the variance increases with the increasing distance from the mean, resulting in heavy tails. In the above plot, \(q(u \mid v)\) is obtained as a simple transformation of \(v \sim q(v)\), which could be implemented by a one-layer neural net; see Importance Weighted Autoencoders for a more general example. This simple scheme results in a VAE with a hierarchy of latent variables and can lead to much more complicated posterior distributions, but it also leads to a more complicated (and ambiguous) optimisation procedures. Let me elaborate.

As part of the variational objective, the learning process is optimising the Kullback-Leibler divergence \(KL[q \mid p]\) between the approximate posterior \(q\) and a prior over the latent variables \(p\). KL is an asymmetric measure of similarity between two probability distributions \(q\) and \(p\) that is often used in machine learning. It can be interpreted as the information gain from using \(q\) instead of \(p\), or in the context of coding theory, the extra number of bits to code samples from \(q\) by using \(p\). You can read more about information measures in this cheat sheet. It is defined as

\[KL[q(z) \mid \mid p(z)] = \int q(z) \log \frac{q(z)}{p(z)} \mathrm{d}z. \tag{2} \label{kl_def}\]

If we split the random variable \(z\) into two disjoint sets \(z = \{u, v\}\) as above, the KL factorises as

\[\begin{align} KL[q(u, v) \mid \mid p(u, v)] &= \iint q(u, v) \log \frac{q(u, v)}{p(u, v)} \mathrm{d}u \mathrm{d}v\\ % sum of integrals &= \int q(v) \log \frac{q(v)}{p(v)} \mathrm{d}v + \int q(v) \int q(u \mid v) \log \frac{q(u \mid v)}{p(u \mid v))} \mathrm{d}u \mathrm{d}v \tag{3}\\ % sum of KLs &= KL[q(v) \mid \mid p(v)] + KL[q(u \mid v) \mid \mid p(u \mid v)], \label{conditional_kl} \end{align}\]

where \(KL[q(u \mid v) \mid \mid p(u \mid v)] = \mathbb{E}_{q(v)} \left[ \tilde{KL}[q(u \mid v) \mid \mid p(u \mid v)] \right]\) is known as the conditional KL-divergence, with

\[\tilde{KL}[q(u \mid v) \mid \mid p(u \mid v) = \int q(u \mid v) \log \frac{q(u \mid v)}{p(u \mid v))} \mathrm{d}u \tag{4}.\]

The conditional KL-divergence amounts to the expected value of the KL-divergence between conditional distributions \(q(u \mid v)\) and \(p(u \mid v)\), where the expectation is taken with respect to \(q(v)\). Since KL-divergence is non-negative, both terms are non-negative. KL is equal to zero only when both probability distributions are exactly equal. The conditional KL is equal to zero when both conditional distributions are exactly equal on the whole support defined by \(q(v)\). This last bit makes it difficult to optimise with respect to the parameters of both distributions.

Let \(q(z) = q_\psi(u, v) = q_\phi (u \mid v) q_\theta(v)\), such that the posterior is parametrised by \(\psi = \begin{bmatrix} \phi\\ \theta\end{bmatrix}\). If we look at the gradient of the KL divergence, we have that

\[\begin{align} \nabla_\psi &KL[q_\psi(u, v) \mid \mid p(u, v)] = \begin{bmatrix} \nabla_\phi KL[q_\psi(u, v) \mid \mid p(u, v)] \\ \nabla_\theta KL[q_\psi(u, v) \mid \mid p(u, v)] \end{bmatrix} \tag{5}, \end{align}\]

with

\[\nabla_\phi KL[q_\psi(u, v) \mid \mid p(u, v)] = \nabla_\phi KL[q_\phi(u \mid v) \mid \mid p(u \mid v)] \tag{6},\] \[\nabla_\theta KL[q_\psi(u, v) \mid \mid p(u, v)] = \nabla_\theta KL[q_\theta(v) \mid \mid p(v)] + \nabla_\theta KL[q_\phi(u \mid v) \mid \mid p(u \mid v)], \tag{7}\]

where the gradient with respect to the parameters of the lower-level distribution \(q_\theta(v)\) is comprised of two components. The second component is problematic. Let’s have a closer look:

\[\begin{align} \nabla_\theta &KL[q_\phi(u \mid v) \mid \mid p(u \mid v)] = \nabla_\theta \mathbb{E}_{q_\theta(v)} \left[ \tilde{KL}[q_\phi (u \mid v) \mid \mid p(u \mid v) \right] \tag{8}\\ &= \mathbb{E}_{q_\theta(v)} \left[ \tilde{KL}[q_\phi (u \mid v) \mid \mid p(u \mid v)] \nabla_\theta \log q_\theta(v) \right], \end{align}\]

where in the second line we used the log-derivative trick (suggested here by Max Soelch, thanks!). This score-function formulation makes it clear that following the (negative, as in SGD) gradient estimate maximises the probability of samples for which the conditional-KL divergence has the lowest values. In particular, it might be easier to change the support \(q_\theta(v)\) to a volume where both conditionals have very small values instead of optimising \(q_\phi(u \mid v)\). From my experience, it happens especially when the value of the conditional KL is much bigger than the value of the first KL term.

An alternative approach would be to optimise the conditional-KL only with respect to the parameters of the distribution inside the expectation: \(\phi\). That would result in the following gradient equation:

\[\nabla_\psi KL[q_\psi(u, v) \mid \mid p(u, v)] = \begin{bmatrix} 0\\ \nabla_\theta KL[q_\theta(v) \mid \mid p(v)] \end{bmatrix} + \begin{bmatrix} \nabla_\phi KL[q_\phi(u \mid v) \mid \mid p(u \mid v)] \\ 0 \end{bmatrix} \tag{9}\]

This optimisation scheme resembles the Expectation-Maximisation (EM) algorithm. In the E step, we compute the expectations, while in the M step we fix the parameters with respect to which the expectations were computed and we maximise with respect to the functions inside the expectation. In EM we do this, because maximum-likelihood with latent variables often does not have closed-form solutions. The motivation here is to make the optimisation more stable.

I wrote this blog post, because I have no idea whether this changed optimisation procedure is justified in any way. What do you think? I would appreciate any comments.

Implementing Attend, Infer, Repeat

Sun, 03 Sep 2017 14:44:17 +0000

Variational Autoencoders (VAEs) are getting more and more popular in the Machine Learning community. While the formulation is more involved then that of a typical feed-forward neural network, VAEs have a lot of added benefits. I’ve been recently playing with one of the more complicated VAE models: Attend, Infer, Repeat (AIR) by Ali Eslami et. al. from DeepMind, and I must say it’s really cool. In this blog post, I will describe the model and break it down into simple components. We will also cover parts of the implementation and some issues I had while implementing it. The full implementation is available here.

What does AIR do?

AIR aims to reconstruct an image, but instead of doing it in a single shot, it focuses on interesting image parts one-by-one. The figure below demonstrates AIR’s inner workings. It takes a look at the image, figures out how many interesting parts there are and where they are in the image. It then reconstructs them by painting one-part-at-a-time onto a blank canvas. AIR takes a look at the image, figures out how many interesting parts there are, and reconstructs it by painting one-part-at-a-time onto a blank canvas. Sounds easy enough? Well, it’s not, and for two reasons:

It’s completely unsupervised,
It takes a variable yet discrete number of steps.

The first one is tricky, because we don’t really know how to define an object or an interesting part (more on this later). The second leads to discrete latent variables, which are not-that-easy to deal with when computing gradients.

Let’s go back to the figure. AIR is called Attend, Infer, Repeat for a reason:

It attends to a part of an image using spatial transformers, effectively cropping it,
then it infers the latent variables best describing the crop,
and finally it repeats the procedure for the rest of the image.

Technically, the order is different, because it has to infer presence of an object and its location before attending to it; and the name describes only the inference process, not reconstruction.

What’s beautiful, is that we get a variable-length representation of the image: the more complicated the image is, the longer the representation will be. What’s even better, is that we know that each piece of description is tied to a particular location (and hopefully an object), which allows explicit reasoning about objects and relations between them.

Results

Measuring performance of generative models is always tricky, and I’d recommend this paper for a discussion. Here are some plots similar to the ones reported by the AIR paper. The first row of the topmost figure shows the input images, rows 2-4 are reconstructions at steps 1, 2 and 3 (with marked location of the attention glimpse in red, if it exists). Rows 5-7 are the reconstructed image crops, and above each crop is the probability of executing 1, 2 or 3 steps. If the reconstructed crop is black and there is “0 with …” written above it, it means that this step was not used (3rd step is never used, hence the last row is black). Click on the image for a higher-resolution view.

At every time-step, AIR chooses where to look in the image. The image on the left hand-side visualises the localisation policy of the spatial transformer, with red corresponding to the first step and green to the second. We see that the scanning policy is spatial with the majority of first steps located on the left hand-side of the image. The plot on the right hand-side is the counting accuracy on the validation set while training for 300k iterations, evaluated every 10k iterations.

AIR can reach almost 100% accuracy in counting objects, but this outcome does heavily depend on initialisation. Very often (80% of the time) the model converges to either zero or the maximum number of steps and fails to converge to the preferred solution.

Why and how does it work?

Like every VAE, AIR is trained by maximising the evidence lower bound (ELBO) \(\mathcal{L}\) on the log probability of the data:

\[\begin{align*} \log p(x) &= \mathcal{L}(\theta, \phi) + KL(q_\phi(z \mid x) \mid\mid p(z \mid x)),\\ \mathcal{L}(\theta, \phi) &= \mathbb{E}_{q_\phi(z)} [\log p_\theta(x \mid z)]] - KL(q_\phi(z\mid x) \mid\mid p(z)). \end{align*}\]

The first term of the ELBO is a probabilistic analog of the reconstruction error and the second term acts as a regulariser. For AIR, the second term tries to keep the number of steps low, but it’s also forcing the latent encoding of each image part to be as short as possible.

Short encoding means that the model has to focus on parts of the image that can be explained with relatively few variables. It turns out that we can define an object as an image patch, where pixel correlations within that patch are strong, but the correlation between pixels inside and outside of that patch is weak. We can also assume that pixels belonging to two different objects have very low correlation (as long as the two objects appear independently of each other). That means that explaining even small parts of two different objects at the same time leads to potentially longer encoding than explaining one (potentially big) object at a time. This leads, at least in case of uncomplicated backgrounds as in the paper, to a model which learns to take the minimum number of steps possible, where every step explains an internally-consistent part of the image.

What do we need?

We will start by defining a few core components. AIR is an autoencoder, and we will need an encoder and a decoder, but there’s more than that, namely:

Input encoder: transforms the input image \(x\) into some hidden representation \(v\).
RNN: Since we’re taking multiple peeks at the image, we need some hidden state \(h\) to keep track of what has already been explained. It creates the new hidden state as

\[\begin{align} h^{i+1} = RNN(v, h^i, z^i), \end{align}\]

where \(z^i = \{z^i_{what}, z^i_{where}, z^i_{pres}\}\) are the latent variables describing the appearance, location and presence of an object, respectively.

Presence & Location models: Given the hidden state \(h^i\), they predict \(z^i_{pres}\) and \(z^i_{where}\).
Spatial Transformer: Given the location parameters \(z^i_{pres}\), it extracts a crop of the original input image \(x^i_{att}\). It will later place a reconstructed crop \(y^i_{att}\) into the canvas.
Glimpse encoder: It encodes \(x^i_{att}\) into a low-dimensional latent representation \(z^i_{what}\).
Glimpse decoder: It decodes \(z^i_{what}\) in the reconstructed glimpse \(y^i_{att}\).

I defined all these components in a single file as Sonnet modules. Since we don’t want to dwell on complicated architectures, I used small multi-layer perceptrons (MLPs) with 2 hidden layers of 256 units each and ELU nonlinearities for every component. My RNN is a 256-dimensional LSTM core from Sonnet with a trainable initial state. I put together all the modules into a working tf.RNNCell in the cell.py.

Probability Distributions

One reason why VAEs are more complicated than standard neural nets are the probability distributions. Each of the latent variables \(z\) is not just predicted by the corresponding model; the model predicts parameters of a probability distribution, and then we randomly sample from it. \(z_{what}\) and \(z_{where}\) both come from Gaussian distributions with diagonal covariance matrices, whose means and variances are predicted by MLPs (glimpse encoder and location model, respectively). I used tf.NormalWithSoftplusScale for numerical stability of the scale parameters. \(z_{pres}\) is much more tricky. At inference time, it comes from a Bernoulli distribution parametrised by an output of the presence model. When the previous sample was equal to 1, we take the current sample as is. As soon as we draw a sample equal to zero, however, all subsequent samples have to be set to zero, too. This ancestral-sampling scheme results in a modified geometrical distribution, for which we have to account when we implement the KL-divergence with the prior. For this reason, I implemented a NumStepsDistribution in prior.py that creates the modified geometric distribution given Bernoulli probabilities at consecutive steps.

Piors

Every VAE requires a prior on its latent representation. AIR requires at least three priors for three different latent variables. I used a Normal(0, 1) prior for both \(z_{what}\) and \(z_{where}\) and a modified geometric-like prior for \(z_{pres}\) (number of steps). Setting its success probability is tricky, though. The paper mentions only that it is uses a “geometric prior which encourages sparse solutions”, which tells us only that the success probability in the geometric distribution is low. When I emailed the author, I found out that he annealed the success probability from a value close to 1 to either \(10^{-5}\) or \(10^{-10}\) depending on the dataset over the course of 100k training iterations.

Intuitively, it makes sense. At the beginning of training, we would like the model to take a positive number of steps so that it can learn. The further we go into the training, the more we can constrain it. Very low values of the success probability are important, because the reconstruction loss is summed across the whole image (it has to be: in the derivation of the loss, pixels are assumed to be conditionally independent given \(z_{what}\) and log probability of independent events results in a sum) and KL-divergence has to compete with it during the optimisation.

Estimating Gradients for Discrete Variables

Discrete variables, or more specifically, samples from a discrete probability distribution, are difficult to back-propagate through. AIR uses a score-function estimator, otherwise known as REINFORCE. More about it here. This estimator is difficult to work with, because the estimate has a high variance. It expresses the gradient of an expectation of a smooth function (here \(\mathcal{L}\)) as the expectation of the gradient of the log-probability with respect to which the expectation is taken multiplied by that function.

\[\begin{align} \nabla_\phi \mathbb{E}_{q_\phi(z)} [ \mathcal{L} (z)] = \mathbb{E}_{q_\phi(z)} [\mathcal{L}(z) \nabla_\phi \log q_\phi(z) ] \end{align}\]

It turns out that the expectation of this expression is equal to zero, and therefore we can add an arbitrary term with zero expectation without changing the result. If what we add is negatively correlated with \(\mathcal{L}\), we will reduce variance. AIR uses “neural baselines” and cites Neural Variational Inference and Learning in Belief Networks by A. Mnih and K. Gregor, but doesn’t give much detail.

Do we really need to reduce variance? Well, yes. I’ve measured variance on a per-parameter basis for the AIR model. Back-propagation results in variance on the order of \(10^{-2}\). There is some variance, as we’d expect from Stochastic Gradient Decent, but it’s not huge. Due to discrete latent variables, gradient of some of the parameters comes only from the REINFORCE formulation, and its variance is on the order of \(10^3\). It’s five orders of magnitude higher, and I wouldn’t expect it to be very useful for training. The neural baseline reduces the variance to about \(10^{-1}\). It’s still higher than from back-prop, but usable.

I used an MLP with 2 hidden layers of 256 and 128 neurones, respectively, with a single output unit. As input, I used the original flattened image concatenated with all latent variables produced by the main model. The baseline is trained to minimise the mean-squared error with the current reconstruction error (\(-\mathbb{E}_{q_\phi(z)} [\log p_\theta(x \mid z)]\)) of the main model as the target. The learning rate used for training this auxiliary model was set 10 times higher than the learning rate of the base model.

To see how REINFORCE with a neural baseline is implemented, have a look at the AIRModel._reinforce method in model.py.

Issues

My implementation is very fragile. It recovers the performance reported in the paper once for about 5 training runs. I’m not saying it’s an issue with the model, it’s probably just my implementation. If anyone has ideas how to improve it, please let me know.
If I change the multi-MNIST dataset to have smaller digits, the model doesn’t count as well (number of steps is wrong). That’s probably an issue of my implementation, too.
It is sensitive to initialisation of the output layers that produce the final reconstruction but also of the “where” and “pres” latent variables. If the reconstruction has too big values at the beginning of the training, the number of steps shrinks to zero and the model never recovers. Similar things happen when “where” latent variable has too big a variance at the beginning. This behaviour is obvious in hindsight, but it wasn’t that clear while implementing.

Conclusion

It’s a really cool if a bit complicated model. I hope this post has brought you closer to understanding of what’s going on in the paper. I’ve implemented it because I have a few ideas on how to use it in my research. Feel free to reach out if you have any questions or comments.

Adam Kosiorek

Geometry in Text-to-Image Diffusion Models

Text-to-Image Models Know About Geometry

Extracting Geometry from an Image Model

DreamFusion/SJC Algorithm

Why Does Extracting Geometry Lead to Cartoonish Objects?

Why Only Objects? What Happened to Full 3D Scenes?

View-Conditioned Follow-ups

Injecting Geometry into an Image Model

Technical: SceneScape Algorithm

Why is it important for the image model to be text conditioned?

Conclusions

Acknowledgements

Footnotes

Masking for Representation Learning in Vision

Masking and the Brain

BERT or Why Inpaint for Representation Learning?

What is a Visual Word?

Finding Visual Words

Visual Words from Before Deep-Learning

Learning Visual Words

Adversarial Inference-Occlusion Self-supervision (ADIOS)

How Does Masking Apply to Reconstruction-Free Learning (RFL)?

Summary

Acknowledgements

Footnotes

Machine Learning of Sets

Notation

Set To Vector

Vector To Set

Permutation-Invariant Loss Functions

Gradient Descent to the Rescue!

Attention is All You Need, Really

What about those Point Processes??!!

Set To Set

Outlook and Conclusion

Further Reading

Footnotes

Acknowledgements

Stacked Capsule Autoencoders

Why do we care about equivariances?

Capsules learn equivariant object representations

Can an arbitrary neural net learn capsule-like representations?

Why bother?

Technical bits

How can we turn object prototypes into an image?

Where do we get capsule parameters from?

Inferring parts and poses

Organizing parts into objects

Summary

Further reading:

Footnotes

Acknowledgements

Forge, or how do you manage your machine learning experiments?

Forge

Configs

Smart checkpoints

Working Example

Closing thoughts

Acknowledgements

Normalizing Flows

Change of Variables in Probability Distributions

Simple Flows

Planar Flow

Radial Flow

Discussion

Autoregressive Flows

Real Non-Volume Preserving Flows (R-NVP)

Autoregressive Transformation

Masked Autoregressive Flow (MAF)

Inverse Autoregressive Flow (IAF)

MAF vs IAF

Further reading

Acknowledgements

What is wrong with VAEs?

Latent Variable Models

Rise of a Variational Autoencoder

What is wrong with this estimate?

What is wrong with IWAE?

Better estimators