Designing for Equivariance to Perceptual Variation in Images

By Yulong Yang, Felix O’Mahony and Christine Allen-Blanchette
Allen-Blanchette Group, Princeton University

Introduction

A huge challenge for neural networks is generalizing to out-of-distribution data. For example, a neural network classifier trained on blue digits may perform well when presented with a previously unseen blue digit, but fail to correctly classify green digits.


Figure 1: The challenge of generalization. A neural network classifier trained on blue MNIST (LeCun et al., 2002) digits is able to consistently produce the correct labels when tested on blue MNIST digits (98.46% accuracy). However, the same model is unable to reliably predict the label of green MNIST digits (42.62% accuracy).

This performance degradation can be attributed to the fact that green digits are not drawn from the same distribution as blue digits. Designing networks that perform well in the presence of perceptual variation is important in real-world settings and in settings where data may be scarce (e.g., robotics and medical imaging).

One way to build neural networks that generalize to known transformations is to constrain the network architecture to be transformation equivariant. A function is said to be equivariant if a change to the input yields a corresponding change to the output. Mathematically, we describe a function f as equivariant to transformations g if

f(g · x) = g · f(x)

Neural networks that are equivariant to a particular transformation class can generalize their performance on one example to all transformed instances of that example. As a consequence, equivariant neural networks can achieve better performance than conventional neural networks even when designed with fewer network parameters and trained on fewer examples. There is a substantial body of work on transformation-equivariant neural networks, most of it addressing geometric transformations such as translations, rotations, scaling, and mirroring. In this work, we develop a neural network that is equivariant to perceptual transformations: changes in hue, saturation, and luminance.
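To make the definition concrete, here is a minimal numerical check of the equivariance property for a familiar case: an ordinary 2D convolution with circular padding is equivariant to cyclic image translations. This is only an illustration of the property in PyTorch, not the model described in this post.

```python
# Check f(g . x) = g . f(x) numerically for a translation-equivariant map:
# a 2D convolution with circular padding and a cyclic shift of the image.
import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 3, 32, 32)

def g(t):
    # the group action: a cyclic translation of the image by 5 pixels
    return torch.roll(t, shifts=5, dims=-1)

lhs = f(g(x))  # transform the input, then apply the network
rhs = g(f(x))  # apply the network, then transform the output
print(torch.allclose(lhs, rhs, atol=1e-5))  # True: f is translation-equivariant
```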

The Hue-Saturation-Luminance Color Space

In digital image processing, as in the human visual system, it is common to represent color as a combination of red, green, and blue values. We call this representation space the Red-Green-Blue (RGB) color space. We can visualize the RGB color space as a cube, where each axis indicates the strength of the red, green, or blue color channel (see Figure 2a).

Another way to represent color is in the Hue-Saturation-Luminance (HSL) color space. Instead of a cube, the HSL color space is usually visualized as a cylinder (see Figure 2b). In this color space, the hue channel defines the base color, the saturation channel defines how vibrant or ‘washed-out’ the base color appears, and the luminance (or lightness) channel defines how bright the base color appears.
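If you want to experiment with the two color spaces yourself, Python's built-in colorsys module converts between them (note that it uses the channel order hue, lightness, saturation, with every value in [0, 1]):

```python
# Convert an RGB pixel to HSL and rotate its hue by a quarter turn.
# colorsys uses the order (hue, lightness, saturation), all in [0, 1].
import colorsys

r, g, b = 0.2, 0.6, 0.9                      # a blue pixel in RGB
h, l, s = colorsys.rgb_to_hls(r, g, b)
print(h, l, s)                               # ~0.571, 0.55, ~0.778

# Shifting the hue by 90 degrees (a quarter of a full turn) and converting
# back gives a pixel with the same lightness and saturation but a new base color.
h_shifted = (h + 90 / 360) % 1.0
print(colorsys.hls_to_rgb(h_shifted, l, s))  # a magenta with the same 'vibrance'
```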


Figure 2: RGB and HSL color spaces. (a) The RGB color space is visualized as a cube [1], and (b) the HSL color space is visualized as a cylinder [2].

Color Equivariant Network

We use the geometry of the HSL color space to design a neural network that is equivariant to color transformations. First, we consider the question: What is the geometry of the HSL color space? Looking at the image in Figure 2b, we see that the hue channel has the same geometry as the circle (i.e., the 2D rotations), and the geometry of the saturation and lightness channels can be approximated with the geometry of a line (i.e., the 1D translations).
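Concretely (in our notation), a color transformation with hue rotation θ and saturation and luminance shifts t_s and t_l acts on a pixel's HSL values (h, s, l) as

(θ, t_s, t_l) · (h, s, l) = (h + θ mod 360◦, s + t_s, l + t_l),

where the hue addition wraps around the circle like a rotation, while the saturation and luminance shifts are treated (approximately) as translations of a line.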

Next, we consider the question: How do we represent an input image using this geometry? A neural network equivariant to a specific transformation typically uses a lifting layer to give an input image the geometry of the transformation space. For example, if we wanted to design a neural network that could recognize images of objects at 0-, 90-, 180-, and 270-degree rotations, a lifting layer could take an input image and transform it by each of these rotations before passing it to a rotation-equivariant neural network. In our color equivariant network, we want the network to recognize objects in images regardless of their color, so we lift input images to the HSL transformation space by varying their hue, saturation, and luminance values (see Figure 3b), and then pass them to a neural network that is equivariant to the HSL geometric transformations (i.e., the 2D rotations and 1D translations).
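As a rough sketch of what a hue lifting layer might look like (the function name, the use of torchvision's adjust_hue, and the choice of three rotations are our own illustration, not the exact implementation from the paper):

```python
# Lift an RGB image to the hue dimension: produce n hue-rotated copies of the
# image, indexed by the hue rotation applied. Illustrative sketch only.
import torch
import torchvision.transforms.functional as TF

def lift_hue(img: torch.Tensor, n_rotations: int = 3) -> torch.Tensor:
    """img: (3, H, W) RGB tensor in [0, 1] -> (n_rotations, 3, H, W)."""
    copies = []
    for k in range(n_rotations):
        shift = k / n_rotations          # fraction of a full turn around the hue circle
        if shift > 0.5:                  # adjust_hue expects a factor in [-0.5, 0.5]
            shift -= 1.0
        copies.append(TF.adjust_hue(img, shift))
    return torch.stack(copies, dim=0)

lifted = lift_hue(torch.rand(3, 32, 32), n_rotations=3)
print(lifted.shape)                      # torch.Size([3, 3, 32, 32])
```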

Let’s take a closer look at how perceptual transformations of an input result in geometric transformations of the representations in our network. First, we’ll look at hue transformations, which present as 2D rotations in our network. In Figure 4a, we show the network representations for an input image (top-left) and a hue-shifted version of that image (bottom-left). The hue of the input image is shifted by 90◦, and the network representations are cyclically shifted (i.e., rotated) to the left by one position. Next, we’ll look at saturation transformations, which present as 1D translations in our network. In Figure 4b, we show the network representations for an input image (top-left) and a saturation-shifted version of that image (bottom-left). The saturation of the input image is shifted by 10%, and the network representations are shifted (i.e., translated) to the right by one position. Finally, looking at luminance transformations, Figure 4c shows the network representations for an input image (top-left) and a luminance-shifted version of that image (bottom-left). The luminance of the input image is shifted by 10%, and the network representations are shifted (i.e., translated) to the right by one position.
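To make the hue case concrete, here is a minimal hue group convolution over a discretized hue axis, together with a numerical check that a cyclic hue shift of the input produces the same cyclic shift of the feature maps. The class name, shapes, and initialization are our own illustration rather than the paper's implementation:

```python
# A minimal hue group convolution: the filter bank is rolled along the hue axis
# for each output rotation, which makes the layer equivariant to cyclic shifts
# of the hue dimension. Illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HueGroupConv(nn.Module):
    """Input: (B, n_rot, C_in, H, W); output: (B, n_rot, C_out, H, W)."""
    def __init__(self, in_ch, out_ch, n_rot=4, k=3):
        super().__init__()
        self.n_rot, self.pad = n_rot, k // 2
        self.weight = nn.Parameter(0.1 * torch.randn(out_ch, n_rot, in_ch, k, k))

    def forward(self, x):
        outputs = []
        for g in range(self.n_rot):
            # roll the filters along their hue axis, then fold (hue, channel)
            # into ordinary input channels for a spatial convolution
            w_g = torch.roll(self.weight, shifts=g, dims=1).flatten(1, 2)
            outputs.append(F.conv2d(x.flatten(1, 2), w_g, padding=self.pad))
        return torch.stack(outputs, dim=1)

# Hue-shifting the input (a roll along the hue axis) and then convolving gives
# the same result as convolving first and then rolling the feature maps.
conv = HueGroupConv(in_ch=3, out_ch=8, n_rot=4)
x = torch.randn(2, 4, 3, 16, 16)
lhs = conv(torch.roll(x, shifts=-1, dims=1))
rhs = torch.roll(conv(x), shifts=-1, dims=1)
print(torch.allclose(lhs, rhs, atol=1e-5))  # True
```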

Looking closely at the representations in Figure 4, we can see that the representations associated with saturation and luminance transformations aren’t perfectly shifted. Differences between the representations of the original and transformed images are easiest to spot at the boundary. Check out Figure 4c: the last feature map of the original image representation should look like the second-to-last feature map of the transformed image representation, but it doesn’t. Why is that? Unlike hue transformations, where the geometry we selected is a perfect match, the geometry we selected for saturation and luminance transformations is only an approximation, so we don’t have perfect equivariance. As it turns out, there isn’t a way to represent the geometry of saturation and luminance transformations that gives perfect equivariance; they simply don’t have the right structure. That’s okay, though: our approximation gives good results in practice.
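One way to see the obstruction (our illustration): saturation and luminance are confined to a bounded range, so a shift near the boundary has to be clipped and cannot be undone, whereas a hue shift simply wraps around the circle:

```python
# Saturation lives in [0, 1], so a shift near the boundary is clipped and the
# original value is lost; a hue shift wraps around the circle and is invertible.
s = 0.95
s_up = min(s + 0.10, 1.0)        # shift saturation up by 10% (clipped at 1.0)
s_back = max(s_up - 0.10, 0.0)   # shift back down by 10%
print(s, s_up, s_back)           # 0.95 1.0 0.9 -- the original value is not recovered

h = 0.95
h_up = (h + 0.10) % 1.0          # hue wraps around the circle: ~0.05
h_back = (h_up - 0.10) % 1.0     # shifting back recovers the original: ~0.95
print(h, h_up, h_back)
```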


Figure 3: Color equivariant network. Our color equivariant network is equivariant to hue, saturation, and luminance transformations. (a) A hue transformation of the input (left) results in a rotation of the feature maps (right). (b) Input images are lifted to the HSL space (only hue and saturation are visualized here) before being passed to a neural network that is equivariant to the HSL geometric transformations.

Figure 4: Equivariant feature maps. We show the (a) hue-, (b) saturation-, and (c) luminance-equivariance of our model. Corresponding feature maps are highlighted with a blue border.

Now What?

In the introduction, we argued for transformation equivariant neural networks by pointing to their improved generalization performance over conventional neural networks in the presence of a particular transformation class. The obvious question is: does our color equivariant neural network outperform conventional neural networks in the presence of color variation?

Network    A/A (%)          A/B (%)          Params
Z2CNN      98.46 ± 0.10     42.62 ± 22.06    22,130
Hue-3*     98.21 ± 0.25     98.19 ± 0.29     22,658
Table 1: Hue-shifted MNIST digit classification. Classification accuracy on the hue-shifted MNIST dataset is reported. Our model (Hue-3) achieves significantly improved generalization performance over the conventional CNN model (Z2CNN).

We can answer this question by comparing the performance of our color equivariant network and a conventional network on the hue-shifted MNIST dataset. This dataset is partitioned into two sets, set A and set B. The important thing here is that the colors of the digits in set A are different from the colors of the digits in set B; this is what allows us to test color generalization performance (recall Figure 1). When we train and test on the same colors (experiment A/A in Table 1), both our color equivariant network and the conventional network perform really well. However, when we train on one set of colors and test on the other (experiment A/B in Table 1), the performance of our color equivariant network stays the same, while the performance of the conventional network drops significantly (from 98.46% accuracy to 42.62% accuracy).

Why does this work so well? By enforcing color equivariance, our network organizes color information coherently, even for colors it hasn’t seen before; this is not true of the conventional network. We can see this by looking at the representations used for classification in each network (see Figure 5). The representations learned with our network (right) transform minimally in response to hue shifts and cluster by digit. In contrast, the representations learned by the conventional network (left) transform drastically in response to hue shifts, resulting in regions of the feature space where representations of different digits overlap significantly.

Why Does This Matter?

In the last section, we presented an example (hue-shifted MNIST digit classification) where our color equivariant network outperformed conventional networks in the presence of color variation. But outside of this toy example, is color equivariance really useful? Definitely! A good representation of color is important for answering a wide variety of real-world questions, including: Is this fruit ripe? What kind of bird is this? Are these cells cancerous?

The lymph nodes are often the first place breast cancer will spread, making classification of breast cancer in the lymph nodes one of the most important ways to predict the stage of cancer and determine the course of treatment. The Camelyon17 dataset consists of images of human tissue collected from five different hospitals. Variation in how a hospital collects and processes tissue samples can result in significant variation in tissue images. In Figure 6, we show tissue images from each hospital represented in the dataset, and the saturation statistics of images from each hospital. The difference in saturation statistics makes classification with conventional networks challenging (see Table 2). In contrast, our color equivariant network performs well, illustrating the importance of a good representation of color for real-world data.


Figure 5: Hue-shift MNIST feature map visualization. We compare the feature map trajectories of MNIST digits as their hue is varied from 1◦ to 360◦. The color of the trajectory corresponds to the class label. (a) t-SNE projection of hue shifted feature map trajectories in the Z2CNN baseline model. As the hue of the input changes, the location of the digit in the feature space changes significantly. (b) t-SNE projection of hue shifted feature map trajectories in our hue-equivariant CNN. In contrast to the Z2CNN baseline, the location of the digit in the feature space changes minimally.

Figure 6: Camelyon-17 dataset saturation statistics. Example images and saturation statistics by hospital.
Network     Accuracy (%)     Params
ResNet50    71.09 ± 7.58     23.5M
Sat-3*      83.92 ± 2.68     23.3M
Table 2: Camelyon-17 classification. Classification accuracy on the Camelyon-17 dataset is reported. Our model (Sat-3) achieves significantly improved generalization performance over the conventional CNN model (ResNet50).

This work was published as a conference paper at ICLR 2025.

1. RGB cube: RGB_color_solid_cube.png by SharkD under a CC-BY-SA 4.0 license. Retrieved from Wikimedia Commons.
2. HSL cylinder: HSL_color_solid_cylinder.png by SharkD under a CC-BY-SA 3.0 license. Retrieved from Wikimedia Commons.
