I've been playing around with CNNs for about a year now, taking open-source projects and trying to run the inference codes and doing some training. Despite this experience,
I never really understood what was happening in the convolutional layers.
For example, what happens when I change the kernel size, or stride length, or add additional padding to my convolutional layers?
In order to truly master the art of deep learning, I felt that I had to go back to basics and analyse the convolution operation in isolation.
This blog post documents my attempt at understanding the convolution operation a little more.
For this experiment, I will be playing around with the Conv2d operation from PyTorch.
I will be changing the following parameters and observing the difference each one makes to the convolution output for a sample image (shown below):
To ensure that I get consistent results, I initialize the kernel to a constant value of 0.1, with zero bias.
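This setup can be sketched as follows (a minimal sketch, assuming a single-channel input; the kernel size of 3 here is just a placeholder that changes in the experiments):

```python
import torch
import torch.nn as nn

# Single-channel Conv2d; kernel_size=3 is a placeholder value
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

# Fix the weights to a constant 0.1 and zero the bias so results are reproducible
with torch.no_grad():
    conv.weight.fill_(0.1)
    conv.bias.zero_()
```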
For the first experiment, I ran the test image through the Conv2d operation using kernel sizes of 1 to 10. I observed that as the kernel size increased, the output image became more blurry, as can be seen in the GIF below showing how the output changes with increasing kernel size.
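The sweep itself can be reproduced with a loop like the one below (a sketch; the 64×64 random tensor stands in for the actual test image):

```python
import torch
import torch.nn as nn

image = torch.rand(1, 1, 64, 64)  # stand-in for the sample image, shape (N, C, H, W)

outputs = []
for k in range(1, 11):
    conv = nn.Conv2d(1, 1, kernel_size=k)
    with torch.no_grad():
        conv.weight.fill_(0.1)  # constant kernel for reproducibility
        conv.bias.zero_()
        outputs.append(conv(image))

# With no padding, each output loses k - 1 pixels along each dimension
```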
This makes sense if we look more closely at what the 2D convolution operation does.
Essentially, convolution consists of six steps:
For an input image (or array, or matrix) of X columns by Y rows, the kernel starts at the top-left corner of the input.
In this position, each element of the kernel is multiplied by the input element it overlaps. For a kernel of size 2 (2 rows and 2 columns), these would be the top-left, top-right, bottom-left and bottom-right elements.
The product of each element-wise multiplication operation is added together.
The kernel takes a stride (or a step) along the row, depending on the stride length. If the stride length is 2, the kernel moves two pixels at a time.
The entire process is repeated until the kernel reaches the end of the row; it then moves down to the start of the next row, and so on.
After the kernel reaches the end of the input, the summation outputs are arranged in the order they were produced, row by row according to where in the input they came from, to form the final output.
For a visual illustration, refer to the GIF below.
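The six steps above can also be sketched in code. Here is a naive NumPy version (note that Conv2d actually computes a cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Naive 2D convolution (cross-correlation, as in PyTorch's Conv2d), no padding."""
    ky, kx = kernel.shape
    iy, ix = image.shape
    # Number of positions the kernel can occupy along each dimension
    oy = (iy - ky) // stride + 1
    ox = (ix - kx) // stride + 1
    out = np.zeros((oy, ox))
    for r in range(oy):
        for c in range(ox):
            # Element-wise multiply the kernel with the patch it covers, then sum
            patch = image[r * stride:r * stride + ky, c * stride:c * stride + kx]
            out[r, c] = np.sum(patch * kernel)
    return out
```

For example, a 4×4 input with a 2×2 kernel and stride 1 produces a 3×3 output.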
What this means is that each time the kernel goes over a set of pixels, it aggregates the total value of all the pixels in its area of influence.
The larger the kernel, the more the fine-grained details in an image are absorbed into the aggregate, whereas a small kernel will retain each pixel's
relative strength over neighbouring pixels, thus preserving the details of the image.
As such, we can conclude that a small kernel size is useful for extracting detailed features, while a large kernel size is good for getting
the general structure of an object.
On the application side, the visualised results above also show us how convolutions using a larger kernel size present a use case for blurring effects on images.
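To see the blurring numerically, consider a 1-D slice of an image containing a sharp edge, filtered with the constant-0.1 kernel from the experiments (a sketch with made-up pixel values):

```python
import numpy as np

# A 1-D slice with a sharp edge: dark pixels followed by bright pixels
row = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

def box_filter(x, k):
    # Slide a constant kernel of size k (each element 0.1) along x
    return np.array([0.1 * x[i:i + k].sum() for i in range(len(x) - k + 1)])

small = box_filter(row, 2)  # the edge stays fairly abrupt
large = box_filter(row, 6)  # the edge is smeared over many output pixels
```

With the size-2 kernel, the transition from dark to bright takes a single intermediate value; with the size-6 kernel, every output value is intermediate, which is exactly the blurring seen in the GIF.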
For the second experiment, I kept the kernel size constant at 1. This time, I wanted to test the effect of changing the stride length on the convolution result.
For a stride length of 1 to 10, the convolution result is as shown in the GIF below.
From the GIF, we can see how increasing the stride length causes a severe drop in image resolution. It is also interesting to note that the image details become patchy
rather than smooth, unlike when the kernel size changes.
Similar to increasing the kernel size, increasing the stride length causes a loss of information from the input image. This is because the kernel skips over stride − 1 pixels at every step (Step 4 of the convolution operation summary above). With a kernel size of only 1, as used in my experiments, the information in the skipped pixels is lost entirely from the final output.
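The resolution drop can be quantified with the standard no-padding output-size formula, floor((input − kernel) / stride) + 1. For a hypothetical 224-pixel-wide input with kernel size 1:

```python
def out_size(in_size, kernel_size, stride):
    # Number of positions the kernel can occupy along one dimension (no padding)
    return (in_size - kernel_size) // stride + 1

for s in range(1, 11):
    print(f"stride {s}: {out_size(224, 1, s)} pixels")
# Stride 1 keeps all 224 pixels; stride 10 leaves only 23
```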