I've been playing around with CNNs for about a year now, taking open-source projects and trying to run the inference codes and doing some training. Despite this experience,
I never really understood what was happening in the convolutional layers.
For example, what happens when I change the kernel size, or stride length, or add additional padding to my convolutional layers?
In order to truly master the art of deep learning, I felt that I had to go back to basics and analyse the convolution operation in isolation.
This blog post documents my attempt at understanding the convolution operation a little more.
For this experiment, I will be playing around with the Conv2d operation from PyTorch.
I will be changing the following parameters and observing the difference each one makes to the convolution output for a sample image (shown below):
To ensure that I get consistent results, I initialize the kernel to a constant value of 0.1, with zero bias.
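This setup can be sketched as follows (a minimal sketch, assuming a single-channel input; the kernel size of 3 here is just a placeholder that changes in the experiments):

```python
import torch
import torch.nn as nn

# Single-channel Conv2d; kernel_size=3 is a placeholder value
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

# Fix the weights to a constant 0.1 and zero the bias so results are reproducible
with torch.no_grad():
    conv.weight.fill_(0.1)
    conv.bias.zero_()
```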
For the first experiment, I ran the test image through the Conv2d operation using kernel sizes of 1 to 10. I observed that as the kernel size increased, the output image became more blurry, as can be seen in the GIF below showing how the output changes with increasing kernel size.
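The sweep itself can be reproduced with a loop like the one below (a sketch; the 64×64 random tensor stands in for the actual test image):

```python
import torch
import torch.nn as nn

image = torch.rand(1, 1, 64, 64)  # stand-in for the sample image, shape (N, C, H, W)

outputs = []
for k in range(1, 11):
    conv = nn.Conv2d(1, 1, kernel_size=k)
    with torch.no_grad():
        conv.weight.fill_(0.1)  # constant kernel for reproducibility
        conv.bias.zero_()
        outputs.append(conv(image))

# With no padding, each output loses k - 1 pixels along each dimension
```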
This makes sense if we look more closely at what the 2D convolution operation does.
Essentially, convolution consists of six steps:
For an input image (or array, or matrix) of X columns by Y rows, the kernel starts at the top-left corner of the input.
In this position, each element of the kernel is multiplied by the input element it overlaps. For a kernel of size 2 (2 rows and 2 columns), these would be the top-left, top-right, bottom-left and bottom-right elements.
The product of each element-wise multiplication operation is added together.
The kernel takes a stride (or a step) along the row, depending on the stride length. If the stride length is 2, the kernel moves two pixels at a time.
The entire process is repeated until the kernel reaches the end of the row; it then moves down to the start of the next row, and so on.
After the kernel reaches the end of the input, the summation outputs are arranged in the order they were produced, row by row according to where in the input they came from, to form the final output.
For a visual illustration, refer to the GIF below.
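The six steps above can also be sketched in code. Here is a naive NumPy version (note that Conv2d actually computes a cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Naive 2D convolution (cross-correlation, as in PyTorch's Conv2d), no padding."""
    ky, kx = kernel.shape
    iy, ix = image.shape
    # Number of positions the kernel can occupy along each dimension
    oy = (iy - ky) // stride + 1
    ox = (ix - kx) // stride + 1
    out = np.zeros((oy, ox))
    for r in range(oy):
        for c in range(ox):
            # Element-wise multiply the kernel with the patch it covers, then sum
            patch = image[r * stride:r * stride + ky, c * stride:c * stride + kx]
            out[r, c] = np.sum(patch * kernel)
    return out
```

For example, a 4×4 input with a 2×2 kernel and stride 1 produces a 3×3 output.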
What this means is that each time the kernel goes over a set of pixels, it aggregates the total value of all the pixels in its area of influence.
The larger the kernel, the more the fine-grained details in an image are absorbed into the aggregate, whereas a small kernel will retain each pixel's
relative strength over neighbouring pixels, thus preserving the details of the image.
As such, we can conclude that a small kernel size is useful for extracting detailed features, while a large kernel size is good for getting
the general structure of an object.
On the application side, the visualised results above also show us how convolutions using a larger kernel size present a use case for blurring effects on images.
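To see the blurring numerically, consider a 1-D slice of an image containing a sharp edge, filtered with the constant-0.1 kernel from the experiments (a sketch with made-up pixel values):

```python
import numpy as np

# A 1-D slice with a sharp edge: dark pixels followed by bright pixels
row = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

def box_filter(x, k):
    # Slide a constant kernel of size k (each element 0.1) along x
    return np.array([0.1 * x[i:i + k].sum() for i in range(len(x) - k + 1)])

small = box_filter(row, 2)  # the edge stays fairly abrupt
large = box_filter(row, 6)  # the edge is smeared over many output pixels
```

With the size-2 kernel, the transition from dark to bright takes a single intermediate value; with the size-6 kernel, every output value is intermediate, which is exactly the blurring seen in the GIF.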
For the second experiment, I kept the kernel size constant at 1. This time, I wanted to test the effect of changing the stride length on the convolution result.
For a stride length of 1 to 10, the convolution result is as shown in the GIF below.
From the GIF, we can see how increasing the stride length causes a severe drop in image resolution. It is also interesting to note that the image details become patchy
rather than smooth, unlike when the kernel size changes.
Similar to increasing the kernel size, increasing the stride length causes a loss of information from the input image. This is because the kernel skips over stride − 1 pixels at every step (Step 4 of the convolution operation summary above). With a kernel size of only 1, as used in my experiments, the information in the skipped pixels is lost entirely from the final output.
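The resolution drop can be quantified with the standard no-padding output-size formula, floor((input − kernel) / stride) + 1. For a hypothetical 224-pixel-wide input with kernel size 1:

```python
def out_size(in_size, kernel_size, stride):
    # Number of positions the kernel can occupy along one dimension (no padding)
    return (in_size - kernel_size) // stride + 1

for s in range(1, 11):
    print(f"stride {s}: {out_size(224, 1, s)} pixels")
# Stride 1 keeps all 224 pixels; stride 10 leaves only 23
```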