Convolutional Neural Networks: Processing Images with Spatial Awareness

Convolutional Neural Networks (CNNs): A Deep Dive

Welcome back to our lecture series! Today, we're diving into the world of Convolutional Neural Networks (CNNs), the cornerstone of image processing models we'll be using moving forward. Remember our discussion on backpropagation? That powerful algorithm allowed us to compute gradients in complex computational graphs, making gradient calculations manageable even in intricate models.

The key takeaway from our backpropagation discussion was the local viewpoint. To integrate a function into a computational graph, we only needed to implement a small, local operator that knew how to compute outputs during the forward pass and gradients during the backward pass. This modular approach allows us to easily plug in new function types into our computational graphs.

The Problem with Traditional Classifiers

We've explored linear classifiers and fully connected neural networks. While powerful, both share a critical flaw: they fail to respect the 2D spatial structure of input images. These classifiers require us to flatten our images into long vectors, effectively destroying the spatial relationships between pixels. This is a significant problem when dealing with image data.

The CNN Solution: Respecting Spatial Structure

The solution is relatively straightforward now that we have all this computational graph machinery: we introduce new operators into our neural networks that can operate directly on spatially structured data. This leads us to the core of CNNs: convolution layers, pooling layers, and normalization layers.

From Fully Connected to Convolutional Networks

Let's recap the basic building blocks of fully connected networks: fully connected layers (matrix multiplication) and nonlinear activation functions (like ReLU). To transition to CNNs, we introduce three key operations:

  • Convolution Layers
  • Pooling Layers
  • Normalization Layers

Convolution Layers: Respecting Spatial Structure

The convolution layer is designed to overcome this loss of spatial structure. Where fully connected layers destroy spatial information, convolutional layers have learnable weights that respect the 2D spatial structure of our inputs.

Consider the fully connected layer: during the forward pass, it receives a vector (perhaps a flattened image) and multiplies it with a weight matrix to produce an output vector. The convolutional layer retains this flavor but operates on a three-dimensional tensor, a volume that preserves spatial information.

For example, a CIFAR-10 image, at the first layer of a CNN, would be a 3x32x32 tensor, where 3 represents the color channels (red, green, blue), and 32x32 represents the height and width. The **weight matrix in a convolutional layer, also called a filter or kernel**, has a three-dimensional structure as well. A common size is 3x5x5. The depth dimension of the input tensor **must** match the depth dimension of the convolutional filter.

The convolutional filter slides over all spatial positions in the input image to compute another three-dimensional representation. The convolution extends over the full depth of the input tensor: the filter aligns itself with a small 3x5x5 chunk of the input tensor and computes a dot product between the filter and the corresponding elements of the input. A bias term is added as well.

This process results in a single scalar number, indicating how well the input tensor matches the filter at that position. By sliding the filter across the input tensor, we generate a grid of such numbers. For a 32x32 input and a 5x5 filter, this results in a 28x28 grid. This grid is called an activation map.
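To make this concrete, here is a minimal NumPy sketch of a single filter sliding over a CIFAR-10-sized input. It is not from the lecture; the variable names and random values are purely illustrative:

```python
import numpy as np

# A single 3x5x5 filter sliding over a 3x32x32 input (no padding, stride 1).
x = np.random.randn(3, 32, 32)   # input tensor: channels x height x width
w = np.random.randn(3, 5, 5)     # one filter; its depth must match the input depth
b = 0.1                          # bias term

H_out = W_out = 32 - 5 + 1       # 28: size of the resulting activation map
activation = np.empty((H_out, W_out))
for i in range(H_out):
    for j in range(W_out):
        chunk = x[:, i:i+5, j:j+5]                # the 3x5x5 chunk under the filter
        activation[i, j] = np.sum(chunk * w) + b  # dot product + bias -> one scalar

print(activation.shape)  # (28, 28)
```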

A convolutional layer typically convolves the input with a set of different filters, each learning different features. Convolving our image with six convolutional filters produces six activation maps (each a 1x28x28 output plane). These activation maps show how strongly each position in the input responds to each filter.

All filters can be assembled into a single four-dimensional tensor (e.g., 6x3x5x5, where 6 is the number of filters). The activation maps are combined into a single three-dimensional tensor (e.g., 6x28x28). The convolution layer thus preserves the spatial structure.

The output can be considered as a set of feature maps, or as a grid of feature vectors. Both interpretations are useful. The layer often operates on batches of images, represented by a four-dimensional tensor of shape N x C_in x H x W.
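These shapes can be checked directly in code; a quick sketch, assuming PyTorch's `nn.Conv2d` and matching the 6x3x5x5 example above:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
print(conv.weight.shape)       # torch.Size([6, 3, 5, 5]) -- the 4D weight tensor

x = torch.randn(2, 3, 32, 32)  # a batch of N=2 images, shape N x C_in x H x W
y = conv(x)
print(y.shape)                 # torch.Size([2, 6, 28, 28]) -- six 28x28 activation maps
```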

Stacking multiple convolutional layers creates a neural network with convolutional layers as the core elements. However, each convolutional layer performs a linear operation, and a stack of linear operations collapses into a single equivalent linear operation. Therefore, **nonlinear activation functions, such as ReLU, must be placed between our convolution operations**; otherwise the stacked network is no more expressive than a single layer.
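A minimal NumPy sketch of this collapse, shown here for the fully connected case (the same argument applies to convolutions, since they are also linear):

```python
import numpy as np

x = np.random.randn(10)
W1 = np.random.randn(20, 10)
W2 = np.random.randn(5, 20)

y_stacked = W2 @ (W1 @ x)   # two stacked linear layers, no nonlinearity...
y_single = (W2 @ W1) @ x    # ...equal one linear layer with weights W2 @ W1
print(np.allclose(y_stacked, y_single))  # True -- no extra expressive power
```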

The learned weights can be visually inspected. A convolutional network learns a set of templates that are small and local in size, such as edges of different orientations and colors.

Spatial Dimensions of Convolution

Consider a 7x7 input image convolved with a 3x3 filter. The output size is 5x5 (W − K + 1, where W is the input size and K is the kernel size). Every convolution operation therefore shrinks the spatial dimensions of the input tensor, which constrains the depth of the networks we can train. **Padding solves this problem.**
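A quick check of the formula in Python (a sketch, not from the lecture):

```python
def conv_out_size(W, K):
    """Output size for a convolution with no padding and stride 1: W - K + 1."""
    return W - K + 1

print(conv_out_size(7, 3))   # 5  (7x7 input, 3x3 filter)
print(conv_out_size(32, 5))  # 28 (32x32 input, 5x5 filter)
```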

Padding: Preserving Spatial Dimensions

Padding involves adding extra pixels (typically zeros, called zero-padding) around the image borders before convolution. This allows us to control the output size. Now, the output size is W − K + 1 + 2P, where P is the padding value.

A common strategy is *same padding*, where P = (K − 1) / 2, ensuring the output has the same spatial size as the input. The amount of padding must be specified as a hyperparameter.

Receptive Field: The Input Area a Neuron "Sees"

Each spatial position in the output image depends only on a local region of the input image. With a 3x3 convolution, an element in the output tensor depends on a 3x3 region of the input tensor; this region is called the receptive field.

Stacking convolution layers increases the effective receptive field. After two 3x3 convolutions, the receptive field becomes 5x5. After three, it's 7x7. The receptive field grows linearly with the number of convolution layers.
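A tiny sketch of this linear growth; the closed form 1 + L·(K − 1) follows because each stride-1 layer extends the receptive field by K − 1:

```python
def receptive_field(num_layers, K=3):
    """Receptive field after stacking num_layers KxK stride-1 convolutions."""
    return 1 + num_layers * (K - 1)  # each layer adds K - 1

for L in (1, 2, 3):
    print(L, receptive_field(L))  # 1 -> 3, 2 -> 5, 3 -> 7: linear growth
```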

For highresolution images, many convolution layers are needed to achieve large receptive fields. To address this, we introduce another hyperparameter: stride.

Stride: Downsampling During Convolution

Stride determines the step size of the convolutional filter. A stride of 2 means the filter is placed at every other position, resulting in spatial downsampling. The general output size formula is (W − K + 2P) / S + 1, where S is the stride.
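Extending the earlier helper with padding and stride makes the formula easy to sanity-check (again a sketch, not from the lecture):

```python
def conv_out_size(W, K, P=0, S=1):
    """General output size: (W - K + 2P) / S + 1."""
    return (W - K + 2 * P) // S + 1

print(conv_out_size(7, 3))            # 5: no padding, stride 1
print(conv_out_size(7, 3, P=1))       # 7: same padding, P = (K - 1) / 2
print(conv_out_size(7, 3, P=1, S=2))  # 4: stride 2 roughly halves the output
```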

1x1 Convolution

A 1x1 convolution seems odd, but it's a valid operation. It performs a linear transformation independently on each feature vector in a three-dimensional grid. This can be viewed as a fully connected layer operating independently on each feature vector, and it's useful as an adaptor inside a neural network to match up dimensions at different points in the model.
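A sketch of this equivalence, assuming PyTorch; the sizes (64 channels in, 32 out, a 7x7 grid) are arbitrary examples:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 7, 7)               # a 7x7 grid of 64-dim feature vectors
conv1x1 = nn.Conv2d(64, 32, kernel_size=1)
y = conv1x1(x)                             # shape: (1, 32, 7, 7)

# Equivalent view: a linear layer applied independently at every position
linear = nn.Linear(64, 32)
linear.weight.data = conv1x1.weight.data.view(32, 64)
linear.bias.data = conv1x1.bias.data
y2 = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(y, y2, atol=1e-6))    # True
```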

Convolution Layer Recap

  • Input: 3D tensor
  • Hyperparameters: kernel size, number of filters, padding, stride
  • Weight matrix: 4D tensor
  • Output: 3D tensor

Common configurations include 3x3 stride 1 convolutions, 5x5 stride 1 convolutions, and 3x3 stride 2 convolutions for spatial downsampling.

Pooling Layers: ParameterFree Downsampling

Pooling layers downsample without learnable parameters. A pooling layer operates on local receptive fields of a certain spatial size, applying a pooling function (max or average) to collapse each set of input values to a single output value.

Max Pooling: Translation Invariance

A common example is 2x2 max pooling with a stride of 2. The input tensor is divided into 2x2 regions, and the maximum value within each region is selected as the output. This introduces some translation invariance: small shifts within a pooling region can leave the output unchanged.
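A small worked example, assuming PyTorch's `nn.MaxPool2d`:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[ 1.,  2.,  5.,  6.],
                    [ 3.,  4.,  7.,  8.],
                    [ 9., 10., 13., 14.],
                    [11., 12., 15., 16.]]]])  # shape (1, 1, 4, 4)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))  # tensor([[[[ 4.,  8.], [12., 16.]]]]) -- max of each 2x2 region
```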

Classical CNN Design: Combining the Building Blocks

A classical CNN design typically follows a pattern of Conv > ReLU > Pool, repeated several times, followed by fully connected layers. An example is LeNet-5, used for character recognition.
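A LeNet-5-flavored sketch of this pattern in PyTorch, adapted here for 3x32x32 inputs; the exact layer sizes are illustrative, not the original LeNet-5 configuration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),   # 3x32x32 -> 6x28x28
    nn.ReLU(),
    nn.MaxPool2d(2, 2),               # -> 6x14x14
    nn.Conv2d(6, 16, kernel_size=5),  # -> 16x10x10
    nn.ReLU(),
    nn.MaxPool2d(2, 2),               # -> 16x5x5
    nn.Flatten(),                     # -> 400
    nn.Linear(16 * 5 * 5, 10),        # -> 10 class scores
)
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```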

Batch Normalization: Stabilizing Training

Very deep networks are often difficult to get to converge during training. Batch normalization is one tool for overcoming this.

The Idea Behind Batch Normalization

Batch normalization normalizes the outputs of a layer to have zero mean and unit variance. This is done to reduce internal covariate shift, which stabilizes and accelerates optimization. The empirical mean and standard deviation are computed across the batch dimension.

The Batch Normalization Process

During training, we normalize empirically: subtract the batch mean and divide by the batch standard deviation. Learnable scale (gamma) and shift (beta) parameters are then applied, allowing the network to learn whatever means and variances it wants. At test time, the means and standard deviations are not computed empirically; instead, they are set to running exponential averages of the means and standard deviations seen during training.
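A minimal sketch of the training-time computation for a fully connected layer's output, assuming PyTorch tensors (running averages and the test-time path are omitted):

```python
import torch

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # x: (N, D). Statistics are computed empirically over the batch dimension.
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta                 # learnable scale and shift

x = torch.randn(32, 100)
out = batchnorm_train(x, gamma=torch.ones(100), beta=torch.zeros(100))
print(out.mean().item(), out.std().item())  # approximately 0 and 1
```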

Batch Normalization in Convolutional Networks

It is very common to add batch normalization directly after a convolutional layer or fully connected layer, as in the sketch below.
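A typical placement, assuming PyTorch's `nn.BatchNorm2d`; the channel counts are arbitrary:

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),  # normalizes the 16 output channels of the convolution
    nn.ReLU(),
)
```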

Upsides and Downsides

Batch normalization makes networks train much faster and allows you to increase your learning rates. However, it is not well understood theoretically, and it does something different between training and test time, which can be an inconvenience (and a source of bugs) in your code.

Layer Normalization and Instance Normalization

Layer normalization computes statistics across the feature dimension for each example, and instance normalization computes statistics across the spatial dimensions for each example and channel. Batch normalization, by contrast, averages across the batch dimension and the spatial dimensions.
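The difference is easiest to see in which axes the mean is taken over; a quick sketch, assuming a PyTorch tensor of shape N x C x H x W:

```python
import torch

x = torch.randn(8, 16, 32, 32)  # N x C x H x W

batch_mean = x.mean(dim=(0, 2, 3))  # batch norm: one mean per channel (C,)
layer_mean = x.mean(dim=(1, 2, 3))  # layer norm: one mean per example (N,)
inst_mean  = x.mean(dim=(2, 3))     # instance norm: one per example per channel (N, C)
print(batch_mean.shape, layer_mean.shape, inst_mean.shape)
```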

Next Steps: Building Effective CNN Architectures

You now have the building blocks for CNNs. Next lecture, we'll explore best practices for combining these components to build effective architectures.
