Deep Learning for Computer Vision: Giving Computers the Power of Sight

Welcome to Day 2 of MIT's Introduction to Deep Learning! Today, we're diving into the fascinating world of computer vision – teaching computers to 'see' and understand the world around them, much like humans do.

What is Computer Vision?

Vision is a critical human sense, enabling us to interpret emotions, navigate our surroundings, and interact with objects. Our goal is to empower computers with this same ability: to understand the physical world from raw visual inputs. A simple way to define this is: the ability to know what is where by looking.

However, vision goes beyond simply identifying objects. It involves understanding dynamics, movement, and context. For instance, when looking at a street scene, we instantly differentiate between parked cars and moving cars, or pedestrians standing versus walking. Vision is about holistically interpreting an image, considering how things are moving and changing over time.

The Deep Learning Revolution in Computer Vision

Deep learning is revolutionizing computer vision, impacting various fields, including:

  • Robotics: Enabling manipulation and autonomy.
  • Mobile Computing: Powering features like facial detection on your phone.
  • Biology & Healthcare: Assisting with medical image analysis and diagnostics.
  • Autonomous Driving: Guiding self-driving cars.
  • Accessibility: Creating tools for visually impaired individuals.

Computer vision algorithms are becoming increasingly pervasive. Facial detection, for example, is a foundational application with significant impact: systems can locate faces in an image and even interpret subtle emotions.

Self-driving cars represent a classic computer vision challenge. These systems take raw camera imagery as input and, using complex models, interpret the scene to navigate autonomously. One particular project at MIT focused on creating an autonomous vehicle capable of driving on completely new, never-before-seen roads, relying purely on visual input.

How Computers See Images: Representing Visual Data

To a computer, an image is simply an array of numbers. A grayscale image is a two-dimensional array, where each pixel is represented by a single number indicating its gray intensity. A color image becomes a three-dimensional array, with each pixel represented by three numbers corresponding to its red, green, and blue (RGB) values.
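
For instance, here is a minimal NumPy sketch of both representations (the pixel values are illustrative):

```python
import numpy as np

# A 4x4 grayscale image: one intensity per pixel (0 = black, 255 = white).
gray = np.array([
    [  0,  50, 100, 150],
    [ 50, 100, 150, 200],
    [100, 150, 200, 250],
    [150, 200, 250, 255],
], dtype=np.uint8)
print(gray.shape)   # (4, 4) -- height x width

# A color image adds a third dimension: one (R, G, B) triple per pixel.
color = np.zeros((4, 4, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]   # the top-left pixel is pure red
print(color.shape)  # (4, 4, 3) -- height x width x channels
```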

Regression vs. Classification: Two Core Tasks

Two common machine learning tasks are regression and classification.

  • Regression: Outputs a continuous value.
  • Classification: Outputs a class label (e.g., identifying a US president from an image).

To classify images effectively, models need to understand the unique features that distinguish different classes. This leads to the critical task of feature detection – understanding what makes one class different from another.
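
To make the contrast concrete, here is a minimal PyTorch sketch, assuming a hypothetical 128-dimensional feature vector produced by some upstream network:

```python
import torch
import torch.nn as nn

features = torch.randn(1, 128)      # hypothetical features for one image

# Regression head: a single continuous output (e.g., a steering angle).
regressor = nn.Linear(128, 1)
print(regressor(features))          # one real number

# Classification head: one score (logit) per class, turned into probabilities.
classifier = nn.Linear(128, 10)     # e.g., 10 candidate classes
probs = torch.softmax(classifier(features), dim=1)
print(probs.sum())                  # probabilities sum to 1
```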

Feature Detection: The Key to Image Classification

Image classification can be simplified into two steps: defining what you're detecting (feature definition) and then detecting those features. For example, to detect a face, you might look for features like eyes, nose, and ears.

The challenge lies in the fact that defining these features is a complex, almost recursive problem: to detect an eye, you first need to detect the edges and shapes that compose it. Moreover, faces vary significantly in viewpoint, scale, lighting, and occlusion. This variation is what makes computer vision challenging.

Robust classification algorithms must be invariant to these variations, and this is where deep learning excels. In traditional machine learning, features are defined by hand; in deep learning, they are learned from data.

Neural Networks: Learning Features from Data

With neural networks, instead of manually defining features (e.g., detecting faces by looking for eyes), we show the model many images of faces and allow it to extract the unique characteristics that distinguish faces from other objects (e.g., cars). This process is hierarchical, with each layer of the network adding complexity and expressiveness.

While fully connected networks could theoretically be used, they require flattening the two-dimensional image into a one-dimensional input, destroying valuable spatial information. They are also computationally expensive due to the large number of parameters.
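
Some back-of-the-envelope arithmetic makes the cost concrete (the input and layer sizes below are illustrative assumptions, not figures from the lecture):

```python
# One fully connected layer on a flattened 224x224 RGB image with 1,000 hidden units.
input_size = 224 * 224 * 3            # 150,528 inputs after flattening
hidden_units = 1000
fc_params = input_size * hidden_units + hidden_units
print(f"{fc_params:,}")               # 150,529,000 -- ~150M parameters in one layer

# A convolutional layer reuses one small filter across the whole image:
# 64 filters of size 3x3x3 (plus a bias each) need only 1,792 parameters.
conv_params = 64 * (3 * 3 * 3 + 1)
print(f"{conv_params:,}")             # 1,792
```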

Convolutional Neural Networks (CNNs): Preserving Spatial Information

Instead of flattening images, CNNs leverage the spatial structure inherent in the data. They connect input pixels in patches to neurons, preserving spatial relationships. This is accomplished through a mathematical operation called convolution.

Convolution Explained

A convolution involves sliding a filter (a small matrix of weights) over the image, performing element-wise multiplication between the filter and the corresponding image patch, and then summing the results. This process is repeated across the entire image.
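
Here is a minimal NumPy sketch of that sliding-window operation, applied with a classic hand-crafted vertical-edge filter (the image and kernel are illustrative):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, multiplying element-wise and summing
    at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A classic hand-engineered filter: this 3x3 kernel responds to vertical edges.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]])

# An image that is dark on the left and bright on the right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

print(convolve2d(image, vertical_edge))  # strong response at the edge columns
```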

The choice of filter is crucial. Filters can be hand-engineered (e.g., for edge detection, as above) or, more commonly, learned through training. By learning the filters, CNNs can automatically extract the most relevant features for a given task.

The Three Main Operations in CNNs

  1. Convolutions: Patch-wise processing of images with learned filters.
  2. Nonlinearities: Applied after convolutions to introduce nonlinear expressivity.
  3. Pooling: Downsampling to increase the receptive field (all three appear in the sketch after this list).
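
In a framework such as PyTorch, these three operations compose into a single block; the following is a minimal illustrative sketch, not any particular published architecture:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # 1. convolution
    nn.ReLU(),                                                            # 2. non-linearity
    nn.MaxPool2d(kernel_size=2),                                          # 3. pooling (2x downsample)
)

x = torch.randn(1, 3, 32, 32)   # a batch of one 32x32 RGB image
print(block(x).shape)           # torch.Size([1, 16, 16, 16]) -- half the spatial resolution
```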

Visualizing Learned Filters

During or after training, you can visualize the filters that the CNN has learned. For example, in a face detection network, the initial layers might learn to detect edges, while deeper layers combine these edges to form facial features like eyes, noses, and ears. The network learns these filters from data, without any explicit human definition.
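
One way to try this yourself (a sketch assuming torchvision and matplotlib are installed) is to plot the first-layer convolutional weights of a pretrained network such as ResNet-18; these early filters typically resemble edge and color detectors:

```python
import matplotlib.pyplot as plt
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT)
filters = model.conv1.weight.detach()           # shape: (64, 3, 7, 7)

fig, axes = plt.subplots(4, 8, figsize=(8, 4))  # show the first 32 filters
for ax, f in zip(axes.flat, filters):
    f = (f - f.min()) / (f.max() - f.min())     # rescale to [0, 1] for display
    ax.imshow(f.permute(1, 2, 0))               # channels-last for imshow
    ax.axis("off")
plt.show()
```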

Beyond Classification: CNNs for Various Tasks

CNNs are not limited to classification. The feature extraction capabilities of CNNs can be combined with different downstream tasks, such as:

  • Object Detection: Predicting the location and class of multiple objects in an image.
  • Semantic Segmentation: Classifying each pixel in an image, creating a pixel-wise semantic understanding of the scene.
  • Autonomous Navigation: Learning to control a vehicle based on raw sensor data (e.g., camera images) and map information.

Object Detection: Locating and Identifying Objects

Object detection involves not only classifying an object (e.g., identifying a taxi) but also predicting its location in the image using a bounding box.

A naive approach would be to randomly place boxes over the image and classify each one, which is computationally expensive. A more intelligent approach would be to use heuristics to propose boxes around potentially interesting regions, but that is still slow.

A solution to this is the R-CNN family (Region-based Convolutional Neural Networks). In its faster variants, the model learns region proposals within the same network that performs classification, sharing features across an end-to-end pipeline.
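
For a concrete taste, torchvision ships pretrained models from this family; the sketch below runs a Faster R-CNN on a stand-in image tensor (a real application would load an actual photograph):

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

image = torch.rand(3, 480, 640)          # stand-in for an RGB image scaled to [0, 1]
with torch.no_grad():
    predictions = model([image])[0]      # dict of boxes, labels, and scores

for box, label, score in zip(predictions["boxes"], predictions["labels"], predictions["scores"]):
    if score > 0.8:                      # keep only confident detections
        print(weights.meta["categories"][int(label)], box.tolist(), float(score))
```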

Semantic Segmentation: PixelLevel Classification

Segmentation involves classifying every single pixel in an image: a given pixel might be labeled as part of a cow, a patch of grass, or the sky. The model first uses convolutional layers to downsample the feature map, then uses upsampling (transposed) convolutions to restore the spatial resolution and predict a class for every pixel.
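
A minimal encoder-decoder sketch of this downsample-then-upsample pattern follows (an illustrative toy; real segmentation models add skip connections and many more layers):

```python
import torch
import torch.nn as nn

num_classes = 5   # illustrative number of semantic classes
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 1/2 resolution
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 1/4 resolution
    nn.ReLU(),
    nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),    # back up to 1/2
    nn.ReLU(),
    nn.ConvTranspose2d(16, num_classes, kernel_size=2, stride=2),  # full resolution
)

x = torch.randn(1, 3, 64, 64)
print(model(x).shape)   # torch.Size([1, 5, 64, 64]) -- one score per class per pixel
```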

Autonomous Navigation: Learning to Drive

CNNs can also learn autonomous navigation models from raw camera data and noisy street-view maps, effectively teaching the vehicle how to drive. The model learns which features are useful for controlling the vehicle, making autonomous driving possible even in novel conditions.
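
As a rough sketch of the end-to-end idea, a CNN can map a camera frame directly to a continuous steering command; the architecture below is hypothetical and omits the map input described above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(36, 48, kernel_size=3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),          # collapse the spatial dimensions to 1x1
    nn.Flatten(),
    nn.Linear(48, 1),                 # regression head: one steering angle
)

frame = torch.randn(1, 3, 66, 200)    # a single camera frame (illustrative size)
steering = model(frame)
print(steering.shape)                 # torch.Size([1, 1])
```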

Conclusion: The Pervasive Power of CNNs

The impact of CNNs is wide-reaching, and their power stems from a few fundamental techniques: convolution, feature learning (optimized filters), and up- and down-sampling to modulate spatial resolution. These techniques have transformed computer vision and enabled a wide range of applications.
