Introduction to deep learning in computer vision

Basic architecture

CNN

Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size.

We assume that the size of the input image is nn, and the size of the filter is ff (note that f is generally an odd number). The size of the output image after convolution is (n-f+1)* (n-f+1).During the convolution process, padding is sometimes necessary to avoid information loss. Additionally, adjusting the stride allows for compression of some information.If we want to perform convolution on a three-channel RGB image, the corresponding filter group would also have three channels. The process involves convolving each individual channel with its corresponding filter, summing up the results, and then adding the sums of the three channels together. The resulting sum of the 27 multiplications is considered as one pixel value of the output image. The filters for different channels can be different. When the input has specific height, width, and channel dimensions, the filters can have different height and width, but the number of channels must match the input.Pooling layers are commonly included in many CNNs. The purpose of pooling layers is to reduce the size of the model, improve computational speed, and simultaneously decrease noise to enhance the robustness of the extracted features.

Important networks in the history of computer vision

LeNet-5

LeNet-5, developed by Yann LeCun et al. in 1998, was one of the first successful convolutional neural networks (CNNs) for handwritten digit recognition. It laid the foundation for modern CNN architectures and demonstrated the power of deep learning in computer vision tasks. “Gradient-Based Learning Applied to Document Recognition” by Yann LeCun et al. (1998). LeNet’s network architecture has seven layers: convolutional layer (Convolutions, C1), pooling layer (Subsampling, S2), convolutional layer (C3), pooling layer (S4), fully connected convolutional layer ( C5), fully connected layer (F6), Gaussian connected layer (output).The input layer is a 28x28 one-dimensional image, and the Filter size is 5x5. The output channels of the first Filter and the second Filter are 6 and 16 respectively, and both use Sigmoid as the activation function.
The window of the pooling layer is 2x2, the stride is 2, and the sampling is performed using average pooling. The number of neurons in the last fully connected layer is 120 and 84, respectively.The last output layer is the Gaussian connection layer, which uses the RBF function (radial Euclidean distance function) to calculate the Euclidean distance between the input vector and the parameter vector.

AlexNet

AlexNet, introduced by Alex Krizhevsky et al. in 2012, was a breakthrough CNN architecture that won the ImageNet competition and popularized deep learning in computer vision. It demonstrated the effectiveness of deep CNNs for image classification tasks and paved the way for subsequent advancements.”ImageNet Classification with Deep Convolutional Neural Networks” by Alex Krizhevsky et al. (2012).AlexNet’s architecture has eight layers, using a total of five convolutional layers and three fully connected layers, which is deeper than the LeNet model.The first to fifth layers are convolutional layers, where the first, second, and fifth convolutional layers are followed by pooling layers, and Maxpooling with a size of 3x3 and a stride of 2 is used.The sixth to eighth layers are fully connected layers. Changing the Sigmoid used by LeNet to ReLU can avoid the problem of vanishing gradient due to too deep neural network layers or too small gradients.

VGGNet

The VGGNet, proposed by Karen Simonyan and Andrew Zisserman in 2014, is known for its simplicity and depth. It consisted of deep networks with stacked 3x3 convolutional layers, showing that increasing network depth led to improved performance on image classification tasks.”Very Deep Convolutional Networks for Large-Scale Image Recognition” by Karen Simonyan and Andrew Zisserman (2014).Compared with AlexNet, VGGNet adopts a deeper network. It is characterized by repeated use of the same set of basic modules, and uses small convolution kernels instead of medium and large convolution kernels in AlexNet. Its architecture consists of n VGG Blocks and 3 full connections composed of layers.The structure of VGG Block is composed of 3x3 convolutional layers (kernel size=3x3, stride=1, padding=”same”) of different numbers (the number is hyperparameters), and 2x2 Maxpooling (pool size=2, stride=2).VGGNet has many different structures, such as VGG11, VGG13, VGG16, VGG19, the difference lies in the number of layers of the network (the number of convolutional layers and the number of fully connected layers). The common VGGNet refers to VGG16.

Network in Network

“Network in Network” (NiN) refers to a neural network architecture proposed by Lin et al. in their paper titled “Network In Network” published in 2014. NiN is designed to enhance the expressive power of deep neural networks by incorporating micro neural networks called “MLPs (Multi-Layer Perceptrons)” or “1x1 Convolutions” within the network structure.

The key idea behind NiN is to replace traditional convolutional layers with what they call “MLP Convolutional Layers” or “1x1 Convolutional Layers.” These layers consist of a series of fully connected layers (MLPs) applied at every pixel location of the input. The purpose is to capture complex local feature interactions and enable more non-linear transformations.By using 1x1 convolutions, NiN can model non-linear relationships within the channels of the input feature map. This allows for richer and more powerful representations compared to standard convolutional layers.
The 1x1 convolutional layer not only integrates the information of different channels at the same position, but also can reduce or increase the dimension of the channel.

GoogLeNet (Inception-v1)

GoogLeNet, presented by Christian Szegedy et al. in 2015, introduced the Inception module and demonstrated the importance of multi-scale feature extraction. It achieved high accuracy while maintaining computational efficiency, inspiring subsequent Inception versions and influencing network designs.”Going Deeper with Convolutions” by Christian Szegedy et al. (2015).
GoogLeNet was designed to address the challenges of deep neural networks, such as computational efficiency and overfitting, while maintaining high accuracy in image classification tasks. It introduced several novel concepts and architectural innovations that made it stand out from previous CNN architectures at the time.

The key feature of GoogLeNet is the Inception module, which utilizes parallel convolutional filters of different sizes (1x1, 3x3, 5x5) to capture features at various scales. This allows the network to learn and represent both local and global features effectively. Additionally, it incorporates 1x1 convolutions for dimensionality reduction and introduces a technique called “bottleneck” layers to reduce the computational complexity.

Inception

In the context of computer vision, “inception” refers to the Inception module or the Inception architecture used in deep convolutional neural networks (CNNs). The Inception module was introduced in the GoogLeNet architecture (also known as Inception-v1) as a key component for efficient and effective feature extraction.The Inception module aims to capture multi-scale features by employing multiple parallel convolutional filters of different sizes within the same layer. By using a combination of 1x1, 3x3, and 5x5 convolutional filters, the Inception module allows the network to learn and extract features at various spatial scales. The Inception module extracts different features through convolution of three different sizes and 3x3 Maxpooling, and then concatenates these four results together with the channel axis. This way of increasing the width of the network can capture more features and details of the picture.But if the sizes of these four results are different, both the convolutional layer and the pooling layer use padding=”same” and stride=1 to ensure the size of the input feature map.

ResNet

ResNet, developed by Kaiming He et al. in 2015, introduced the concept of residual learning. It utilized skip connections or shortcuts to address the vanishing gradient problem and enabled training of extremely deep networks, leading to significant performance gains in image classification and other tasks.”Deep Residual Learning for Image Recognition” by Kaiming He et al. (2015).

DenseNet

DenseNet, introduced by Gao Huang et al. in 2016, focused on dense connectivity patterns between layers. It aimed to alleviate the vanishing gradient problem, promote feature reuse, and encourage better gradient flow. DenseNet achieved competitive results while reducing the number of parameters compared to other architectures. “Densely Connected Convolutional Networks” by Gao Huang et al. (2016).’

ResNeXt

ResNeXt is a convolutional neural network (CNN) architecture that builds upon the concepts introduced by the ResNet (Residual Network) model. ResNeXt was proposed by Xie et al. in their paper titled “Aggregated Residual Transformations for Deep Neural Networks” in 2017.

The main idea behind ResNeXt is to leverage the concept of “cardinality” to improve the representational power of the network. Cardinality refers to the number of independent pathways or branches within a block of the network. In ResNeXt, instead of using a single pathway in each block, multiple parallel pathways are employed.

references

https://juejin.cn/post/7104845694225088525
https://www.showmeai.tech/article-detail/221
https://medium.com/ching-i/%E5%8D%B7%E7%A9%8D%E7%A5%9E%E7%B6%93%E7%B6%B2%E7%B5%A1-cnn-%E7%B6%93%E5%85%B8%E6%A8%A1%E5%9E%8B-lenet-alexnet-vgg-nin-with-pytorch-code-84462d6cf60c

Introduction to deep learning in computer vision

http://yoursite.com/2023/06/09/image-task/

Author

s-serenity

Posted on

2023-06-09

Updated on

2023-07-19

Licensed under

You need to set install_url to use ShareThis. Please set it in _config.yml.
You forgot to set the business or currency_code for Paypal. Please set it in _config.yml.

Comments

You forgot to set the shortname for Disqus. Please set it in _config.yml.