opencv

basic functions

reading and writing

cv2.VideoWriter_fourcc('M', 'P', '4', 'V')
cv2.VideoWriter(filename, fourcc, fps, frameSize[, isColor])
cv2.VideoWriter.write(image)
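
A minimal sketch that ties these calls together by copying one video to another; the file names input.mp4 and output.mp4 are placeholders.

import cv2

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# 'mp4v' codec; other codecs (e.g. 'XVID') depend on the platform build of OpenCV.
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
writer = cv2.VideoWriter("output.mp4", fourcc, fps, (width, height))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    writer.write(frame)  # the frame size must match the frameSize passed above

cap.release()
writer.release()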

object tracking

object tracking

Multiple Object Tracking (MOT) is the task of detecting objects of interest in a video, tracking the detected objects across subsequent frames by assigning each a unique ID, and maintaining these IDs as the objects move through successive frames. Generally, multiple object tracking happens in two stages: object detection and object association. Object detection is the process of identifying all potential objects of interest in the current frame using object detectors such as Faster R-CNN or YOLO. Object association is the process of linking objects detected in the current frame with their corresponding objects from previous frames, referred to as tracklets. Object (or instance) association is usually done by predicting each object's location in the current frame from the previous frames' tracklets using a Kalman filter, followed by one-to-one linear assignment, typically with the Hungarian Algorithm, to minimise the total difference between the matched pairs.
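
As a concrete illustration of the association step, here is a minimal sketch assuming axis-aligned boxes in [x1, y1, x2, y2] format and SciPy's linear_sum_assignment as the Hungarian solver; it is not any particular tracker's implementation, and it assumes both input lists are non-empty.

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    # boxes as [x1, y1, x2, y2]
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_tracklet_boxes, detections, iou_threshold=0.3):
    """Match predicted tracklet positions to current-frame detections."""
    cost = np.array([[1.0 - iou(t, d) for d in detections]
                     for t in predicted_tracklet_boxes])
    row_idx, col_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    matches = [(r, c) for r, c in zip(row_idx, col_idx)
               if 1.0 - cost[r, c] >= iou_threshold]
    return matches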

Metrics

MOTP (Multiple Object Tracking Precision)

MOTP (Multiple Object Tracking Precision) expresses how well the exact positions of objects are estimated. It is the total positional error over all matched ground truth-hypothesis pairs across all frames, averaged by the total number of matches made. This metric does not account for object configurations or trajectory quality; it measures localisation accuracy only.

MOTA (Multiple Object Tracking Accuracy)

MOTA (Multiple Object Tracking Accuracy) shows how many errors the tracker has made in terms of misses, false positives, and mismatch (identity switch) errors. It is derived from three error ratios: the ratio of misses, the ratio of false positives, and the ratio of mismatches, computed over all frames.
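
In the CLEAR MOT formulation this is commonly written as:

$MOTA = 1 - \frac{\sum_t (FN_t + FP_t + IDSW_t)}{\sum_t GT_t}$

where $FN_t$, $FP_t$, and $IDSW_t$ are the misses, false positives, and identity switches in frame t, and $GT_t$ is the number of ground-truth objects in frame t.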

IDF1 score (IDF1)

IDF1 score (IDF1) is the ratio of correctly identified detections over the average of ground truth and predicted detections.

Benchmarks

OTB

KITTI

MOT16

Methods(models)

IOU tracker

The Intersection-Over-Union (IOU) tracker uses the IoU between the detector's bounding boxes in two consecutive frames to associate detections across frames, or assigns a new target ID if no match is found.

Simple Online And Realtime Tracking (SORT)

Simple Online And Realtime Tracking (SORT) is a lean implementation of a tracking-by-detection framework. SORT uses the position and size of the bounding boxes for both motion estimation and data association across frames. SORT combines location and motion cues by adopting a Kalman filter to predict the location of the tracklets in the new frame, then computes the IoU between the detection boxes and the predicted boxes as the similarity.
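
A minimal sketch of the constant-velocity Kalman predict/update cycle that SORT-style trackers rely on. The state here is a simplified [cx, cy, w, h, vx, vy] rather than SORT's exact [u, v, s, r, ...] parameterization, and the noise matrices are placeholder values.

import numpy as np

# Constant-velocity transition: cx += vx, cy += vy each frame.
F = np.eye(6)
F[0, 4] = F[1, 5] = 1.0

def predict(state, cov, Q=np.eye(6) * 1e-2):
    state = F @ state
    cov = F @ cov @ F.T + Q
    return state, cov

def update(state, cov, measurement, R=np.eye(4) * 1e-1):
    H = np.zeros((4, 6)); H[:4, :4] = np.eye(4)   # we observe [cx, cy, w, h]
    y = measurement - H @ state                    # innovation
    S = H @ cov @ H.T + R
    K = cov @ H.T @ np.linalg.inv(S)               # Kalman gain
    state = state + K @ y
    cov = (np.eye(6) - K @ H) @ cov
    return state, cov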

DeepSORT

DeepSORT replaces SORT's association metric with a more informed one that combines motion and appearance information; in particular, a "deep appearance" distance metric is added. The core idea is to obtain a vector that represents a given image crop. DeepSORT adopts a stand-alone re-ID model to extract appearance features from the detection boxes. After similarity computation, a matching strategy assigns identities to the objects, using either the Hungarian Algorithm or greedy assignment.

FairMOT

FairMOT is a tracking approach built on top of the anchor-free object detection architecture CenterNet. It has a simple network structure consisting of two homogeneous branches, one for detecting objects and one for extracting re-ID features.

TransMOT

TransMOT is a spatial-temporal graph Transformer for tracking. It arranges the trajectories of all tracked objects as a series of sparse weighted graphs constructed from the spatial relationships of the targets. TransMOT then uses these graphs to build a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial transformer decoder layer to model the spatial-temporal relationships of the objects.

ByteTrack

BYTE is an effective association method that utilises every detection box, from high scores to low, in the matching process. BYTE is built on the premise that the similarity with tracklets provides a strong cue for distinguishing objects from background among low-score detection boxes. BYTE first matches the high-score detection boxes to the tracklets based on motion similarity, using a Kalman filter to predict the location of the tracklets in the new frame; the motion similarity is computed as the IoU between the predicted box and the detection box. It then performs a second matching between the unmatched tracklets and the low-score detection boxes.

The primary innovation of ByteTrack is keeping the non-background, low-confidence detection boxes that are typically discarded after the initial filtering of detections, and using these low-score boxes for a secondary association step. Occluded objects typically produce detection boxes with confidence scores below the threshold, but these boxes still contain some information about the objects, which makes their scores higher than those of purely background boxes. Hence, these low-confidence boxes are still meaningful to keep track of during the association stage.
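
A rough sketch of the two-stage BYTE association logic described above. It assumes detection objects with a .score attribute and a hypothetical match_by_iou helper (an IoU-based Hungarian matcher such as the one sketched in the object-association paragraph); the thresholds are illustrative.

def byte_associate(tracklets, detections, high_thr=0.6, low_thr=0.1):
    """Simplified BYTE matching; Kalman prediction details are omitted.

    `tracklets` hold predicted boxes for the current frame; `detections`
    are detector outputs with a .score attribute. `match_by_iou` is assumed
    to return (matches, unmatched_tracklets, unmatched_detections).
    """
    high = [d for d in detections if d.score >= high_thr]
    low = [d for d in detections if low_thr <= d.score < high_thr]

    # 1st association: high-score boxes vs. all predicted tracklets.
    matches1, unmatched_tracks, unmatched_high = match_by_iou(tracklets, high)

    # 2nd association: remaining tracklets vs. low-score boxes, which are
    # often occluded objects rather than background.
    matches2, still_unmatched, _ = match_by_iou(unmatched_tracks, low)

    # Unmatched high-score detections start new tracklets; unmatched
    # low-score boxes are discarded as likely background.
    new_tracks = unmatched_high
    lost_tracks = still_unmatched
    return matches1 + matches2, new_tracks, lost_tracks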

Comparison of DeepSort and ByteTrack

DeepSORT uses a pre-trained object detection model to detect objects in each frame, a re-identification (appearance embedding) network to match the detected objects based on their appearance features, and Kalman filters to predict the locations of the objects in the next frame. ByteTrack, on the other hand, uses no appearance model at all: it relies only on the detector's boxes and scores together with Kalman-filter motion prediction, and its key difference is that it also associates the low-score detection boxes instead of discarding them.

using ByteTrack

ByteTracker initiates a new tracklet only if a detection is not matched with any previous tracklet and the bounding box score is higher than a threshold.

references

https://www.datature.io/blog/introduction-to-bytetrack-multi-object-tracking-by-associating-every-detection-box
https://pub.towardsai.net/multi-object-tracking-metrics-1e602f364c0c
https://learnopencv.com/object-tracking-and-reidentification-with-fairmot/
https://medium.com/augmented-startups/top-5-object-tracking-methods-92f1643f8435
https://medium.com/@pedroazevedo6/object-tracking-state-of-the-art-2022-fe9457b77382

segmentation

Image segmentation

Image segmentation is a sub-domain of computer vision and digital image processing which aims at grouping similar regions or segments of an image under their respective class labels.

Semantic segmentation

Semantic segmentation refers to the classification of pixels in an image into semantic classes.

Instance segmentation

Instance segmentation models classify pixels into categories on the basis of “instances” rather than classes.

Panoptic segmentation

Panoptic segmentation can be expressed as the combination of semantic segmentation and instance segmentation where each instance of an object in the image is segregated and the object’s identity is predicted.

Neural networks that perform segmentation typically use an encoder-decoder structure: the encoder is followed by a bottleneck, and then either a full decoder or upsampling layers applied directly to the bottleneck (as in the FCN).

Introduction to deep learning in computer vision

Basic architecture

CNN

Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size.

Assume the input image is n x n and the filter is f x f (f is generally odd). The output image after convolution has size (n - f + 1) x (n - f + 1). Padding is sometimes used during convolution to avoid losing information at the borders, and adjusting the stride compresses the output. To convolve a three-channel RGB image, the corresponding filter also has three channels: each channel is convolved with its corresponding filter slice, the results are summed, and the sums of the three channels are added together, so for a 3x3x3 filter the sum of the 27 multiplications gives one pixel of the output. The filter slices for different channels can be different. For an input with a given height, width, and number of channels, the filters may have a different height and width, but their number of channels must match the input. Pooling layers are commonly included in CNNs; their purpose is to reduce the size of the representation, improve computational speed, and suppress noise so the extracted features are more robust.
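
A quick check of these size rules, using PyTorch purely for illustration (the framework choice is an assumption, not part of the notes above):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)          # RGB input: n = 32, 3 channels

# No padding, stride 1: output is (n - f + 1) = 32 - 5 + 1 = 28
conv_valid = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)
print(conv_valid(x).shape)              # torch.Size([1, 8, 28, 28])

# "Same" padding keeps the spatial size at 32x32
conv_same = nn.Conv2d(3, 8, kernel_size=5, padding=2)
print(conv_same(x).shape)               # torch.Size([1, 8, 32, 32])

# Stride 2 halves the spatial resolution; 2x2 max pooling halves it again
conv_stride = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(conv_stride(x)).shape)       # torch.Size([1, 8, 8, 8])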

Important networks in the history of computer vision

LeNet-5

LeNet-5, developed by Yann LeCun et al. in 1998, was one of the first successful convolutional neural networks (CNNs) for handwritten digit recognition. It laid the foundation for modern CNN architectures and demonstrated the power of deep learning in computer vision tasks ("Gradient-Based Learning Applied to Document Recognition", Yann LeCun et al., 1998). LeNet-5's architecture has seven layers: a convolutional layer (Convolutions, C1), a pooling layer (Subsampling, S2), a convolutional layer (C3), a pooling layer (S4), a fully connected convolutional layer (C5), a fully connected layer (F6), and a Gaussian connection output layer. The input is a 28x28 single-channel image and the filter size is 5x5; the output channels of the first and second convolutional layers are 6 and 16 respectively, and both use Sigmoid as the activation function.
The pooling window is 2x2 with a stride of 2, using average pooling. The last fully connected layers have 120 and 84 neurons, respectively. The final output layer is the Gaussian connection layer, which uses an RBF function (radial basis function) to compute the Euclidean distance between the input vector and a parameter vector.
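
A minimal PyTorch sketch of a LeNet-5-style network matching the description above (28x28 single-channel input; the first conv pads by 2 so the spatial sizes follow the classic 32x32 pipeline, and the Gaussian/RBF output layer is replaced by a plain linear layer, as in most modern re-implementations):

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),   # C1
            nn.AvgPool2d(kernel_size=2, stride=2),                     # S2
            nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),             # C3
            nn.AvgPool2d(kernel_size=2, stride=2),                     # S4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),                  # C5
            nn.Linear(120, 84), nn.Sigmoid(),                          # F6
            nn.Linear(84, num_classes),                                # output
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])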

AlexNet

AlexNet, introduced by Alex Krizhevsky et al. in 2012, was a breakthrough CNN architecture that won the ImageNet competition and popularized deep learning in computer vision. It demonstrated the effectiveness of deep CNNs for image classification and paved the way for subsequent advancements ("ImageNet Classification with Deep Convolutional Neural Networks", Alex Krizhevsky et al., 2012). AlexNet's architecture has eight layers: five convolutional layers and three fully connected layers, making it deeper than LeNet. The first to fifth layers are convolutional, and the first, second, and fifth convolutional layers are each followed by max pooling with a 3x3 window and a stride of 2. The sixth to eighth layers are fully connected. Replacing the Sigmoid used in LeNet with ReLU avoids the vanishing-gradient problem caused by very deep networks or very small gradients.

VGGNet

VGGNet, proposed by Karen Simonyan and Andrew Zisserman in 2014, is known for its simplicity and depth. It consists of deep networks built from stacked 3x3 convolutional layers, showing that increasing network depth improves performance on image classification ("Very Deep Convolutional Networks for Large-Scale Image Recognition", Karen Simonyan and Andrew Zisserman, 2014). Compared with AlexNet, VGGNet adopts a deeper network. It is characterized by the repeated use of the same basic module, and it uses small convolution kernels instead of the medium and large kernels in AlexNet. Its architecture consists of n VGG blocks followed by 3 fully connected layers. A VGG block is composed of some number of 3x3 convolutional layers (kernel size 3x3, stride 1, padding "same"; the number of layers is a hyperparameter) followed by 2x2 max pooling (pool size 2, stride 2). VGGNet comes in several variants, such as VGG11, VGG13, VGG16, and VGG19, which differ in the number of convolutional and fully connected layers; "VGGNet" usually refers to VGG16.
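
A minimal sketch of the VGG block described above, in PyTorch, stacked into a VGG16-style convolutional backbone (the per-block channel counts follow the commonly quoted configuration and are given for illustration):

import torch
import torch.nn as nn

def vgg_block(num_convs, in_channels, out_channels):
    """One VGG block: `num_convs` 3x3 convs (stride 1, same padding)
    followed by 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG16-style convolutional stack: (convs per block, output channels)
cfg = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
blocks, c = [], 3
for n, out_c in cfg:
    blocks.append(vgg_block(n, c, out_c))
    c = out_c
backbone = nn.Sequential(*blocks)
print(backbone(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])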

Network in Network

“Network in Network” (NiN) refers to a neural network architecture proposed by Lin et al. in their paper titled “Network In Network” published in 2014. NiN is designed to enhance the expressive power of deep neural networks by incorporating micro neural networks called “MLPs (Multi-Layer Perceptrons)” or “1x1 Convolutions” within the network structure.

The key idea behind NiN is to replace traditional convolutional layers with what the authors call "MLP convolutional layers" or "1x1 convolutional layers." These layers consist of a series of fully connected layers (MLPs) applied at every pixel location of the input. The purpose is to capture complex local feature interactions and enable more non-linear transformations. By using 1x1 convolutions, NiN can model non-linear relationships across the channels of the input feature map, allowing richer and more powerful representations than standard convolutional layers.
The 1x1 convolutional layer not only integrates the information of different channels at the same position, but can also reduce or increase the channel dimension.

GoogLeNet (Inception-v1)

GoogLeNet, presented by Christian Szegedy et al. in 2015, introduced the Inception module and demonstrated the importance of multi-scale feature extraction. It achieved high accuracy while maintaining computational efficiency, inspiring subsequent Inception versions and influencing network designs ("Going Deeper with Convolutions", Christian Szegedy et al., 2015).
GoogLeNet was designed to address the challenges of deep neural networks, such as computational efficiency and overfitting, while maintaining high accuracy in image classification tasks. It introduced several novel concepts and architectural innovations that made it stand out from previous CNN architectures at the time.

The key feature of GoogLeNet is the Inception module, which utilizes parallel convolutional filters of different sizes (1x1, 3x3, 5x5) to capture features at various scales. This allows the network to learn and represent both local and global features effectively. Additionally, it incorporates 1x1 convolutions for dimensionality reduction and introduces a technique called “bottleneck” layers to reduce the computational complexity.

Inception

In the context of computer vision, "Inception" refers to the Inception module or the Inception architecture used in deep convolutional neural networks (CNNs). The Inception module was introduced in the GoogLeNet architecture (also known as Inception-v1) as a key component for efficient and effective feature extraction. The Inception module captures multi-scale features by employing multiple parallel convolutional filters of different sizes within the same layer. By combining 1x1, 3x3, and 5x5 convolutional filters, it allows the network to learn and extract features at various spatial scales. The module extracts different features through convolutions of three different sizes plus a 3x3 max pooling branch, and then concatenates these four results along the channel axis. Increasing the width of the network in this way captures more features and details of the image. So that the four branch outputs have the same spatial size, both the convolutional layers and the pooling layer use padding "same" and stride 1 to preserve the size of the input feature map.
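
A minimal PyTorch sketch of such a four-branch Inception module; the channel counts are illustrative, not a full GoogLeNet.

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose outputs share the same spatial size
    and are concatenated along the channel axis."""
    def __init__(self, in_c, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_c, c1, kernel_size=1)
        self.b2 = nn.Sequential(
            nn.Conv2d(in_c, c3_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_reduce, c3, kernel_size=3, padding=1))
        self.b3 = nn.Sequential(
            nn.Conv2d(in_c, c5_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_reduce, c5, kernel_size=5, padding=2))
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_c, pool_proj, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])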

ResNet

ResNet, developed by Kaiming He et al. in 2015, introduced the concept of residual learning. It utilizes skip connections (shortcuts) to address the vanishing-gradient problem and enables training of extremely deep networks, leading to significant performance gains in image classification and other tasks ("Deep Residual Learning for Image Recognition", Kaiming He et al., 2015).
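
A minimal sketch of a basic residual block with an identity skip connection, in PyTorch (simplified: no downsampling path):

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 convs plus an identity shortcut, so the block learns F(x)
    and outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection

print(BasicResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])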

DenseNet

DenseNet, introduced by Gao Huang et al. in 2016, focused on dense connectivity patterns between layers. It aimed to alleviate the vanishing-gradient problem, promote feature reuse, and encourage better gradient flow. DenseNet achieved competitive results while reducing the number of parameters compared to other architectures ("Densely Connected Convolutional Networks", Gao Huang et al., 2016).

ResNeXt

ResNeXt is a convolutional neural network (CNN) architecture that builds upon the concepts introduced by the ResNet (Residual Network) model. ResNeXt was proposed by Xie et al. in their paper titled “Aggregated Residual Transformations for Deep Neural Networks” in 2017.

The main idea behind ResNeXt is to leverage the concept of “cardinality” to improve the representational power of the network. Cardinality refers to the number of independent pathways or branches within a block of the network. In ResNeXt, instead of using a single pathway in each block, multiple parallel pathways are employed.

references

https://juejin.cn/post/7104845694225088525
https://www.showmeai.tech/article-detail/221
https://medium.com/ching-i/%E5%8D%B7%E7%A9%8D%E7%A5%9E%E7%B6%93%E7%B6%B2%E7%B5%A1-cnn-%E7%B6%93%E5%85%B8%E6%A8%A1%E5%9E%8B-lenet-alexnet-vgg-nin-with-pytorch-code-84462d6cf60c

object detection

Object detection

Object detection is the field of computer vision that deals with the localization and classification of objects contained in an image or video.
Deep learning-based approaches use neural network architectures such as RetinaNet, YOLO (You Only Look Once), CenterNet, SSD (Single Shot MultiBox Detector), and region-proposal methods (R-CNN, Fast R-CNN, Faster R-CNN, Cascade R-CNN) to detect features of the object and then classify it into labels. The YOLO series currently provides the state of the art for real-time object detection.

Object detection usually consists of the following parts:
Input: refers to the input image.
Backbone: a feature-extraction network, typically pre-trained on ImageNet.
Neck: usually used to extract feature maps of different levels.
Head: predicts the object category and the bounding box (bndBox); usually divided into two types, dense prediction (one-stage) and sparse prediction (two-stage).

metric

mAP

Mean average precision (mAP) is the average of the AP over all categories. The AP metric is the area under the precision-recall (PR) curve, providing a balanced assessment of precision and recall. The PR curve is drawn with recall on the X axis and precision on the Y axis; the higher the precision and recall, the better the model, so the closer the curve is to the upper-right corner, the better. The AP metric incorporates the Intersection over Union (IoU) measure to assess the quality of the predicted bounding boxes: a prediction is counted as a true positive (TP) if its IoU with a ground-truth box is greater than the threshold (usually set to 0.5), and each ground-truth box can be matched at most once.

Intersection over Union(IoU)

IoU is the ratio of the intersection area to the union area of the predicted bounding box and the ground truth bounding box. It measures the overlap between the ground truth and predicted bounding boxes.
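
A minimal NumPy sketch of pairwise IoU between two sets of boxes, assuming [x1, y1, x2, y2] format:

import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between an (N, 4) and an (M, 4) set of boxes; returns (N, M)."""
    boxes_a, boxes_b = np.asarray(boxes_a, float), np.asarray(boxes_b, float)
    tl = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])   # top-left of intersection
    br = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])   # bottom-right of intersection
    wh = np.clip(br - tl, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = np.prod(boxes_a[:, 2:] - boxes_a[:, :2], axis=1)
    area_b = np.prod(boxes_b[:, 2:] - boxes_b[:, :2], axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)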

Flops and FPS

FLOPS (Floating-Point Operations Per Second) measures a computer's or processor's performance in terms of the number of floating-point operations it can perform per second; higher FLOPS values generally indicate faster computational capability. In model comparisons, FLOPs (lowercase "s") usually denotes the total number of floating-point operations needed for one forward pass, i.e. the model's computational cost. FPS (Frames Per Second) measures how many individual frames (images) a video system can display or process per second.

Non-Maximum Suppression (NMS)

Non-Maximum Suppression (NMS) is a post-processing technique used in object detection algorithms to reduce the number of overlapping bounding boxes and improve the overall detection quality.
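
A minimal greedy NMS sketch, using the iou_matrix helper above; this is the classic hard-NMS variant, not Soft-NMS or DIoU-NMS.

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop remaining boxes that overlap it
    by more than the threshold, and repeat."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order = np.argsort(scores)[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        ious = iou_matrix(boxes[best][None, :], boxes[order[1:]])[0]
        order = order[1:][ious <= iou_threshold]   # drop heavily overlapping boxes
    return keep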

Model History

Traditionally, before deep learning took off, object detection was done with the Viola-Jones detector \cite{viola2001rapid}, the Histogram of Oriented Gradients (HOG) detector, or the Deformable Part-based Model (DPM). With deep learning, object detectors are generally categorized into two groups: one-stage detectors and two-stage detectors. The two-stage line started with Regions with CNN features (R-CNN); Spatial Pyramid Pooling Networks (SPPNet), Fast R-CNN, Faster R-CNN, and Feature Pyramid Networks (FPN) were proposed after it. Limited by the poor speed of two-stage detectors, one-stage detectors followed, with You Only Look Once (YOLO) as the first representative. Subsequent versions of YOLO, the Single Shot MultiBox Detector (SSD), RetinaNet, CornerNet, CenterNet, and DETR were proposed later. YOLOv7 performs best compared to most detectors.

RCNN

The R-CNN object detection system consists of three modules. The first generates category-independent region proposals, which define the set of candidate detections available to the detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs.

YOLO series

The history of YOLO (You Only Look Once) dates back to 2015, when the original algorithm was introduced in "You Only Look Once: Unified, Real-Time Object Detection". The original YOLO architecture used a convolutional neural network (CNN) to process the entire image and output a fixed number of bounding boxes along with their associated class probabilities. It divided the image into a grid and applied convolutional operations to predict bounding boxes within each grid cell, considering multiple scales and aspect ratios. In subsequent years, YOLO underwent several iterations and improvements to enhance its accuracy and speed. YOLOv2 was introduced in 2016, featuring an updated architecture that incorporated anchor boxes and multi-scale predictions. YOLOv3 followed in 2018, introducing further advancements, including feature pyramid networks (FPN) and Darknet-53 as the backbone architecture.

YOLO (You Only Look Once)

The network architecture is inspired by the GoogLeNet model for image classification. The network has 24 convolutional layers followed by 2 fully connected layers. The authors pretrain the convolutional layers on the ImageNet 1000-class competition dataset; for pretraining they use the first 20 convolutional layers followed by an average-pooling layer and a fully connected layer. They then add four convolutional layers and two fully connected layers with randomly initialized weights. The final layer predicts both class probabilities and bounding box coordinates. They optimize for sum-squared error in the output of the model, but increase the loss from bounding-box coordinate predictions, decrease the loss from confidence predictions for boxes that do not contain objects, and predict the square root of the bounding-box width and height instead of the width and height directly. The loss is designed this way because plain sum-squared error weights localization error equally with classification error and also weights errors in large boxes and small boxes equally.

YOLOv2

The improvements of YOLOv2 over YOLOv1:
The authors add a batch normalization layer after each convolutional layer and no longer use dropout.
YOLOv1 uses a 224x224 image classifier; YOLOv2 increases the resolution to 448x448.
Because YOLOv1 has difficulty learning to adapt to the shapes of different objects during training, its precise localization is poor. YOLOv2 therefore uses rectangles of different shapes as anchor boxes. Unlike YOLOv1, it does not directly predict the coordinates of the bounding box, but predicts offsets (coordinate offsets) relative to the anchor box together with confidence scores.
In Faster R-CNN and SSD, the anchor box sizes are selected manually; YOLOv2 uses k-means clustering to analyse the bounding boxes of the objects in the training set.
YOLOv2 uses a new base model (feature extractor), Darknet-19, with 19 convolutional layers and 5 max-pooling layers.

YOLO9000

YOLO9000 is a model, built on YOLOv2, that can detect more than 9,000 categories. Its main contribution is a joint training strategy for classification and detection: detection datasets are used to learn the predicted objects' bounding boxes (bndBox), confidence, and classification, while classification datasets are used only to learn classification, which greatly expands the range of object types the model can detect.

The authors propose a hierarchical classification method that builds a tree structure, WordTree, according to the relationships between categories. Softmax is not performed over all categories at once but over the categories at the same level. When making predictions, the model traverses down from the root node, selects the child node with the highest probability at each level, and computes the product of all conditional probabilities from that node back to the root. It stops when this product falls below a threshold and uses the current node as the predicted category.

YOLOv3

Building on YOLOv2, YOLOv3 improves the network backbone, uses multi-scale feature maps for detection, and uses multiple independent logistic regression classifiers instead of softmax for class prediction. YOLOv3 proposes a new backbone, Darknet-53: from layer 0 to layer 74 there are 53 convolutional layers, and the rest are residual (shortcut) layers. Darknet-53 incorporates residual connections, as in ResNet, to alleviate the gradient problem.
YOLOv3 draws on the Feature Pyramid Network (FPN) approach, using feature maps at multiple scales to detect objects of different sizes, which improves the detection of small objects. The feature map at each scale predicts 3 anchor priors, whose sizes are obtained by k-means clustering.

Feature Pyramid Networks (FPN)

The main idea behind FPNs is to leverage the nature of convolutional layers — which reduce the size of the feature space and increase the coverage of each feature in the initial image — to output predictions at different scales.FPNs provide semantically strong features at multiple scales which make them extremely well suited for object detection.

YOLOv4

Bag-of-Freebies refers to techniques used during network training that do not affect inference time, mainly including:
Data augmentation: Random Erase, CutOut, Hide-and-Seek, Grid Mask, GAN-based augmentation, MixUp, CutMix.
Regularization methods: DropOut, DropConnect.
Handling data imbalance: focal loss, online hard example mining, hard negative example mining.
Bounding-box regression losses: MSE, IoU, GIoU, DIoU/CIoU.

Bag-of-Specials refers to techniques used in network design or post-processing that slightly increase inference time but improve accuracy, mainly including:
Receptive field enhancement: SPP, ASPP, RFB.
Feature fusion: FPN, PAN.
Attention mechanisms: attention modules.
Activation functions: Swish, Mish.
NMS variants: Soft-NMS, DIoU-NMS.

The architecture of the YOLOv4 model consists of three parts:
Backbone: CSPDarknet53; Neck: SPP + PAN; Head: YOLO head.

Cross Stage Partial Network (CSPNet)

The main purpose of CSPNet is to let the network obtain richer gradient combinations while reducing the amount of computation.
The method is to split the feature map of the base layer into two parts; one part goes through the block, and the two parts are then merged through a transition -> concatenation -> transition sequence. This approach allows CSPNet to address three problems:
It increases the learning ability of the CNN, maintaining accuracy even when the model is made lightweight;
It removes computational bottlenecks that require high computing power (reducing computation); it reduces memory usage.

SPP+PAN

SPP (Spatial Pyramid Pooling): pools the last feature map of the network at several scales, concatenates the results, and then continues with the following CNN modules.
PANet (Path Aggregation Network): an improvement on FPN that adds a bottom-up aggregation path.

CutMix

CutMix is a data augmentation method proposed in 2019: a region of an image is cut out, but instead of being filled with 0-valued pixels it is filled with a randomly chosen patch from another training image. MixUp mixes two random samples proportionally, with the classification targets distributed in the same proportion. CutOut randomly cuts out regions in the sample and fills them with 0-valued pixels, leaving the classification target unchanged.

Mosaic data augmentation

Whilst common transforms in object detection tend to be augmentations such as flips and rotations, the YOLO authors take a slightly different approach by applying Mosaic augmentation, which was previously used by the YOLOv4, YOLOv5 and YOLOX models. The objective of mosaic augmentation is to overcome the observation that object detection models tend to focus on detecting items towards the centre of the image. The key idea is that, if we stitch multiple images together, the objects are likely to appear in positions and contexts that are not normally observed in the dataset, which should force the features learned by the model to be more position invariant. Mosaic uses random scaling and cropping to mix and stitch 4 pictures together for training. Because each mosaic already contains the data of 4 pictures, the mini-batch size does not need to be large.

Post-mosaic affine transforms

As we noted earlier, the mosaics we are creating are significantly bigger than the image sizes we will use to train our model, so some sort of resizing is needed. Whilst simply resizing would work, it is likely to produce some very small objects, as we are essentially resizing four images to the size of one, which becomes a problem in domains that already contain very small bounding boxes. Additionally, each of our mosaics is structurally quite similar, with an image in each quadrant. Recalling that our aim was to make the model more robust to position changes, this may not actually help that much, as the model is likely just to start looking in the middle of each quadrant. To overcome this, one approach is to simply take a random crop from the mosaic. This still provides variability in positioning whilst preserving the size and aspect ratio of the target objects. It is also a good opportunity to add other transforms, such as scaling and rotation, for even more variability.

DropBlock regularization

Dropout randomly drops individual neurons, but the network can still recover the same information from adjacent activation units.
DropBlock instead drops an entire contiguous local region, forcing the network to learn other features to achieve correct classification, which gives better generalization.

Class label smoothing

In multi-class classification tasks, the output is usually normalized with softmax, and a one-hot label is then used to compute the cross-entropy loss for training. However, one-hot targets can easily lead to overfitting. Label smoothing softens the one-hot label so that overfitting is suppressed when computing the loss, improving the generalization ability of the model.
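
A common formulation of the smoothed target, where $y_i$ is the one-hot target, K is the number of classes, and $\varepsilon$ is a small smoothing factor (e.g. 0.1):

$y_i^{LS} = (1-\varepsilon)\,y_i + \varepsilon / K$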

Mish activation

Mish is a continuously differentiable, non-monotonic activation function. Compared with ReLU, Mish has a smoother gradient and allows a small negative output for negative inputs, which stabilizes the gradient flow of the network and gives better generalization.
$f(x) = x\tanh(\ln(1+e^x))$.

Multiinput weighted residual connections (MiWRC)

YOLOv4 borrows the architecture and methods of EfficientDet and uses multi-input weighted residual connections (MiWRC). EfficientDet's backbone is EfficientNet and its neck is BiFPN. EfficientNet-B0 is built from multiple MBConv blocks; the MBConv block is the inverted residual block of MobileNetV2. MBConv first increases the dimensionality and then reduces it, which is the opposite of the ordinary residual block (reduce then increase); this design lets MobileNetV2 make better use of residual connections to improve accuracy. The idea of MiWRC comes from BiFPN: whereas FPN treats the features from each layer as equal, MiWRC assumes that features from different layers have different importance and gives different weights to features at different scales.

loss

There are two problems with using plain IoU loss. First, when the predicted box and the target box (ground truth) do not intersect, the IoU is 0, which does not reflect the distance between the two boxes; the loss is then non-differentiable (the gradient cannot be computed), so the non-overlapping case cannot be optimized. Second, IoU does not reflect how well the predicted box and the target box coincide.
The subsequent GIoU, DIoU, and CIoU losses add a penalty term to the IoU loss:
GIoU loss (Generalized IoU loss): C is the smallest enclosing box of the ground-truth box and the predicted box.
$L_{GIoU} = 1 - IoU + \frac{|C \setminus (B \cup B^{gt})|}{|C|}$.
DIoU loss (Distance IoU loss) considers the overlapping area and the distance between the box centres, adding a penalty term that minimizes the centre-point distance between the two boxes. CIoU loss (Complete IoU loss) adds a further penalty term to DIoU that takes the aspect ratio into account.
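
For reference, the DIoU and CIoU losses are commonly written as:

$L_{DIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2}$
$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$

where $\rho$ is the Euclidean distance between the centres of the predicted box $b$ and the ground-truth box $b^{gt}$, $c$ is the diagonal length of the smallest enclosing box, $v$ measures aspect-ratio consistency, and $\alpha$ is a positive trade-off weight.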

CmBN (Cross mini-Batch Normalization)

BN normalizes over the current mini-batch, but when the batch size is very small the sampling can be uneven and the normalization statistics become unreliable, so several batch-normalization variants target small batch sizes. The idea of CBN is to include previous mini-batches in the calculation without keeping too many of them: the current mini-batch is normalized together with the statistics of the preceding 3 mini-batches. CmBN, newly created for YOLOv4, is a modification of CBN that does not update the calculations between mini-batches but updates the network parameters only after a full batch is completed.

Self-Adversarial Training (SAT)

SAT is a data augmentation method introduced by the authors that runs in two stages. In the first stage, the training sample is forward-propagated, and during back-propagation the image pixels (not the network weights) are modified so as to degrade the detection performance; in this way the neural network performs an adversarial attack on itself, creating the illusion that there is no object of interest in the picture. This first stage effectively increases the difficulty of the training samples. In the second stage, the modified images are used to train the model.

Eliminate grid sensitivity

The author observed in object-detection videos that objects whose centre point lies near the centre of a grid cell are handled well, while objects whose centre lies on the edge of a grid cell are hard to detect, and attributes this to the sigmoid function: reaching the boundary values requires extreme inputs where the sigmoid gradient vanishes. Therefore the sigmoid output is scaled by a factor greater than 1, using a form like (1 + x) * Sigmoid - 0.5x, where x is larger when the grid resolution is higher, to account for the different sensitivity of different grid sizes to this boundary effect.

Cosine annealing scheduler

Cosine annealing uses the cosine function to adjust the learning rate: it decreases slowly at first, drops quickly around the midpoint, and slows down again at the end.
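
A minimal sketch using PyTorch's built-in scheduler; the model and optimizer here are placeholders.

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

lrs = []
for epoch in range(100):
    # ... training step would go here ...
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

# lrs decays slowly at first, fastest around the midpoint, and slowly at the end.
print(lrs[0], lrs[50], lrs[-1])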

Optimal hyperparameters

Use genetic (evolutionary) algorithms to select hyperparameters: randomly combine hyperparameters for training, keep the best 10% of combinations, randomly recombine and retrain them, and finally select the best model.

SAM-block (Spatial Attention Module)

SAM (Spatial Attention Module) is derived from the CBAM (Convolutional Block Attention Module) paper, which provides two attention mechanisms: channel attention and spatial attention.

DIoU-NMS

In classic NMS, the detection box with the highest confidence is compared with every other detection box by computing their IoU, and boxes whose IoU exceeds the threshold are filtered out. In practice, when two different objects are very close together their IoU is relatively large, so after NMS often only one detection box remains, which can cause missed detections. DIoU-NMS considers not only the IoU but also the distance between the centre points of the two boxes: if the IoU between two boxes is relatively large but their centres are relatively far apart, they are considered detections of different objects and are not filtered out.

YOLOv7

Anchor boxes

The YOLOv7 family of models is anchor-based. In these models, the general philosophy is to first create lots of potential bounding boxes, then select the most promising ones to match to the target objects, slightly moving and resizing them as necessary to obtain the best possible fit. The basic idea is to draw a grid on top of each image and, at each grid intersection (anchor point), generate candidate boxes (anchor boxes) based on a number of anchor sizes; that is, the same set of boxes is repeated at each anchor point. One issue with this approach is that the target (ground-truth) boxes can range in size from tiny to huge, so it is usually not possible to define a single set of anchor sizes that matches all targets. For this reason, anchor-based model architectures usually employ a Feature Pyramid Network (FPN) to assist with this.

Center Priors

If we put 3 anchor boxes at each anchor point of each grid, we end up with a huge number of boxes, and most of these predictions will not contain an object and are classified as 'background'. To make the problem computationally cheaper, the YOLOv7 loss first finds the anchor boxes that are likely to match each target box and treats them differently; these are known as the center prior anchor boxes. This process is applied at each FPN head, for each target box, across all images in the batch at once.

model reparameterization

Model re-parametrization techniques merge multiple computational modules into one at inference stage. The model re-parameterization technique can be regarded as an ensemble technique, and we can divide it into two categories, i.e., module-level ensemble and model-level ensemble.

Model scaling

Model scaling is a way to scale an already designed model up or down so that it fits different computing devices. Network architecture search (NAS) is one of the commonly used model scaling methods.

efficient layer aggregation networks (ELAN)
VoVNet/OSANet

VoVNet (also called OSANet) is a convolutional neural network backbone proposed by Lee et al. in 2019 in "An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection". It is designed as an efficient backbone for real-time detection, addressing the inefficiency of DenseNet-style dense connections.
VoVNet is built by stacking One-Shot Aggregation (OSA) modules: instead of aggregating features at every layer as DenseNet does, each OSA module concatenates the outputs of its intermediate convolutions only once, in its last layer. This keeps the input channel size of the intermediate layers constant and reduces memory access cost, while still combining features with diverse receptive fields for robust representations.

The One-Shot Aggregation (OSA) module is designed to be more efficient than the Dense Block in DenseNet. By cascading OSA modules, the efficient object detection backbone VoVNet is formed. The OSA module aggregates its features only once, in the last layer, so it has a much lower memory access cost (MAC) than a dense block and better GPU computation efficiency. Because the input sizes of the intermediate layers of an OSA module are constant, no additional 1x1 conv bottleneck is needed to reduce dimensions, which means the module consists of fewer layers.

CSPVOVNet

It combines CSPNet and VoVNet and takes the gradient path into account, so that the weights of different layers can learn more diverse features, improving accuracy.

Deep supervision

When training deep networks, an auxiliary head or auxiliary classifiers are often added to the middle layers of the network to improve stability and convergence speed and to avoid vanishing gradients; that is, an auxiliary loss is used to train the weights of the shallower layers. This technique is called deep supervision.

dynamic label assignment

A label assigner is a mechanism that considers the network's predictions together with the ground truth and then assigns soft labels. In the past, the target label was usually a hard label taken directly from the ground truth; in recent years, the model's predictions and the ground truth are often combined and optimized to obtain a soft label, and this mechanism is called a label assigner in the paper. The authors discuss three ways of assigning soft labels to the auxiliary head and the lead head:
Independent: the auxiliary head and the lead head each perform label assignment against the ground truth separately; this is currently the most common method.
Lead head guided label assigner: since the lead head has a stronger learning ability than the auxiliary head, the soft label obtained by optimizing the lead head's predictions against the ground truth better expresses the distribution of, and correlation between, the data and the ground truth. This soft label is then used as the training target for both the auxiliary head and the lead head, so the shallower auxiliary head can directly learn the information the lead head has already learned, while the lead head focuses on the residual information that has not yet been learned.
Coarse-to-fine lead head guided label assigner: the soft label is again obtained by optimizing the lead head's predictions against the ground truth, but two different soft labels are generated, a coarse label and a fine label. The fine label is the same as the lead head's soft label, while the coarse label is used for the auxiliary head.

Optimal Transport Assignment

The simplest approach is to define an Intersection over Union (IoU) threshold and decide based on that. While this generally works, it becomes problematic when there are occlusions, ambiguity or when multiple objects are very close together. Optimal Transport Assignment (OTA) aims to solve some of these problems by considering label assignment as a global optimization problem for each image.YOLOv7 implements simOTA (introduced in the YOLOX paper), a simplified version of the OTA problem.

Model EMA

When training a model, it can be beneficial to set the values for the model weights by taking a moving average of the parameters that were observed across the entire training run, as opposed to using the parameters obtained after the last incremental update. This is often done by maintaining an exponentially weighted average (EMA) of the model parameters, in practice, this usually means maintaining another copy of the model to store these averaged weights. This technique has been employed in several training schemes for popular models such as training MNASNet, MobileNet-V3 and EfficientNet.

The approach to EMA taken by the YOLOv7 authors is slightly different to other implementations as, instead of using a fixed decay, the amount of decay changes based on the number of updates that have been made.
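
A minimal sketch of parameter EMA with a fixed decay; as noted above, YOLOv7 instead ramps the decay with the number of updates, so this is a generic illustration rather than the YOLOv7 implementation.

import copy
import torch

class ModelEMA:
    """Keeps a shadow copy of the model whose weights are an exponential
    moving average of the live weights."""
    def __init__(self, model, decay=0.9999):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        ema_state = self.ema.state_dict()
        for name, value in model.state_dict().items():
            if value.dtype.is_floating_point:
                ema_state[name].mul_(self.decay).add_(value, alpha=1.0 - self.decay)
            else:
                ema_state[name].copy_(value)   # e.g. BatchNorm batch counters

# Typical usage: call ema.update(model) after every optimizer step,
# then evaluate or export ema.ema instead of the live model.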

Loss algorithm

We can break the algorithm used in the YOLOv7 loss calculation into the following steps:

  1. For each FPN head (or each FPN head and Aux FPN head pair if Aux heads are used):
    Find the Center Prior anchor boxes.
    Refine the candidate selection through the simOTA algorithm. Always use lead FPN heads for this.
    Obtain the objectness loss score using Binary Cross Entropy Loss between the predicted objectness probability and the Complete Intersection over Union (CIoU) with the matched target as ground truth. If there are no matches, this is 0.
    If there are any selected anchor box candidates, also calculate (otherwise they are just 0):
  • The box (or regression) loss, defined as the mean(1 - CIoU) between all candidate anchor boxes and their matched target.
  • The classification loss, using Binary Cross Entropy Loss between the predicted class probabilities for each anchor box and a one-hot encoded vector of the true class of the matched target.
    If the model uses auxiliary heads, add each component obtained from the aux head to the corresponding main loss component (i.e., x = x + aux_wt*aux_x). The contribution weight (aux_wt) is defined by a predefined hyperparameter.
    Multiply the objectness loss by the corresponding FPN head weight (predefined hyperparameter).
  2. Multiply each loss component (objectness, classification, regression) by its contribution weight (predefined hyperparameter).

  3. Sum the already weighted loss components.

  4. Multiply the final loss value by the batch size.

using yolov7

GitHub address: https://github.com/WongKinYiu/yolov7
Format converter: https://github.com/wy17646051/UA-DETRAC-Format-Converter

potential ideas

efficiency

To enhance the real-time performance of a detection network, researchers generally analyse the model from the perspectives of parameter count, amount of computation, memory access cost, input/output channel ratio, element-wise operations, and computational density. These analysis methods are similar to those used in ShuffleNetV2.

NAS(Neural Architecture Search)

NAS was an inspiring work out of Google that led to several follow-up works such as ENAS, PNAS, and DARTS. It involves training a recurrent neural network (RNN) controller using reinforcement learning (RL) to automatically generate architectures.

Vision Transformer

The core conclusion of the original ViT paper is that, with enough data for pre-training, ViT's performance exceeds that of CNNs; it overcomes the Transformer's lack of inductive bias and transfers better to downstream tasks. However, when the training dataset is not large enough, ViT usually performs worse than ResNets of the same size, because the Transformer lacks the inductive biases that CNNs have, i.e. prior knowledge or good assumptions built in from the start.

improving anchor box selection

datasets

PASCAL VOC 2007, VOC 2012, Microsoft COCO (Common Objects in Context).
UA-DETRAC: https://detrac-db.rit.albany.edu/
https://www.kaggle.com/datasets/patrikskalos/ua-detrac-fix-masks-two-wheelers?resource=download
https://colab.research.google.com/github/hardik0/Multi-Object-Tracking-Google-Colab/blob/main/Towards-Realtime-MOT-Vehicle-Tracking.ipynb#scrollTo=y6KZeLt9ViDe
https://github.com/hardik0/Towards-Realtime-MOT/tree/master
https://github.com/wy17646051/UA-DETRAC-Format-Converter/tree/main
MIO-TCD: https://tcd.miovision.com/
KITTI: https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark
TRANCOS: https://gram.web.uah.es/data/datasets/trancos/index.html
STREETS: https://www.kaggle.com/datasets/ryankraus/traffic-camera-object-detection (single class)
VERI-Wild: https://github.com/PKU-IMRE/VERI-Wild

https://universe.roboflow.com/7-class/11-11-2021-09.41
https://universe.roboflow.com/szabo/densitytrafficcontroller-1axlm
https://universe.roboflow.com/future-institute-of-technology-1wuwl/indian-vehicle-set-1
https://universe.roboflow.com/cv-2022-kyjj6/tesi
https://universe.roboflow.com/vehicleclassification-kxtkb/vehicle_classification-fvssn
https://universe.roboflow.com/urban-data/urban-data
https://www.kaggle.com/datasets/ashfakyeafi/road-vehicle-images-dataset
https://github.com/MaryamBoneh/Vehicle-Detection

References

https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173
https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-1-33220ebc1d09
https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-2-85ee99d114a1
https://medium.com/@chingi071/yolo%E6%BC%94%E9%80%B2-3-yolov4%E8%A9%B3%E7%B4%B0%E4%BB%8B%E7%B4%B9-5ab2490754ef
https://zhuanlan.zhihu.com/p/183261974
https://sh-tsang.medium.com/review-vovnet-osanet-an-energy-and-gpu-computation-efficient-backbone-network-for-real-time-3b26cd035887
https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-yolov7-%E8%AB%96%E6%96%87%E9%96%B1%E8%AE%80-97b0e914bdbe
https://towardsdatascience.com/yolov7-a-deep-dive-into-the-current-state-of-the-art-for-object-detection-ce3ffedeeaeb
https://towardsdatascience.com/neural-architecture-search-limitations-and-extensions-8141bec7681f
https://learnopencv.com/fine-tuning-yolov7-on-custom-dataset/#The-Training-Experiments-that-We-Will-Carry-Out
https://learnopencv.com/yolov7-object-detection-paper-explanation-and-inference/