Object detection
Object detection is the field of computer vision that deals with the localization and classification of objects contained in an image or video.
Deep learning-based approaches use neural network architectures such as RetinaNet, YOLO (You Only Look Once), CenterNet, SSD (Single Shot MultiBox Detector), and region-proposal methods (R-CNN, Fast R-CNN, Faster R-CNN, Cascade R-CNN) to detect object features and then classify them into labels. The YOLO series currently provides the state of the art (SOTA) for real-time object detection.
Object detection usually consists of the following parts:
Input: the input image
Backbone: a feature-extraction network, usually pre-trained on ImageNet
Neck: usually used to extract and fuse feature maps of different levels
Head: predicts the object category and the bounding box (bndBox); usually divided into two types: Dense Prediction (one-stage) and Sparse Prediction (two-stage)
Metrics
mAP
Mean average precision (mAP) is the average of the AP over all categories. The AP metric is the area under the precision-recall (PR) curve, so it provides a balanced assessment of precision and recall. The PR curve is drawn with recall on the x-axis and precision on the y-axis; the higher the precision and recall, the better the model, so the closer the curve is to the upper-right corner, the better. AP incorporates the Intersection over Union (IoU) measure to assess the quality of the predicted bounding boxes: a detection counts as a true positive (TP) if its IoU with a ground-truth box exceeds a threshold (usually 0.5), and each ground-truth box can be matched at most once.
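As a rough illustration (ignoring benchmark-specific interpolation rules such as COCO's), AP for one class can be computed from ranked detections as below; `scores`, `is_tp`, and `num_gt` are hypothetical inputs assumed to come from a prior IoU-based matching step:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP as the area under the precision-recall curve.
    scores: confidence of each detection; is_tp: 1 if the detection matched
    a ground-truth box (IoU above threshold, each GT matched at most once),
    else 0; num_gt: number of ground-truth boxes for this class."""
    order = np.argsort(-np.asarray(scores))        # rank detections by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)               # x axis of the PR curve
    precision = tp_cum / (tp_cum + fp_cum)         # y axis of the PR curve
    return np.trapz(precision, recall)             # area under the raw PR curve

# mAP is then the mean of the per-class APs:
# mAP = np.mean([average_precision(s, t, n) for s, t, n in per_class_stats])
```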
Intersection over Union (IoU)
IoU is the ratio of the intersection area to the union area of the predicted bounding box and the ground truth bounding box. It measures the overlap between the ground truth and predicted bounding boxes.
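A minimal sketch for axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```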
FLOPS and FPS
FLOPS (Floating-Point Operations Per Second) is a measure of a computer's or a processor's performance in terms of the number of floating-point operations it can perform per second. Higher FLOPS values generally indicate faster computational capabilities. FPS (Frames Per Second) is a measure of how many individual frames (images) a video system can display or process per second.
Non-Maximum Suppression (NMS)
Non-Maximum Suppression (NMS) is a post-processing technique used in object detection algorithms to reduce the number of overlapping bounding boxes and improve the overall detection quality.
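A minimal greedy NMS sketch, reusing the iou() helper from the IoU section above; the box format and default threshold are assumptions:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop the boxes that
    overlap it by more than iou_threshold, and repeat on the rest."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                       # highest remaining score
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep                                   # indices of surviving boxes
```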
Model History
Traditionally, before deep learning took off, object detection was done with the Viola-Jones detector \cite{viola2001rapid}, the Histogram of Oriented Gradients (HOG) detector, or the Deformable Part-based Model (DPM). With deep learning, object detectors are generally categorized into two groups: one-stage detectors and two-stage detectors. The two-stage line started with Regions with CNN features (R-CNN); Spatial Pyramid Pooling Networks (SPPNet), Fast R-CNN, Faster R-CNN, and Feature Pyramid Networks (FPN) were proposed after it. Motivated by the limited speed of two-stage detectors, the one-stage line began with the first representative, You Only Look Once (YOLO). Subsequent YOLO versions, the Single Shot MultiBox Detector (SSD), RetinaNet, CornerNet, CenterNet, and DETR were proposed later. YOLOv7 performs best compared to most detectors.
RCNN
The R-CNN object detection system consists of three modules. The first generates category-independent region proposals, which define the set of candidate detections available to the detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs.
YOLO series
The history of YOLO (You Only Look Once) dates back to 2015, when the original algorithm was introduced in "You Only Look Once: Unified, Real-Time Object Detection." The original YOLO architecture used a convolutional neural network (CNN) to process the entire image and output a fixed number of bounding boxes along with their associated class probabilities. It divided the image into a grid and applied convolutional operations to predict bounding boxes within each grid cell, considering multiple scales and aspect ratios. In subsequent years, YOLO underwent several iterations and improvements to enhance its accuracy and speed: YOLOv2 was introduced in 2016, featuring an updated architecture that incorporated anchor boxes and multi-scale predictions; YOLOv3 followed in 2018, introducing further advancements, including feature pyramid networks (FPN) and Darknet-53 as the backbone architecture.
YOLO (You Only Look Once)
The network architecture is inspired by the GoogLeNet model for image classification: 24 convolutional layers followed by 2 fully connected layers. The authors pretrain the convolutional layers on the ImageNet 1000-class competition dataset; for pretraining they use the first 20 convolutional layers followed by an average-pooling layer and a fully connected layer. They then add four convolutional layers and two fully connected layers with randomly initialized weights. The final layer predicts both class probabilities and bounding box coordinates. They optimize for sum-squared error in the model output, but design the loss to handle the fact that plain sum-squared error weights localization error equally with classification error, and weights errors in large and small boxes equally: they increase the loss from bounding box coordinate predictions, decrease the loss from confidence predictions for boxes that do not contain objects, and predict the square root of the bounding box width and height instead of the width and height directly.
YOLOv2
Improvements of YOLOv2 over YOLOv1:
The authors add a batch normalization layer after each convolutional layer and no longer use dropout.
YOLOv1 trains its classifier on 224x224 images; YOLOv2 increases the classifier resolution to 448x448.
YOLOv1 has difficulty adapting to objects of different shapes during training, which hurts precise localization. YOLOv2 therefore uses rectangles of different shapes as anchor boxes. Unlike YOLOv1, the network does not directly predict the coordinates of the bounding box (bndBox); instead it predicts offsets relative to the anchor box, together with confidence scores.
In Faster R-CNN and SSD, the anchor box sizes are selected manually. YOLOv2 instead uses k-means clustering on the bounding boxes (bndBox) of the objects in the training set.
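A sketch of this clustering, assuming `wh` is an Nx2 NumPy array of ground-truth box widths and heights; the distance d = 1 - IoU follows the YOLOv2 paper, while the initialization and iteration details here are arbitrary choices:

```python
import numpy as np

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster (width, height) pairs with distance d = 1 - IoU, where IoU is
    computed as if each box and centroid shared the same top-left corner."""
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)]  # random init
    for _ in range(iters):
        inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centroids[None, :, 1]))
        union = (wh[:, 0] * wh[:, 1])[:, None] + \
                (centroids[:, 0] * centroids[:, 1])[None, :] - inter
        assign = np.argmax(inter / union, axis=1)          # nearest = highest IoU
        centroids = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                              else centroids[j] for j in range(k)])
    return centroids

# e.g. anchors = kmeans_anchors(np.array(train_box_wh), k=5)
```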
YOLOv2 uses a new base model (feature extractor), Darknet-19, with 19 convolutional layers and 5 max-pooling layers.
YOLO9000
YOLO9000 is a model, built on YOLOv2, that can detect more than 9,000 categories. Its main contribution is a joint training strategy for classification and detection: detection data is used to learn the predicted object's bounding box (bndBox), confidence, and classification, while classification data is used to learn classification only, which greatly expands the range of object types the model can detect.
The authors propose a hierarchical classification method, which builds a tree structure, WordTree, according to the subsumption relations between categories. Softmax is then performed not over all categories but over the categories at the same level. At prediction time, the model traverses down from the root node, selecting the child with the highest probability at each level and computing the product of all conditional probabilities from that node back to the root. Traversal stops when this product falls below a threshold, and the current node is used as the predicted category.
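A sketch of that traversal; the node interface (`.children`, `.name`) and the `cond_prob` callback are hypothetical stand-ins for the network's per-level softmax outputs:

```python
def wordtree_predict(root, cond_prob, threshold=0.5):
    """Walk down the WordTree: at each level pick the child with the highest
    conditional probability; stop when the running product of conditional
    probabilities falls below the threshold.
    cond_prob(node) is assumed to return P(node | parent)."""
    node, prob = root, 1.0
    while node.children:
        best = max(node.children, key=cond_prob)
        if prob * cond_prob(best) < threshold:
            break                                  # too uncertain to go deeper
        node, prob = best, prob * cond_prob(best)
    return node.name, prob                         # predicted category and score
```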
YOLOv3
Building on YOLOv2, YOLOv3 improves the network backbone, uses multi-scale feature maps for detection, and replaces softmax with multiple independent logistic regression classifiers for category prediction. YOLOv3 proposes a new backbone, Darknet-53: from layer 0 to layer 74 there are 53 convolutional layers, and the remaining layers are residual (ResNet-style) layers. Darknet-53 incorporates residual connections to mitigate gradient problems.
YOLOv3 draws on the Feature Pyramid Network (FPN) approach, using multi-scale feature maps to detect objects of different sizes and improving predictions for small objects. The feature map at each scale predicts 3 anchor priors, whose sizes are clustered with k-means.
Feature Pyramid Networks (FPN)
The main idea behind FPNs is to leverage the nature of convolutional layers, which reduce the size of the feature space while increasing the coverage of each feature in the initial image, to output predictions at different scales. FPNs provide semantically strong features at multiple scales, which makes them extremely well suited for object detection.
YOLOv4
Bag-of-Freebies refers to techniques used during network training that do not affect inference time, mainly including:
Data augmentation: Random Erase, CutOut, Hide-and-Seek, Grid Mask, GAN, MixUp, CutMix
Regularization methods: DropOut, DropConnect
Dealing with data imbalance: focal loss, online hard example mining, hard negative example mining
Bounding box (bndBox) regression losses: MSE, IoU, GIoU, DIoU/CIoU
Bag-of-Specials refers to techniques used in network design or post-processing that slightly increase inference time but improve accuracy, mainly including:
Receptive field: SPP, ASPP, RFB
Feature fusion: FPN, PAN
Attention mechanism: attention modules
Activation functions: Swish, Mish
NMS: Soft-NMS, DIoU-NMS
The architecture of the YOLOv4 model consists of three parts:
Backbone: CSPDarknet53; Neck: SPP+PAN; Head: YOLO head.
Cross Stage Partial Network (CSPNet)
The main purpose of CSPNet is to let the network architecture obtain richer gradient combination information while reducing the amount of computation.
The method first splits the feature map of the base layer into two parts; one part goes through the block's heavy computation and a transition, and is then concatenated with the other part and passed through another transition (a minimal sketch follows the list below). This approach allows CSPNet to address three problems:
Increase the learning ability of CNNs: even if the model is lightweight, it can maintain accuracy;
Remove computational bottleneck structures that demand high computing power (reduce computation);
Reduce memory usage.
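A minimal PyTorch sketch of the split/merge idea; this is an illustration of the CSP pattern, not the exact CSPDarknet53 block:

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Split the base feature map channel-wise, send one part through the
    heavy computation, then merge the two parts with a transition conv."""
    def __init__(self, channels, inner):
        super().__init__()
        self.inner = inner                           # e.g. a stack of residual blocks
        self.transition = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                            # x: (N, channels, H, W), channels even
        part1, part2 = x.chunk(2, dim=1)             # split the base layer in two
        return self.transition(torch.cat([part1, self.inner(part2)], dim=1))

# usage: inner must map channels//2 -> channels//2
block = CSPBlock(64, nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU()))
out = block(torch.randn(1, 64, 56, 56))              # -> (1, 64, 56, 56)
```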
SPP+PAN
SPP (Spatial Pyramid Pooling): applies pooling at several kernel sizes to the last feature map of the network, concatenates all the resulting feature maps, and feeds them to the following CNN module.
PANet (Path Aggregation Network): an improvement built on top of FPN, adding a bottom-up path aggregation branch.
CutMix
CutMix is a data augmentation method proposed in 2019: a region of the image is cut out but filled not with 0-valued pixels, rather with the pixel values of the same region from another randomly chosen training image.
MixUp: mix two random samples proportionally; the classification targets are distributed in the same proportion.
Cutout: randomly cut out a region of the sample and fill it with 0-valued pixels; the classification label remains unchanged.
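A sketch of CutMix on NumPy arrays; mixing the one-hot labels in proportion to the pasted area and Beta(1, 1) sampling follow the CutMix paper, while the helper name and argument layout are assumptions:

```python
import numpy as np

def cutmix(img_a, label_a, img_b, label_b, rng=None):
    """Paste a random region of img_b into img_a and mix the one-hot labels
    in proportion to the pasted area. Images are HxWxC arrays of equal shape."""
    rng = rng or np.random.default_rng()
    h, w = img_a.shape[:2]
    lam = rng.beta(1.0, 1.0)                         # area ratio kept from img_a
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = int(rng.integers(h)), int(rng.integers(w))
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    out = img_a.copy()
    out[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]          # fill with the other image's pixels
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)      # exact ratio after clipping
    return out, lam * label_a + (1.0 - lam) * label_b
```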
Mosaic data augmentation
Whilst common transforms in object detection tend to be augmentations such as flips and rotations, the YOLO authors take a slightly different approach by applying Mosaic augmentation, which was previously used by the YOLOv4, YOLOv5 and YOLOX models. The objective of mosaic augmentation is to overcome the observation that object detection models tend to focus on detecting items towards the centre of the image. The key idea is that, if we stitch multiple images together, the objects are likely to be in positions and contexts that are not normally observed in the dataset, which should force the features learned by the model to be more position invariant. Mosaic uses random scaling and cropping to mix and stitch four images for training; since each mosaic already combines the data of four pictures, the mini-batch size does not need to be large.
Post-mosaic affine transforms
As noted earlier, the mosaics we create are significantly bigger than the image size we will use to train the model, so some sort of resizing is needed. Whilst simply resizing would work, it is likely to produce some very small objects, as we are essentially squeezing four images into the area of one; this becomes a problem in domains that already contain very small bounding boxes. Additionally, each of our mosaics is structurally quite similar, with an image in each quadrant. Recalling that the aim was to make the model more robust to position changes, this may not actually help much, as the model is likely just to start looking in the middle of each quadrant. To overcome this, one approach is to simply take a random crop from the mosaic. This still provides variability in positioning whilst preserving the size and aspect ratio of the target objects. It is also a good opportunity to add other transforms, such as scaling and rotation, for even more variability; a sketch of the stitch-then-crop idea follows.
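A minimal sketch of mosaic stitching followed by a random crop; bounding-box bookkeeping, rescaling, and the extra affine transforms are omitted, and each input image is assumed to be at least s x s:

```python
import numpy as np

def mosaic_crop(imgs, s, rng=None):
    """Paste four images into a 2s x 2s canvas, one per quadrant,
    then take a random s x s crop so object positions vary."""
    rng = rng or np.random.default_rng()
    canvas = np.zeros((2 * s, 2 * s, 3), dtype=imgs[0].dtype)
    for img, (y, x) in zip(imgs, [(0, 0), (0, s), (s, 0), (s, s)]):
        canvas[y:y + s, x:x + s] = img[:s, :s]       # naive paste; real code rescales
    cy, cx = int(rng.integers(s + 1)), int(rng.integers(s + 1))
    return canvas[cy:cy + s, cx:cx + s]              # s x s crop, objects off-centre
```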
DropBlock regularization
Dropout randomly drops individual neurons, but the network can still learn the same information from adjacent activation units.
DropBlock instead drops entire contiguous local regions, so the network must rely on other features to achieve correct classification, giving better generalization.
Class label smoothing
In multi-class tasks, the output is usually normalized with softmax and a one-hot label is used to compute the cross-entropy loss. However, one-hot targets can easily lead to overfitting. Label smoothing softens the one-hot label so that overfitting is suppressed when computing the loss, improving the generalization ability of the model; a minimal sketch follows.
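The usual uniform-smoothing form, with the smoothing factor eps as an assumed hyperparameter:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Soften one-hot targets: the true class gets 1 - eps plus its share
    of eps, and the remaining mass is spread uniformly over all classes."""
    n = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / n

print(smooth_labels(np.array([0., 0., 1., 0.])))  # [0.025 0.025 0.925 0.025]
```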
Mish activation
Mish is a continuously differentiable, non-monotonic activation function. Compared with ReLU, Mish's gradient is smoother, and it allows small negative values for negative inputs, which stabilizes the network's gradient flow and gives better generalization.
$f(x) = x\tanh(\ln(1+e^x))$
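The same formula in PyTorch, using softplus(x) = ln(1 + e^x); recent PyTorch versions also ship a built-in torch.nn.Mish:

```python
import torch
import torch.nn.functional as F

class Mish(torch.nn.Module):
    """Mish as defined above: f(x) = x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```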
Multi-input weighted residual connections (MiWRC)
YOLOv4 borrows the architecture and methods of EfficientDet and uses multi-input weighted residual connections (MiWRC). EfficientDet's backbone is EfficientNet and its neck is BiFPN. EfficientNet-B0 is built from multiple MBConv blocks; the MBConv block follows the inverted residual block of MobileNetV2, which first increases the dimension and then reduces it, the opposite of the classic residual block, and this design lets MobileNetV2 make better use of residual connections to improve accuracy. The idea of MiWRC comes from BiFPN: whereas FPN treats the features from each layer as equal, MiWRC holds that features from different layers have different importance and assigns different weights to features of different scales.
Loss
There are two problems with using the IoU loss: (1) when the predicted box (bndBox) and the target box (ground truth) do not intersect, the IoU is 0 and cannot reflect the distance between the two boxes; the loss function is then non-differentiable (no gradient can be computed), so non-overlapping boxes cannot be optimized; (2) the IoU cannot distinguish how the predicted box and the target box overlap.
The subsequent GIoU, DIoU, and CIoU losses each add a penalty term to the IoU loss:
GIoU loss (Generalized IoU loss): C is the smallest enclosing box of the ground-truth box and the predicted box.
$L_{GIoU} = 1 - IoU + \frac{|C \setminus (B \cup B^{gt})|}{|C|}$
DIoU loss (Distance IoU loss) considers the overlapping area and the center-point distance, adding a penalty term that minimizes the distance between the center points of the two boxes. CIoU loss (Complete IoU loss) adds a further penalty term on top of DIoU that takes the aspect ratio into account; the standard formulations are given below.
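For reference, the penalty terms as given in the DIoU/CIoU paper (Zheng et al., 2020):
$L_{DIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2}$, where $\rho$ is the Euclidean distance between the two box centers and $c$ is the diagonal length of the smallest enclosing box.
$L_{CIoU} = L_{DIoU} + \alpha v$, with $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ and $\alpha = \frac{v}{(1 - IoU) + v}$.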
CmBN (Cross mini-Batch Normalization)
BN normalizes over the current mini-batch, but the batch size is often very small and sampling may be uneven, which can make the normalization statistics unreliable; hence several batch-normalization variants for small batch sizes exist. The idea of CBN (Cross-Iteration Batch Normalization) is to include previous mini-batches in the statistics without keeping too many of them: it normalizes using the current mini-batch together with the preceding three mini-batches. CmBN, newly introduced in YOLOv4, is a modification of CBN: it does not update calculations between mini-batches, but updates the network parameters after a whole batch is completed.
Self-Adversarial Training (SAT)
SAT is a data augmentation method introduced by the authors, completed in two stages. First, the training sample is forward-propagated, and then the image pixels (not the network weights) are modified during back-propagation so as to degrade detection performance; in this way the neural network performs an adversarial attack on itself, creating the illusion that there is no object in the picture. This first stage effectively increases the difficulty of the training samples. In the second stage, the modified images are used to train the model.
Eliminate grid sensitivity
The authors observed in object detection videos that detected objects whose centers lie near the center of a grid cell are found easily, while those near a cell edge are hard to detect; they attribute this to the gradient behavior of the sigmoid function near its extremes. They therefore modify the sigmoid, multiplying it by a factor greater than 1 and shifting it to account for the sensitivity of different grid sizes to boundary effects, using $(1+x)\cdot\sigma(\cdot) - 0.5x$, where $x$ is larger for higher grid resolutions.
Cosine annealing scheduler
Cosine annealing adjusts the learning rate with a cosine function: the learning rate decreases slowly at first, drops quickly in the middle, and slows down again at the end.
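In PyTorch this schedule is available out of the box; the model, learning rate, T_max, and eta_min below are placeholder values:

```python
import torch

model = torch.nn.Linear(10, 2)                     # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-5)            # anneal over 100 epochs

for epoch in range(100):
    # ... train one epoch ...
    optimizer.step()                               # placeholder for a real update
    scheduler.step()                               # lr follows half a cosine wave
```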
Optimal hyperparameters
Genetic (evolutionary) algorithms are used to select hyperparameters: randomly combine hyperparameters for training, select the best 10% of combinations, recombine and retrain them, and finally pick the best model; a toy sketch follows.
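A toy version of that loop, not YOLOv4's actual implementation; `space` and `fitness` are hypothetical, with `fitness` standing in for "train a model and return its validation score":

```python
import random

def evolve_hyperparams(space, fitness, pop_size=30, generations=5, top_frac=0.1):
    """Sample random combinations, keep the best 10%, recombine their
    values key by key, and repeat. `space` maps a hyperparameter name to
    its candidate values; `fitness` scores a combination."""
    pop = [{k: random.choice(v) for k, v in space.items()} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)          # best combinations first
        elite = pop[:max(1, int(top_frac * pop_size))]
        pop = elite + [{k: random.choice(elite)[k] for k in space}
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=fitness)
```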
SAM-block (Spatial Attention Module)
SAM is derived from the CBAM (Convolutional Block Attention Module) paper, which provides two attention mechanisms: channel attention and spatial attention.
DIoU-NMS
In classic NMS, the detection box with the highest confidence is compared with every other detection box by computing their IoU, and boxes whose IoU exceeds the threshold are filtered out. In practice, when two different objects are very close, their IoU can be large, and after NMS only one detection box may remain, causing missed detections. DIoU-NMS considers not only the IoU but also the distance between the center points of the two boxes: if the IoU between two boxes is relatively large but their centers are relatively far apart, they are considered detections of different objects and are not filtered out.
YOLOv7
Anchor boxes
The YOLOv7 family is anchor-based. In these models, the general philosophy is to first create lots of potential bounding boxes, then select the most promising ones to match to the target objects, slightly moving and resizing them as necessary to obtain the best possible fit. The basic idea is to draw a grid on top of each image and, at each grid intersection (anchor point), generate candidate boxes (anchor boxes) from a set of anchor sizes; that is, the same set of boxes is repeated at each anchor point (a sketch follows). One issue with this approach is that the target (ground-truth) boxes can range in size from tiny to huge, so it is usually not possible to define a single set of anchor sizes that can be matched to all targets. For this reason, anchor-based architectures usually employ a Feature Pyramid Network (FPN) to assist with this.
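A sketch of that candidate-box generation; the grid shape, stride, box format (cx, cy, w, h), and anchor sizes are placeholder values:

```python
import numpy as np

def make_anchor_boxes(grid_h, grid_w, stride, sizes):
    """Repeat the same set of (w, h) anchor sizes at every grid point,
    returning boxes in image coordinates as (cx, cy, w, h)."""
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing='ij')
    centres = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1)
    anchors = [np.concatenate([centres.reshape(-1, 2),
                               np.full((grid_h * grid_w, 2), wh)], axis=1)
               for wh in sizes]
    return np.concatenate(anchors)                 # (grid_h*grid_w*len(sizes), 4)

boxes = make_anchor_boxes(80, 80, stride=8, sizes=[(16, 16), (32, 16), (16, 32)])
```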
Center Priors
If we place 3 anchor boxes at each anchor point of each grid, we end up with a huge number of boxes, and most of these predictions will not contain an object, which we classify as 'background'. To make the problem computationally cheaper, the YOLOv7 loss first finds the anchor boxes that are likely to match each target box and treats them differently; these are known as the center prior anchor boxes. This process is applied at each FPN head, for each target box, across all images in the batch at once.
Model reparameterization
Model re-parameterization techniques merge multiple computational modules into one at the inference stage. Re-parameterization can be regarded as an ensemble technique and divided into two categories: module-level ensemble and model-level ensemble.
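The classic module-level example is folding a BatchNorm into the preceding convolution for inference; this sketch illustrates the general idea (it is not YOLOv7's specific RepConv, and it ignores grouped or dilated convolutions):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BN into the conv: BN(conv(x)) = scale*(Wx + b - mean) + beta,
    with scale = gamma / sqrt(var + eps), collapses into a single conv
    with rescaled weight and bias. Outputs match in eval mode."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / std
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)
    return fused
```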
Model scaling
Model scaling is a way to scale an already designed model up or down so that it fits different computing devices. Network architecture search (NAS) is one of the commonly used model scaling methods.
Efficient Layer Aggregation Networks (ELAN)
VoVNet/OSANet
VoVNet is a convolutional neural network (CNN) backbone proposed by Lee et al. in 2019 in "An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection." It is designed as an energy- and GPU-computation-efficient backbone for real-time detection. Its core building block is the One-Shot Aggregation (OSA) module, which, unlike DenseNet's dense connections, aggregates all intermediate features only once, in the final layer of the block.
The One-Shot Aggregation (OSA) module is more efficient than the Dense Block in DenseNet; cascading OSA modules forms the efficient object detection backbone VoVNet. The OSA module aggregates its features all at once in the last layer, which gives much lower memory access cost (MAC) than a dense block and also improves GPU computation efficiency. The input sizes of the intermediate layers of an OSA module are constant, so no additional 1x1 conv bottleneck is needed to reduce dimensions, which means the module consists of fewer layers.
CSPVoVNet
It combines CSPNet and VoVNet and takes the gradient path into account, so that the weights of different layers learn more diverse features, improving accuracy.
Deep supervision
When training deep networks, auxiliary heads and auxiliary classifiers are often added at intermediate layers to improve stability and convergence speed and to avoid vanishing gradients; that is, an auxiliary loss is used to train the weights of the shallow layers. This technique is called deep supervision.
Dynamic label assignment
The label assigner is a mechanism that considers the network's predictions together with the ground truth and then assigns soft labels. In the past, target labels were usually hard labels derived directly from the ground truth; in recent years, the model's predictions and the ground truth have been jointly optimized to obtain soft labels, a mechanism this paper calls the label assigner. The authors discuss three ways of assigning soft labels to the auxiliary head and the lead head:
Independent: the auxiliary head and the lead head each perform label assignment against the ground truth; this is the most common method today.
Lead head guided label assigner: since the lead head has a stronger learning ability than the auxiliary head, the soft label obtained by optimizing the lead head's predictions against the ground truth better expresses the distribution of, and correlation between, the data and the ground truth. This soft label is used as the training target for both the auxiliary head and the lead head, so the shallower auxiliary head can directly learn the information the lead head has already learned, while the lead head focuses on the residual information not yet learned.
Coarse-to-fine lead head guided label assigner: the soft label is again obtained by optimizing the lead head's predictions against the ground truth, but two different soft labels are generated: a fine label, identical to the lead head's soft label, and a coarse label, which is used for the auxiliary head.
Optimal Transport Assignment
The simplest approach to matching predictions to targets is to define an Intersection over Union (IoU) threshold and decide based on that. While this generally works, it becomes problematic with occlusions, ambiguity, or when multiple objects are very close together. Optimal Transport Assignment (OTA) aims to solve some of these problems by treating label assignment as a global optimization problem for each image. YOLOv7 implements simOTA (introduced in the YOLOX paper), a simplified version of the OTA problem.
Model EMA
When training a model, it can be beneficial to set the model weights to a moving average of the parameters observed across the entire training run, as opposed to using the parameters obtained after the last incremental update. This is often done by maintaining an exponential moving average (EMA) of the model parameters; in practice, this usually means maintaining another copy of the model to store the averaged weights. This technique has been employed in the training schemes of several popular models, such as MNASNet, MobileNet-V3 and EfficientNet.
The approach to EMA taken by the YOLOv7 authors is slightly different to other implementations: instead of using a fixed decay, the amount of decay changes based on the number of updates that have been made; a sketch follows.
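A sketch of EMA with an update-dependent decay; the ramp d(n) = decay * (1 - exp(-n / tau)) follows the form used in the YOLOv5/YOLOv7 codebases, but tau = 2000 and the other details here are assumptions:

```python
import copy, math
import torch

class ModelEMA:
    """Keep an exponential moving average of model parameters in a
    separate frozen copy of the model."""
    def __init__(self, model, decay=0.9999, tau=2000.0):
        self.ema = copy.deepcopy(model).eval()       # the copy holds the averages
        for p in self.ema.parameters():
            p.requires_grad_(False)
        self.updates, self.decay, self.tau = 0, decay, tau

    @torch.no_grad()
    def update(self, model):
        self.updates += 1
        d = self.decay * (1 - math.exp(-self.updates / self.tau))  # small early on
        src = model.state_dict()
        for k, v in self.ema.state_dict().items():
            if v.dtype.is_floating_point:
                v.mul_(d).add_(src[k].detach(), alpha=1 - d)
```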
Loss algorithm
We can break down the algorithm used in the YOLOv7 loss calculation into the following steps (a schematic sketch follows the list):
- For each FPN head (or each FPN head and Aux FPN head pair, if aux heads are used):
  - Find the center prior anchor boxes.
  - Refine the candidate selection through the simOTA algorithm (always using the lead FPN heads for this).
  - Obtain the objectness loss using binary cross-entropy between the predicted objectness probability and the Complete Intersection over Union (CIoU) with the matched target as ground truth. If there are no matches, this is 0.
  - If there are any selected anchor box candidates, also calculate (otherwise they are just 0):
    - The box (or regression) loss, defined as mean(1 - CIoU) over all candidate anchor boxes and their matched targets.
    - The classification loss, using binary cross-entropy between the predicted class probabilities for each anchor box and a one-hot encoded vector of the true class of the matched target.
  - If the model uses auxiliary heads, add each component obtained from the aux head to the corresponding main loss component (i.e., x = x + aux_wt*aux_x). The contribution weight aux_wt is a predefined hyperparameter.
  - Multiply the objectness loss by the corresponding FPN head weight (a predefined hyperparameter).
- Multiply each loss component (objectness, classification, regression) by its contribution weight (a predefined hyperparameter).
- Sum the already weighted loss components.
- Multiply the final loss value by the batch size.
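A schematic sketch of the final combination step; the structure of `per_head` and all the weight values are placeholders rather than the repository's actual defaults:

```python
def combine_yolov7_losses(per_head, batch_size, obj_head_weights,
                          w_box=0.05, w_obj=0.7, w_cls=0.3):
    """Combine per-head loss components as listed above. `per_head` is a
    list of {'box', 'obj', 'cls'} dicts, one per FPN head, with any aux-head
    contributions assumed already folded in (x = x + aux_wt * aux_x)."""
    box = sum(h['box'] for h in per_head)
    cls = sum(h['cls'] for h in per_head)
    obj = sum(h['obj'] * w for h, w in zip(per_head, obj_head_weights))
    total = w_box * box + w_obj * obj + w_cls * cls  # weight each component
    return total * batch_size                        # final scaling by batch size
```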
Using YOLOv7
GitHub address: https://github.com/WongKinYiu/yolov7
Format converter: https://github.com/wy17646051/UA-DETRAC-Format-Converter
Potential ideas
Efficiency
To improve the real-time performance of a detection network, researchers generally analyze the number of parameters, the amount of computation, and the computation density, considering model parameters, FLOPs, memory access cost, the input/output channel ratio, element-wise operations, and so on. These analyses are similar in spirit to those in ShuffleNetV2.
NAS (Neural Architecture Search)
NAS was an inspiring work out of Google that led to several follow-up works such as ENAS, PNAS, and DARTS. It involves training a recurrent neural network (RNN) controller using reinforcement learning (RL) to automatically generate architectures.
Vision Transformer
The core conclusion of the original ViT paper is that with enough pre-training data, ViT's performance exceeds that of CNNs, overcoming the transformer's lack of inductive bias and yielding better transfer results on downstream tasks. However, when the training dataset is not large enough, ViT usually performs worse than ResNets of the same size, because transformers lack the inductive bias, i.e., the prior knowledge and built-in assumptions, that CNNs have.
Improve anchor box selection
Datasets
PASCAL VOC 2007, VOC 2012, Microsoft COCO (Common Objects in Context).
UA-DETRAC: https://detrac-db.rit.albany.edu/
https://www.kaggle.com/datasets/patrikskalos/ua-detrac-fix-masks-two-wheelers?resource=download
https://colab.research.google.com/github/hardik0/Multi-Object-Tracking-Google-Colab/blob/main/Towards-Realtime-MOT-Vehicle-Tracking.ipynb#scrollTo=y6KZeLt9ViDe
https://github.com/hardik0/Towards-Realtime-MOT/tree/master
https://github.com/wy17646051/UA-DETRAC-Format-Converter/tree/main
MIO-TCD: https://tcd.miovision.com/
KITTI: https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark
TRANCOS: https://gram.web.uah.es/data/datasets/trancos/index.html
STREETS: https://www.kaggle.com/datasets/ryankraus/traffic-camera-object-detection (single class)
VERI-Wild: https://github.com/PKU-IMRE/VERI-Wild
https://universe.roboflow.com/7-class/11-11-2021-09.41
https://universe.roboflow.com/szabo/densitytrafficcontroller-1axlm
https://universe.roboflow.com/future-institute-of-technology-1wuwl/indian-vehicle-set-1
https://universe.roboflow.com/cv-2022-kyjj6/tesi
https://universe.roboflow.com/vehicleclassification-kxtkb/vehicle_classification-fvssn
https://universe.roboflow.com/urban-data/urban-data
https://www.kaggle.com/datasets/ashfakyeafi/road-vehicle-images-dataset
https://github.com/MaryamBoneh/Vehicle-Detection
References
https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173
https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-1-33220ebc1d09
https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-2-85ee99d114a1
https://medium.com/@chingi071/yolo%E6%BC%94%E9%80%B2-3-yolov4%E8%A9%B3%E7%B4%B0%E4%BB%8B%E7%B4%B9-5ab2490754ef
https://zhuanlan.zhihu.com/p/183261974
https://sh-tsang.medium.com/review-vovnet-osanet-an-energy-and-gpu-computation-efficient-backbone-network-for-real-time-3b26cd035887
https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-yolov7-%E8%AB%96%E6%96%87%E9%96%B1%E8%AE%80-97b0e914bdbe
https://towardsdatascience.com/yolov7-a-deep-dive-into-the-current-state-of-the-art-for-object-detection-ce3ffedeeaeb
https://towardsdatascience.com/neural-architecture-search-limitations-and-extensions-8141bec7681f
https://learnopencv.com/fine-tuning-yolov7-on-custom-dataset/#The-Training-Experiments-that-We-Will-Carry-Out
https://learnopencv.com/yolov7-object-detection-paper-explanation-and-inference/