transformer

Introduction

The Transformer is a deep learning architecture introduced in the paper “Attention is All You Need” by Vaswani et al., published in 2017.
The Transformer is based on the self-attention mechanism, which allows it to capture long-range dependencies in sequences more effectively than traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs). The key components of the Transformer are: the self-attention mechanism, the encoder-decoder architecture, multi-head attention, positional encoding, and feed-forward neural networks.

Self-Attention Mechanism

The self-attention mechanism allows the model to weigh the importance of different words in a sentence while encoding the sequence. It computes the attention scores for each word in the input sequence based on its relationships with other words. By attending to relevant words, the model can focus on the most informative parts of the sequence.

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the key-vector dimension used in the paper, 64; other values are possible, but this is the default), which leads to more stable gradients, and then pass the result through a softmax operation. Softmax normalizes the scores so they are all positive and add up to 1.

The fifth step is to multiply each value vector by the softmax score.

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
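As a minimal illustration of these six steps, here is a NumPy sketch of single-head scaled dot-product self-attention; the random matrices stand in for the trained Wq, Wk, Wv, and the dimensions are illustrative.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # step 1: query, key, value vectors
    scores = Q @ K.T                            # step 2: dot-product scores
    scores = scores / np.sqrt(K.shape[-1])      # step 3: scale by sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)       # step 4: softmax over each row
    return w @ V                                # steps 5-6: weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                     # 3 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)               # (3, 4): one output row per token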

Multi-Head Attention

To capture different types of dependencies and relationships, the Transformer uses multi-head attention. It performs self-attention multiple times with different learned projection matrices, allowing the model to attend to various aspects of the input.

With multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace. If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices. We concatenate the matrices and then multiply them by an additional weight matrix WO to condense these eight down into a single matrix.

sequence-to-sequence model

A sequence-to-sequence model takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items. The model is composed of an encoder and a decoder. The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item. The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks. By design, an RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence) and a hidden state. The word, however, needs to be represented by a vector; to transform a word into a vector, we turn to the class of methods called “word embedding” algorithms. The context vector turned out to be a bottleneck for these types of models, making it challenging for them to deal with long sentences. A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called “attention”, which allows the model to focus on the relevant parts of the input sequence as needed.

attention

An attention model differs from a classic sequence-to-sequence model in two main ways. First, the encoder passes a lot more data to the decoder: instead of passing only the last hidden state of the encoding stage, it passes all the hidden states to the decoder. Second, an attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to the current decoding time step, the decoder looks at the set of encoder hidden states it received, gives each hidden state a score, and multiplies each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.

Encoder-Decoder Architecture

The Transformer architecture consists of two main components: the encoder and the decoder. The encoder takes an input sequence and processes it, while the decoder generates an output sequence based on the encoded representation.

One detail in the architecture of the encoder that we need to mention before moving on is that each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it and is followed by a layer-normalization step.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence. The decoding steps repeat until a special symbol is reached, indicating that the Transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just as the encoders did. And just as with the encoder inputs, we embed and add positional encoding to the decoder inputs to indicate the position of each word.

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
The “Encoder-Decoder Attention” layer works just like multi-headed self-attention, except it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.

The decoder stack outputs a vector of floats. How do we turn that into a word? That is the job of the final Linear layer, which is followed by a Softmax layer. The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much larger vector called a logits vector. The softmax layer then turns those scores into probabilities (all positive, summing to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

Layer Normalization

In traditional normalization techniques like Batch Normalization, the activations of a layer are normalized by computing the mean and variance over a batch of examples. This normalization helps stabilize and accelerate the training process, especially for deeper networks. However, it introduces a dependency on the batch size during training, which can be problematic in scenarios where batch sizes vary or during inference when processing individual samples.

Layer Normalization addresses this dependency by computing the mean and variance across all the units within a single layer for each training example. This means that normalization is done independently for each sample and does not rely on batch statistics.
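A minimal NumPy sketch of the idea, assuming learned gain and bias parameters and a small eps for numerical stability; statistics are computed per sample, not per batch.

import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # x: (batch, features); mean and variance are taken over the feature axis of each sample
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias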

Positional Encoding

Since Transformers do not inherently have positional information like RNNs, positional encodings are added to the input embeddings. These positional encodings provide the model with information about the order of the elements in the input sequence, or the distance between different words in the sequence.
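A sketch of the sinusoidal positional encoding used in the original paper, where even dimensions use sine and odd dimensions use cosine; it assumes d_model is even and the dimensions here are illustrative.

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even indices
    pe[:, 1::2] = np.cos(angles)                      # odd indices
    return pe                                         # added to the input embeddings

print(positional_encoding(4, 8).shape)                # (4, 8)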

Training

loss function

cross entropy

The cross-entropy loss calculates the negative log-likelihood of the true class’s predicted probability.
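For a single example with true class y and predicted probabilities q, this is $L = -\log q(y)$; more generally, the cross entropy between a target distribution p and a predicted distribution q is $H(p, q) = -\sum_x p(x) \log q(x)$.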

Kullback–Leibler divergence

Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure of how one probability distribution diverges from another. KL divergence measures the average amount of information lost when using Q to approximate P. It is not symmetric.
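For discrete distributions P and Q, $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$, which equals the cross entropy $H(P, Q)$ minus the entropy $H(P)$.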

decoding

greedy decoding

In greedy decoding, at each step of sequence generation, the model selects the most likely output token based on its predicted probability distribution. It chooses the token with the highest probability without considering the impact on future decisions. This means that the model makes locally optimal choices at each step without considering the global context of the entire sequence. For example, in machine translation, a model using greedy decoding will predict each target word one at a time, selecting the word with the highest probability given the source sentence and previously generated words. The process continues iteratively until an end-of-sentence token is generated.

Beam search

In beam search, instead of selecting only the most likely token at each step, the algorithm maintains a fixed-size list, known as the “beam,” containing the most promising candidate sequences. The beam size determines how many candidate sequences are considered at each decoding step.
At the beginning of the decoding process, the beam is initialized with a single token representing the start of the sequence. At each step, the model generates the probabilities for the next possible tokens and expands the beam with the top-k most likely candidate sequences based on their cumulative probabilities. The k represents the beam size, and higher values of k result in a more diverse exploration of possibilities.

references

http://fancyerii.github.io/2019/03/09/transformer-illustrated/#%E6%A6%82%E8%BF%B0
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
http://jalammar.github.io/illustrated-transformer/
https://colah.github.io/posts/2015-09-Visual-Information/
https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained

segmentation

Image segmentation

Image segmentation is a sub-domain of computer vision and digital image processing which aims at grouping similar regions or segments of an image under their respective class labels.

Semantic segmentation

Semantic segmentation refers to the classification of pixels in an image into semantic classes.

Instance segmentation

Instance segmentation models classify pixels into categories on the basis of “instances” rather than classes.

Panoptic segmentation

Panoptic segmentation can be expressed as the combination of semantic segmentation and instance segmentation where each instance of an object in the image is segregated and the object’s identity is predicted.

Neural networks that perform segmentation typically use an encoder-decoder structure where the encoder is followed by a bottleneck and a decoder or upsampling layers directly from the bottleneck (like in the FCN).

Regular Expressions

Regular expressions

A formal language for specifying text strings

rules

Disjunctions:
Letters inside square brackets []: [A-Z]
pipe |: a|b|c
Negation in disjunction: [^Ss] (the caret inside the brackets means “not”)

?: When placed after a character or a group, the question mark makes it optional, meaning that the character or group can occur zero or one time. When placed after a quantifier such as *, +, or ?, it modifies the quantifier to be non-greedy or lazy. A non-greedy quantifier matches as few characters as possible, while a greedy quantifier matches as many characters as possible.
*: 0 or more of the previous char
+: 1 or more of the previous char
.: any char

Anchors:
^: The beginning. $: The end.
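A few of these rules in action, using Python's re module (the patterns and test string are made up for illustration):

import re

text = "The cat sat. the cats sat."
print(re.findall(r"[Tt]he", text))        # disjunction with brackets -> ['The', 'the']
print(re.findall(r"cats?", text))         # ? makes the 's' optional  -> ['cat', 'cats']
print(re.findall(r"^The", text))          # ^ anchors at the beginning -> ['The']
print(re.findall(r"sat\.$", text))        # $ anchors at the end       -> ['sat.']
print(re.findall(r"ca.", text))           # . matches any character    -> ['cat', 'cat']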

chatgpt

chatgpt api

parameters

Temperature and top_p sampling are two important parameters that you can use with OpenAI's GPT API to help control text-generation behavior.
Temperature is a parameter that controls the “creativity” or randomness of the text generated by GPT-3. A higher temperature (e.g., 0.7) results in more diverse and creative output, while a lower temperature (e.g., 0.2) makes the output more deterministic and focused. In practice, temperature affects the probability distribution over the possible tokens at each step of the generation process. A temperature of 0 would make the model completely deterministic, always choosing the most likely token.

Top_p sampling is an alternative to temperature sampling. Instead of considering all possible tokens, GPT-3 considers only a subset of tokens (the nucleus) whose cumulative probability mass adds up to a certain threshold (top_p).
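A minimal sketch of setting these two parameters with the openai Python package's pre-1.0 ChatCompletion interface; the API key, model name, and prompt here are placeholders.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",                       # assumed model name
    messages=[{"role": "user", "content": "Write a haiku about autumn."}],
    temperature=0.7,                             # higher -> more random/creative output
    top_p=0.95,                                  # nucleus sampling threshold
)
print(response["choices"][0]["message"]["content"])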

chatgpt methods

We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format.

To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.


reference

https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api-a-few-tips-and-tricks-on-controlling-the-creativity-deterministic-output-of-prompt-responses/172683

https://openai.com/blog/chatgpt

python(5) subprocess and logging

subprocess

You can use the Python subprocess module to create new processes, connect to their input and output, and retrieve their return codes and/or output of the process.

subprocess run

The subprocess.run() method is a convenient way to run a subprocess and wait for it to complete. Once the subprocess is started, the run() method blocks until the subprocess completes and returns a CompletedProcess object, which contains the return code and output of the subprocess. The check argument is an optional argument of subprocess.run(); it is a boolean value that controls whether the function should check the return code of the command being run. When check is set to True, the function checks the return code of the command and raises a CalledProcessError exception if the return code is non-zero. The exception has the return code, stdout, stderr, and command as attributes.
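A small example (the command here is just an illustration):

import subprocess

# Run a command, capture its output as text, and raise CalledProcessError on failure.
result = subprocess.run(
    ["python", "--version"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.returncode)   # 0 on success
print(result.stdout)       # e.g. "Python 3.11.4"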

Note that when there is an “&” at the end of the command, run() will not wait for the process to end (the shell backgrounds it and returns immediately); using eval $command in a shell behaves the same way. Don't use a trailing &.

subprocess Popen

subprocess.Popen is a lower-level interface to running subprocesses, while subprocess.run is a higher-level wrapper around Popen that is intended to be more convenient to use. Popen allows you to start a new process and interact with its standard input, output, and error streams. It returns a handle to the running process that can be used to wait for the process to complete, check its return code, or terminate it.
In general, you should use run if you just need to run a command and capture its output, and Popen if you need more control over the process, such as interacting with its input and output streams. The Popen class has several methods that allow you to interact with the process, such as communicate(), poll(), wait(), terminate(), and kill().

subprocess call

subprocess.call() is a function in the Python subprocess module that is used to run a command in a separate process and wait for it to complete. It returns the return code of the command, which is zero if the command was successful and non-zero if it failed. subprocess.call() is useful when you want to run a command and check the return code, but do not need to capture the output.

subprocess check_output

check_output is a function in the subprocess module that is similar to run(), but it only returns the standard output of the command, and raises a CalledProcessError exception if the return code is non-zero.

Subprocess Pipe

A pipe is a unidirectional communication channel that connects one process's standard output to another's standard input. A pipe can connect the output of one command to the input of another, allowing the output of the first command to be used as input to the second command. Pipes can be created with the Popen class by specifying the stdout or stdin argument as subprocess.PIPE.
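A sketch of chaining two commands with Popen and PIPE, equivalent to ls | grep py in a shell; the commands are illustrative.

import subprocess

# ls | grep py
ls = subprocess.Popen(["ls"], stdout=subprocess.PIPE)
grep = subprocess.Popen(["grep", "py"], stdin=ls.stdout, stdout=subprocess.PIPE, text=True)
ls.stdout.close()                 # let ls receive SIGPIPE if grep exits early
output, _ = grep.communicate()    # read grep's output and wait for it to finish
print(output)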

logging

Logging provides a set of convenience functions for simple logging usage. These are debug(), info(), warning(), error() and critical().
The default level is WARNING, which means that only events of this level and above will be tracked, unless the logging package is configured to do otherwise.

logging config

logging.basicConfig(format='%(levelname)s %(asctime)s %(process)d %(message)s', level=logging.DEBUG)
If nothing is printed after configuring, note that you should call basicConfig() before importing other libraries, in case one of them configures logging first and your configuration is overridden.
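A minimal usage example:

import logging

logging.basicConfig(
    format="%(levelname)s %(asctime)s %(process)d %(message)s",
    level=logging.DEBUG,
)

logging.debug("debug details")         # shown because level is DEBUG
logging.info("informational message")
logging.warning("something looks off")
logging.error("something failed")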

reference

https://www.datacamp.com/tutorial/python-subprocess
https://docs.python.org/3/howto/logging.html

Large Language Model

basic ideas

Zero-Shot Learning

Zero-shot learning is a setting in which the model learns to classify classes that it has not seen before.

Contrastive Language-Image Pretraining (CLIP)

Just like traditional supervised models, CLIP has two stages: the training stage (learning) and the inference stage (making predictions).
In the training stage, CLIP learns about images by “reading” auxiliary text (i.e., sentences) corresponding to each image. CLIP aims to minimize the difference between the encodings of an image and its corresponding text.
In the inference stage, we set up the typical classification task by first obtaining a list of all possible labels. Each label is then encoded by the pretrained text encoder from the training stage. Now that we have the label encodings, T₁ to Tₙ, we take the image that we want to classify, feed it through the pretrained image encoder, and compute how similar the image encoding is to each text-label encoding using a distance metric called cosine similarity.

contrastive learning

Contrastive learning is a machine learning technique used to learn the general features of a dataset without labels by teaching the model which data points are similar or different. It looks at which pairs of data points are “similar” and “different” in order to learn higher-level features about the data, before even having a task such as classification or segmentation.

SimCLRv2

The entire process can be described concisely in three basic steps:

For each image in our dataset, we can perform two augmentation combinations (i.e. crop + resize + recolor, resize + recolor, crop + recolor, etc.). We want the model to learn that these two images are “similar” since they are essentially different versions of the same image.

To do so, we can feed these two images into our deep learning model (Big-CNN such as ResNet) to create vector representations for each image. The goal is to train the model to output similar representations for similar images.

Lastly, we try to maximize the similarity of the two vector representations by minimizing a contrastive loss function.

Meta-learning

The idea of meta-learning is to learn the learning process.

In-context Learning

During in-context learning, we give the LM a prompt that consists of a list of input-output pairs that demonstrate a task. At the end of the prompt, we append a test input and allow the LM to make a prediction just by conditioning on the prompt and predicting the next tokens.

Instruction learning

Instruction learning is an idea proposed by the team led by Quoc V. Le at Google in the 2021 paper “Finetuned Language Models Are Zero-Shot Learners”. Both instruction learning and prompt learning aim to explore the knowledge inherent in language models. The difference is that prompts aim to stimulate the completion ability of the language model, such as generating the second half of a sentence based on the first half or filling in blanks, whereas instructions aim to stimulate the understanding ability of the language model by providing more explicit instructions, enabling the model to take correct actions. The advantage of instruction learning is that, after fine-tuning through multitask learning, the model can also perform zero-shot learning on other tasks, while prompt learning is specific to one task, so its generalization ability is not as strong as that of instruction learning.

Diffusion Model

In machine learning, a diffusion model is a class of generative models that learn to reverse a gradual noising process. A forward process progressively adds noise to the training data over many steps, and the model is trained to undo this corruption step by step; new samples are then generated by starting from pure noise and iteratively denoising it.

Stable Diffusion

Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions.

Prompt engineering

Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering focuses on crafting the optimal textual input by selecting the appropriate words, phrases, sentence structures, and punctuation.

RLHF(Reinforcement Learning from Human Feedback)

generation

Auto-regressive language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. The length T of the word sequence is usually determined on the fly and corresponds to the timestep t=T at which the EOS token is generated from the probability distribution.
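Written out, with $W_0$ denoting the initial context word sequence:
$P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0)$, with $w_{1:0} = \emptyset$.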

decoding methods

Greedy search

Greedy search is the simplest decoding method. It selects the word with the highest probability as its next word.
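In the transformers library this is what generate() does by default (no sampling, a single beam), reusing the model_inputs naming from the examples later in this section:
greedy_output = model.generate(**model_inputs, max_new_tokens=40)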

Beam search

Beam search reduces the risk of missing hidden high-probability word sequences by keeping the most likely num_beams hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Beam search will always find an output sequence with higher probability than greedy search, but it is not guaranteed to find the most likely output.
The most common n-grams penalty makes sure that no n-gram appears twice by manually setting the probability of next words that could create an already seen n-gram to 0. Nevertheless, n-gram penalties have to be used with care. An article generated about the city New York should not use a 2-gram penalty or otherwise, the name of the city would only appear once in the whole text!
When using transformers library:
beam_output = model.generate(**model_inputs,max_new_tokens=40,num_beams=5,no_repeat_ngram_size=2,early_stopping=True)

Another important feature about beam search is that we can compare the top beams after generation and choose the generated beam that fits our purpose best.
In transformers, we simply set the parameter num_return_sequences to the number of highest scoring beams that should be returned. Make sure though that num_return_sequences <= num_beams!

sampling

In its most basic form, sampling means randomly picking the next word according to its conditional probability distribution.

temperature

The temperature parameter adjusts the probability distribution of the output. The larger the value, the smoother the distribution, i.e., the gap between high and low probabilities narrows (the model is less sure about the output); the smaller the value, the more pronounced the gap between high and low probabilities (the model is more sure about the output). As the temperature tends to 0, it becomes the same as greedy search.

Top-K

In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words. GPT-2 adopted this sampling scheme, which was one of the reasons for its success in story generation.

Top-P

Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words.

While in theory, Top-p seems more elegant than Top-K, both methods work well in practice. Top-p can also be used in combination with Top-K, which can avoid very low ranked words while allowing for some dynamic selection.

sample_outputs = model.generate(**model_inputs, max_new_tokens=40,do_sample=True,top_k=50,top_p=0.95,num_return_sequences=3)

models:

LLaMA

LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.

FastChat

other models

https://github.com/baichuan-inc/baichuan-7B

LLM benchmarks

MMLU

The MMLU benchmark covers 57 subjects spanning general knowledge areas such as the humanities, social sciences, and STEM. Each question contains four possible options, and each question has only one correct answer.
There are two main ways to get information from a model in order to evaluate it:
Get the output probabilities for a particular set of tokens and compare them to the alternatives in the sample;
Take the text generated by the model (iteratively, token by token, using the decoding methods described above), and compare these texts with the alternatives in the sample.

C-Eval

A Chinese knowledge and reasoning test set covering four major fields: humanities, social sciences, natural sciences, and other disciplines. It consists of 52 subjects, including calculus, linear algebra, and more, covering topics from secondary school to university-level studies, graduate studies, and professional examinations. The test set comprises a total of 13,948 questions.

code generation benchmarks

HumanEval

HumanEval was proposed to evaluate functional correctness on a set of 164 handwritten programming problems with unit tests.
Functional correctness is measured for synthesizing programs from docstrings. Each problem includes a function signature, docstring, body, and several unit tests. The pass@k metric is used: k code samples are generated per problem, and the problem counts as solved if any sample passes the unit tests.
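The HumanEval paper estimates pass@k with an unbiased estimator: generate $n \ge k$ samples per problem, count the number $c$ that pass the unit tests, and average $1 - \binom{n-c}{k} / \binom{n}{k}$ over problems.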

MBPP (Mostly Basic Python Programming)

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases.

APPS(Automated Programming Progress Standard)

The APPS dataset consists of 5,000 training and 5,000 test examples of coding problems. Most of the APPS test problems are not formulated as single-function synthesis tasks, but rather as full-program synthesis. The APPS benchmark attempts to mirror how human programmers are evaluated by posing coding problems in unrestricted natural language and evaluating the correctness of solutions.

MultiPL-E

MultiPL-E is a multi-programming-language benchmark for evaluating the code generation performance of large language models (LLMs) of code.

DS-1000

A code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas.

references

https://towardsdatascience.com/understanding-zero-shot-learning-making-ml-more-human-4653ac35ccab
https://towardsdatascience.com/understanding-contrastive-learning-d5b19fd96607
http://ai.stanford.edu/blog/understanding-incontext/
https://www.8btc.com/article/6813626
https://en.wikipedia.org/wiki/Stable_Diffusion
LLaMA: Open and Efficient Foundation Language Models
https://www.promptingguide.ai/
https://medium.com/geekculture/list-of-open-sourced-fine-tuned-large-language-models-llm-8d95a2e0dc76
https://nl2code.github.io/
https://yaofu.notion.site/C-Eval-6b79edd91b454e3d8ea41c59ea2af873
https://huggingface.co/blog/zh/evaluating-mmlu-leaderboard
https://github.com/datawhalechina/hugging-llm/blob/main/content/ChatGPT%E5%9F%BA%E7%A1%80%E7%A7%91%E6%99%AE%E2%80%94%E2%80%94%E7%9F%A5%E5%85%B6%E4%B8%80%E7%82%B9%E6%89%80%E4%BB%A5%E7%84%B6.md
https://huggingface.co/blog/how-to-generate

Introduction to deep learning in computer vision

Basic architecture

CNN

Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size.

We assume that the size of the input image is n×n and the size of the filter is f×f (note that f is generally odd). The size of the output image after convolution is (n−f+1)×(n−f+1). During the convolution process, padding is sometimes necessary to avoid information loss, and adjusting the stride allows some information to be compressed. If we want to perform convolution on a three-channel RGB image, the corresponding filter also has three channels. The process involves convolving each channel with its corresponding filter slice and summing the results across the three channels; the resulting sum (27 multiplications for a 3×3×3 filter) is one pixel value of the output image. The filter slices for different channels can be different. When the input has specific height, width, and channel dimensions, the filters can have a different height and width, but their number of channels must match the input. Pooling layers are commonly included in many CNNs; their purpose is to reduce the size of the representation, improve computational speed, and reduce noise so that the extracted features are more robust.
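A small helper that computes the output size with padding p and stride s, which reduces to (n − f + 1) when p = 0 and s = 1 (a sketch; floor division follows the usual convention):

def conv_output_size(n, f, p=0, s=1):
    # output spatial size of a convolution on an n x n input with an f x f filter,
    # padding p on each side and stride s
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))             # 4  -> (6 - 3 + 1)
print(conv_output_size(6, 3, p=1))        # 6  -> "same" padding for stride 1
print(conv_output_size(28, 5, p=0, s=1))  # 24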

Important networks in the history of computer vision

LeNet-5

LeNet-5, developed by Yann LeCun et al. in 1998, was one of the first successful convolutional neural networks (CNNs) for handwritten digit recognition. It laid the foundation for modern CNN architectures and demonstrated the power of deep learning in computer vision tasks (“Gradient-Based Learning Applied to Document Recognition”, Yann LeCun et al., 1998). LeNet's architecture has seven layers: a convolutional layer (Convolutions, C1), a pooling layer (Subsampling, S2), a convolutional layer (C3), a pooling layer (S4), a fully connected convolutional layer (C5), a fully connected layer (F6), and a Gaussian connection layer (output). The input is a 28x28 single-channel image, and the filter size is 5x5. The output channels of the first and second filters are 6 and 16 respectively, and both use Sigmoid as the activation function.
The pooling window is 2x2 with stride 2, using average pooling. The last fully connected layers have 120 and 84 neurons, respectively. The final output layer is the Gaussian connection layer, which uses an RBF function (radial Euclidean distance function) to compute the Euclidean distance between the input vector and a parameter vector.

AlexNet

AlexNet, introduced by Alex Krizhevsky et al. in 2012, was a breakthrough CNN architecture that won the ImageNet competition and popularized deep learning in computer vision. It demonstrated the effectiveness of deep CNNs for image classification tasks and paved the way for subsequent advancements (“ImageNet Classification with Deep Convolutional Neural Networks”, Alex Krizhevsky et al., 2012). AlexNet's architecture has eight layers: five convolutional layers followed by three fully connected layers, which is deeper than the LeNet model. The first through fifth layers are convolutional; the first, second, and fifth convolutional layers are followed by pooling layers, using max pooling with a 3x3 window and a stride of 2. The sixth through eighth layers are fully connected. Replacing the Sigmoid used by LeNet with ReLU avoids the vanishing-gradient problem caused by very deep networks or very small gradients.

VGGNet

VGGNet, proposed by Karen Simonyan and Andrew Zisserman in 2014, is known for its simplicity and depth. It consists of deep networks with stacked 3x3 convolutional layers, showing that increasing network depth leads to improved performance on image classification tasks (“Very Deep Convolutional Networks for Large-Scale Image Recognition”, Karen Simonyan and Andrew Zisserman, 2014). Compared with AlexNet, VGGNet adopts a deeper network. It is characterized by repeated use of the same basic module, and it uses small convolution kernels instead of the medium and large kernels in AlexNet. Its architecture consists of n VGG blocks followed by 3 fully connected layers. A VGG block is composed of some number of 3x3 convolutional layers (kernel size 3x3, stride 1, padding “same”; the number of layers is a hyperparameter) followed by 2x2 max pooling (pool size 2, stride 2). VGGNet has many variants, such as VGG11, VGG13, VGG16, and VGG19; the difference lies in the number of layers (convolutional plus fully connected). The common VGGNet refers to VGG16.

Network in Network

“Network in Network” (NiN) refers to a neural network architecture proposed by Lin et al. in their paper titled “Network In Network” published in 2014. NiN is designed to enhance the expressive power of deep neural networks by incorporating micro neural networks called “MLPs (Multi-Layer Perceptrons)” or “1x1 Convolutions” within the network structure.

The key idea behind NiN is to replace traditional convolutional layers with what they call “MLP Convolutional Layers” or “1x1 Convolutional Layers.” These layers consist of a series of fully connected layers (MLPs) applied at every pixel location of the input. The purpose is to capture complex local feature interactions and enable more non-linear transformations. By using 1x1 convolutions, NiN can model non-linear relationships within the channels of the input feature map. This allows for richer and more powerful representations compared to standard convolutional layers.
The 1x1 convolutional layer not only integrates the information of different channels at the same position, but also can reduce or increase the dimension of the channel.

GoogLeNet (Inception-v1)

GoogLeNet, presented by Christian Szegedy et al. in 2015, introduced the Inception module and demonstrated the importance of multi-scale feature extraction. It achieved high accuracy while maintaining computational efficiency, inspiring subsequent Inception versions and influencing network designs (“Going Deeper with Convolutions”, Christian Szegedy et al., 2015).
GoogLeNet was designed to address the challenges of deep neural networks, such as computational efficiency and overfitting, while maintaining high accuracy in image classification tasks. It introduced several novel concepts and architectural innovations that made it stand out from previous CNN architectures at the time.

The key feature of GoogLeNet is the Inception module, which utilizes parallel convolutional filters of different sizes (1x1, 3x3, 5x5) to capture features at various scales. This allows the network to learn and represent both local and global features effectively. Additionally, it incorporates 1x1 convolutions for dimensionality reduction and introduces a technique called “bottleneck” layers to reduce the computational complexity.

Inception

In the context of computer vision, “inception” refers to the Inception module or the Inception architecture used in deep convolutional neural networks (CNNs). The Inception module was introduced in the GoogLeNet architecture (also known as Inception-v1) as a key component for efficient and effective feature extraction. The Inception module aims to capture multi-scale features by employing multiple parallel convolutional filters of different sizes within the same layer. By using a combination of 1x1, 3x3, and 5x5 convolutional filters, the Inception module allows the network to learn and extract features at various spatial scales. The module extracts different features through convolutions of three different sizes plus 3x3 max pooling, and then concatenates the four results along the channel axis. Increasing the width of the network in this way captures more features and details of the picture. To keep the spatial sizes of the four branch outputs identical so they can be concatenated, both the convolutional branches and the pooling branch use padding “same” and stride 1.

ResNet

ResNet, developed by Kaiming He et al. in 2015, introduced the concept of residual learning. It utilized skip connections or shortcuts to address the vanishing gradient problem and enabled training of extremely deep networks, leading to significant performance gains in image classification and other tasks.”Deep Residual Learning for Image Recognition” by Kaiming He et al. (2015).

DenseNet

DenseNet, introduced by Gao Huang et al. in 2016, focused on dense connectivity patterns between layers. It aimed to alleviate the vanishing gradient problem, promote feature reuse, and encourage better gradient flow. DenseNet achieved competitive results while reducing the number of parameters compared to other architectures (“Densely Connected Convolutional Networks”, Gao Huang et al., 2016).

ResNeXt

ResNeXt is a convolutional neural network (CNN) architecture that builds upon the concepts introduced by the ResNet (Residual Network) model. ResNeXt was proposed by Xie et al. in their paper titled “Aggregated Residual Transformations for Deep Neural Networks” in 2017.

The main idea behind ResNeXt is to leverage the concept of “cardinality” to improve the representational power of the network. Cardinality refers to the number of independent pathways or branches within a block of the network. In ResNeXt, instead of using a single pathway in each block, multiple parallel pathways are employed.

references

https://juejin.cn/post/7104845694225088525
https://www.showmeai.tech/article-detail/221
https://medium.com/ching-i/%E5%8D%B7%E7%A9%8D%E7%A5%9E%E7%B6%93%E7%B6%B2%E7%B5%A1-cnn-%E7%B6%93%E5%85%B8%E6%A8%A1%E5%9E%8B-lenet-alexnet-vgg-nin-with-pytorch-code-84462d6cf60c

time complexity

time complexity

big O notation

Big O notation measures the asymptotic growth of a function: f(n) = O(g(n)) if, for all sufficiently large n, f(n) is at most a constant factor larger than g(n).

Ω and Θ notation

We say f(n) = Ω(g(n)) if g(n) = O(f(n)).
We say f(n) = Θ(g(n)) if f(n) = O(g(n)) and g(n) = O(f(n)).

types of complexity

Worst-case complexity: what is the largest possible running time on any input of size n?
Average-case complexity: what is the average running time on a random input of size n?
Best-case complexity: what is the smallest possible running time on any input of size n?

Graph algorithm

ways of representing graphs

adjacency matrix

For a graph with n vertices this is an n × n matrix A, where $A_{ij} = 1$ if there is an edge from node i to node j, and $A_{ij} = 0$ otherwise.
If the graph is undirected, the matrix is symmetric.

adjacency lists

For each vertex, keep a list of its neighbors.

incidence matrix

The incidence matrix of an undirected graph with n vertices and m edges is an n × m matrix B where $B_{ij} = 1$ if the i-th vertex is part of the j-th edge, and $B_{ij} = 0$ otherwise.

Two fundamental graph exploration algorithms

Depth First Search (DFS)

Breadth First Search (BFS)

For the BFS tree, this gives the shortest (fewest number of steps) paths from s to all other nodes

Greedy Algorithms

Greedy algorithms are algorithms that build a solution step by step by always choosing the currently best option.

Interval Scheduling

Input: A list of intervals.
Output: The maximum number of these intervals that can be chosen without any overlaps.
Solution: Repeatedly pick the interval that ends first among those compatible with what has already been chosen (see the sketch below).
To prove correctness of such an algorithm, a common strategy for analyzing greedy algorithms is to prove that the algorithm always “stays ahead” of the optimal solution.
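A minimal sketch of this earliest-finish-time greedy; intervals are (start, end) pairs and the example data is made up.

def max_non_overlapping(intervals):
    # Sort by finishing time and greedily take every interval
    # that starts no earlier than the last chosen one ends.
    chosen = []
    last_end = float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen

print(max_non_overlapping([(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9), (6, 10), (8, 11)]))
# [(1, 4), (5, 7), (8, 11)]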

Job Scheduling With Minimum Lateness

Input: A list of jobs, where each job has a deadline $d_i$ and a duration $t_i$ (how long it takes to finish the job).
Output: The smallest possible maximum lateness in a schedule for doing all jobs.
Solution: Schedule the jobs in order of increasing deadline $d_i$ (always pick the job with the smallest $d_i$ next).

Shortest path

It is helpful to instead consider a more general problem: find the shortest paths from s to all other vertices. Dijkstra's algorithm maintains a set D of vertices for which we have already found the shortest path; at each step it adds the vertex outside D that is closest to s when using only vertices in D as intermediate vertices.
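A compact sketch using a binary heap; the graph is an adjacency list mapping each vertex to (neighbor, weight) pairs with non-negative weights, and the example graph is made up.

import heapq

def dijkstra(graph, s):
    # graph: dict mapping vertex -> list of (neighbor, weight)
    dist = {s: 0}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

graph = {"s": [("a", 1), ("b", 4)], "a": [("b", 2), ("t", 6)], "b": [("t", 3)], "t": []}
print(dijkstra(graph, "s"))   # {'s': 0, 'a': 1, 'b': 3, 't': 6}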

Divide & Conquer

Algorithms that split the input into significantly smaller parts, recursively solves each part, and then combines the subresults (somehow).

Merge sort

O(n log n).

Polynomial multiplication

$T(n) = O(n^{1.59})$.
Using FFT, get time O(n log n) for Polynomial Multiplication

Unit cost model and Bit cost model

Unit cost model: assume all numbers fit in machine registers so that basic arithmetic operations take constant time.
Bit cost model: account for size of numbers and the time it takes to manipulate them.

Integer multiplication

Karatsuba’s algorithm: $T(n) = O(n^{1.59})$.

Master Theorem
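One common simplified statement, for divide-and-conquer recurrences of the form $T(n) = aT(n/b) + O(n^d)$ with $a > 0$, $b > 1$, $d \ge 0$:
If $d > \log_b a$, then $T(n) = O(n^d)$.
If $d = \log_b a$, then $T(n) = O(n^d \log n)$.
If $d < \log_b a$, then $T(n) = O(n^{\log_b a})$.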

Dynamic Programming

Split a problem into smaller subproblems such that results from one
subproblem can be reused when solving others

Fibonacci numbers

The Fibonacci numbers are a classic number sequence in mathematics, defined by the linear recurrence $f_0 = 0$, $f_1 = 1$, and $f_n = f_{n-1} + f_{n-2}$ for $n \ge 2$.
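A top-down (memoized) sketch; computing $f_n$ this way takes O(n) additions instead of the exponential time of the naive recursion.

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n          # f0 = 0, f1 = 1
    return fib(n - 1) + fib(n - 2)

print(fib(10))  # 55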

Weighted Interval Scheduling

Input: A list of intervals $[s_1, t_1], [s_2, t_2], \ldots, [s_n, t_n]$, where each interval $[s_i, t_i]$ has a weight $w_i$.
Output: The maximum total weight of a subset of these intervals that can be chosen without any overlaps.

Knapsack

Input: A capacity C and a list of objects, where each object has a value $v_i$ and a weight $w_i$.
Output: A subset S of objects such that $\sum_{i \in S} w_i \le C$ and $\sum_{i \in S} v_i$ is maximized.
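A bottom-up 0/1 knapsack sketch in O(nC) time, where dp[c] is the best value achievable with capacity c; integer weights are assumed and the example data is made up.

def knapsack(capacity, values, weights):
    # dp[c] = best total value using capacity c with the items seen so far
    dp = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):   # iterate backwards so each item is used at most once
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]

print(knapsack(10, values=[10, 40, 30, 50], weights=[5, 4, 6, 3]))  # 90 (take the items worth 40 and 50)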

top-down and bottom-up fashion

top-down fashion: we start at the end result and
recursively compute results for relevant subproblems.

bottom-up fashion: we iteratively compute results for larger and larger subproblems.

Characteristics of dynamic programming

A problem is amenable to dynamic programming if we can define a set of subproblems such that:

  1. The number of different subproblems is as small as possible.
  2. There is some ordering of subproblems from “small” to “large”.
  3. The value of a subproblem can be efficiently computed given the values of some set of smaller subproblems.

Sequence Alignment

Input: Strings x and y of lengths m and n, and parameters $\sigma$ and $\alpha$.
Output: The minimum cost of an alignment of x and y with parameters $\sigma$ and $\alpha$.
$\alpha$ is the cost of aligning two different characters with each other.
$\sigma$ is the cost of not aligning a character (a gap).

Matrix Chain Multiplication

Network Flow

The Max-Flow problem

Input: A flow network G.
Output: A flow f maximizing the value v(f).
Solution: The Ford-Fulkerson algorithm, $O(C(m + n))$; the scaling algorithm, $O(m^2 \log C)$; or the Edmonds-Karp algorithm, $O(nm(n + m))$.

Edge Cuts

An edge cut of a graph is a set of edges such that their removal would disconnect the graph.

Minimum s-t-Cut

Input: A flow network G with source s and sink t.
Output: An s-t cut (A, B) of G minimizing the capacity c(A, B).

The Max-Flow-Min-Cut Theorem

For every flow network G, the maximum flow from s to t equals the
minimum capacity of an s-t cut in G.

Vertex Cuts

A vertex cut in a graph is a set of vertices such that if we remove
them, the graph splits into more than one connected component.

Matchings

A matching in a graph is a set M of edges such that no vertex appears in more than one edge of M. Of particular interest to us will be bipartite graphs.

Maximum Bipartite Matching

Input: A bipartite graph G
Output: A matching M in G of maximum possible size.

Edge-Disjoint Paths

Given a directed graph with source and sink, what is maximum
number of edge-disjoint paths from s to t?
(edge-disjoint = no edge used by more than one path)

Project Selection?

object detection

Object detection

Object detection is the field of computer vision that deals with the localization and classification of objects contained in an image or video.
Deep learning-based approaches use neural network architectures like RetinaNet, YOLO (You Only Look Once), CenterNet, SSD (Single Shot MultiBox Detector), and region-proposal methods (R-CNN, Fast R-CNN, Faster R-CNN, Cascade R-CNN) for feature detection of the object, followed by identification into labels. The YOLO series currently provides the state of the art in real-time object detection.

Object detection usually consists of the following parts:
Input: the input image.
Backbone: a backbone network pre-trained on ImageNet.
Neck: usually used to extract feature maps of different levels.
Head: predicts the object category and the bounding box (bndBox); usually divided into two types, dense prediction (one-stage) and sparse prediction (two-stage).

metric

mAP

Mean average precision (mAP) is the average of the AP over all categories. The AP metric is the area under the precision-recall (PR) curve, so it provides a balanced assessment of precision and recall. The PR curve is drawn with recall on the x-axis and precision on the y-axis; the higher the precision and recall, the better the model, so the closer the curve is to the upper-right corner, the better. The AP metric incorporates the Intersection over Union (IoU) measure to assess the quality of the predicted bounding boxes: a prediction is counted as a true positive (TP) if its IoU with a ground-truth box exceeds a threshold (usually 0.5), and each ground-truth box can be matched at most once.

Intersection over Union(IoU)

IoU is the ratio of the intersection area to the union area of the predicted bounding box and the ground truth bounding box. It measures the overlap between the ground truth and predicted bounding boxes.
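A sketch of IoU for axis-aligned boxes given in (x1, y1, x2, y2) format; the boxes below are made up.

def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2) with x1 < x2 and y1 < y2
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143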

Flops and FPS

FLOPS (Floating-Point Operations Per Second) is a measure of a computer's or processor's performance in terms of the number of floating-point operations it can perform per second; higher FLOPS values generally indicate faster computational capabilities. FPS (Frames Per Second) is a measure of how many individual frames (images) a video system can display or process per second.

Non-Maximum Suppression (NMS)

Non-Maximum Suppression (NMS) is a post-processing technique used in object detection algorithms to reduce the number of overlapping bounding boxes and improve the overall detection quality.
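A minimal greedy NMS sketch over axis-aligned (x1, y1, x2, y2) boxes: keep the highest-scoring box, suppress remaining boxes whose IoU with it exceeds a threshold, and repeat. The boxes and scores below are made up.

def nms(boxes, scores, iou_threshold=0.5):
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union

    # process boxes in order of decreasing confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep   # indices of the kept boxes

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box is suppressed by the first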

Model History

Traditionally, before deep learning took off, object detection was done with the Viola-Jones detector (Viola and Jones, 2001), the Histogram of Oriented Gradients (HOG) detector, or the Deformable Part-based Model (DPM). With deep learning, object detectors generally fall into two categories: one-stage detectors and two-stage detectors. The two-stage line started with Regions with CNN features (R-CNN); Spatial Pyramid Pooling Networks (SPPNet), Fast R-CNN, Faster R-CNN, and Feature Pyramid Networks (FPN) were proposed after it. Limited by the poor speed of two-stage detectors, one-stage detectors emerged, with You Only Look Once (YOLO) as the first representative. Subsequent versions of YOLO, the Single Shot MultiBox Detector (SSD), RetinaNet, CornerNet, CenterNet, and DETR were proposed later. YOLOv7 performs best compared to most detectors.

RCNN

The object detection system consists of three modules. The first generates category-independent region proposals; these proposals define the set of candidate detections available to the detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs.

YOLO series

The history of YOLO (You Only Look Once) dates back to 2015, when the original YOLO algorithm was introduced in “You Only Look Once: Unified, Real-Time Object Detection”. The original YOLO architecture used a convolutional neural network (CNN) to process the entire image and output a fixed number of bounding boxes along with their associated class probabilities. It divided the image into a grid and applied convolutional operations to predict bounding boxes within each grid cell, considering multiple scales and aspect ratios. In subsequent years, YOLO underwent several iterations and improvements to enhance its accuracy and speed. YOLOv2 was introduced in 2016, featuring an updated architecture that incorporated anchor boxes and multi-scale predictions. YOLOv3 followed in 2018, introducing further advancements, including feature pyramid networks (FPN) and Darknet-53 as the backbone architecture.

YOLO (You Only Look Once)

The network architecture is inspired by the GoogLeNet model for image classification. The network has 24 convolutional layers followed by 2 fully connected layers. The authors pretrain the convolutional layers on the ImageNet 1000-class competition dataset. For pretraining they use the first 20 convolutional layers followed by an average-pooling layer and a fully connected layer; they then add four convolutional layers and two fully connected layers with randomly initialized weights. The final layer predicts both class probabilities and bounding box coordinates. They optimize the sum-squared error of the model output, increasing the loss from bounding-box coordinate predictions, decreasing the loss from confidence predictions for boxes that don't contain objects, and predicting the square root of the bounding-box width and height instead of the width and height directly. The loss is designed this way to handle the problem that plain sum-squared error weights localization error equally with classification error and also weights errors in large boxes and small boxes equally.

YOLOv2

The improvements of YOLOv2 over YOLOv1:
The authors add a batch normalization layer after each convolutional layer and no longer use dropout.
YOLOv1 uses a 224x224 image classifier; YOLOv2 increases the resolution to 448x448.
Because YOLOv1 has difficulty learning to adapt to the shapes of different objects during training, resulting in poor precise localization, YOLOv2 uses rectangles of different shapes as anchor boxes. Unlike YOLOv1, it does not directly predict the coordinate values of the bounding box, but predicts offsets relative to the anchor boxes together with confidence scores.
In Faster R-CNN and SSD, the sizes of the anchor boxes are manually selected; YOLOv2 instead uses k-means clustering on the bounding boxes of the objects in the training set.
YOLOv2 uses a new base model (feature extractor), Darknet-19, with 19 convolutional layers and 5 max-pooling layers.

YOLO9000

YOLO9000 is a model, built on top of YOLOv2, that can detect more than 9,000 categories. Its main contribution is a joint training strategy for classification and detection: the detection dataset is used to learn the predicted objects' bounding boxes (bndBox), confidence, and object classification, while the classification dataset is used only to learn classification, which greatly expands the range of object types the model can detect.

The authors propose a hierarchical classification method, which builds a tree structure, WordTree, according to the relationships between categories. Softmax is not performed over all categories but over the categories at the same level. When making predictions, the model traverses down from the root node, selects the child node with the highest probability at each level, and computes the product of all conditional probabilities from that node to the root. It stops when this product falls below a certain threshold and uses the current node as the predicted category.

YOLOv3

On the basis of YOLOv2, YOLOv3 improves the network backbone, uses multi-scale feature maps for detection, and uses multiple independent logistic regression classifiers instead of softmax to predict class membership. YOLOv3 proposes a new backbone, Darknet-53: layers 0 to 74 contain 53 convolutional layers, with the rest being residual (shortcut) layers. Darknet-53 incorporates residual connections, as in ResNet, to mitigate the gradient problems of very deep networks.
YOLOv3 draws on the Feature Pyramid Network (FPN) approach, using multi-scale feature maps to detect objects of different sizes and improving the prediction of small objects. The feature map at each scale predicts 3 anchor priors, and the sizes of the anchor priors are obtained with k-means clustering.

Feature Pyramid Networks (FPN)

The main idea behind FPNs is to leverage the nature of convolutional layers — which reduce the size of the feature space and increase the coverage of each feature in the initial image — to output predictions at different scales.FPNs provide semantically strong features at multiple scales which make them extremely well suited for object detection.

YOLOv4

Bag-of-freebies refers to techniques used during network training that do not affect inference time, mainly including:
Data augmentation: random erase, CutOut, hide-and-seek, grid mask, GAN, MixUp, CutMix; regularization methods: DropOut, DropConnect; dealing with data imbalance: focal loss, online hard example mining, hard negative example mining; handling bounding-box regression: MSE, IoU, GIoU, DIoU/CIoU.

Bag-of-specials refers to techniques used in network design or post-processing that slightly increase inference time but improve accuracy, mainly including: receptive field: SPP, ASPP, RFB; feature fusion: FPN, PAN; attention mechanisms: attention modules;
activation functions: Swish, Mish; NMS: Soft-NMS, DIoU NMS.

The architecture of the YOLOv4 model consists of three parts
BackBone: CSPDarknet53; Neck: SPP+PAN; HEAD: YOLO HEAD.

Cross Stage Partial Network (CSPNet)

The main purpose of CSPNet is to enable the network architecture to obtain richer gradient-fusion information while reducing the amount of computation.
The method first divides the feature map of the base layer into two parts, and then merges them through a transition -> concatenation -> transition sequence. This approach allows CSPNet to address three problems:
increase the learning ability of the CNN, so that accuracy is maintained even when the model is made lightweight;
remove computational bottleneck structures that require high computing power (reduce computation); and reduce memory usage.

SPP+PAN

SPP (Spatial Pyramid Pooling): concatenate all feature maps in the last layer of the network, then continue with the following CNN module.
PANet (Path Aggregation Network): an improvement built on top of FPN.

CutMix

CutMix is a data augmentation method proposed in 2019: a region of an image is cut out, but instead of being filled with 0-valued pixels it is filled with a patch from another randomly chosen training image. Mixup mixes two random samples proportionally, with the classification targets mixed in the same proportion. Cutout randomly cuts out regions of a sample and fills them with 0-valued pixels, leaving the classification label unchanged.

Mosaic data augmentation

Whilst common transforms in object detection tend to be augmentations such as flips and rotations, the YOLO authors take a slightly different approach by applying Mosaic augmentation, which was previously used in the YOLOv4, YOLOv5 and YOLOX models. The objective of mosaic augmentation is to overcome the observation that object detection models tend to focus on detecting items towards the centre of the image. The key idea is that, if we stitch multiple images together, the objects are likely to appear in positions and contexts that are not normally observed in the dataset, which should force the features learned by the model to be more position invariant. Mosaic uses random scaling and cropping to mix and stitch 4 images together for training. When using Mosaic training, the data of 4 images is processed in a single sample, so the mini-batch size does not need to be large.

Post-mosaic affine transforms

As we noted earlier, the mosaics that we are creating are significantly bigger than the image sizes we will use to train our model, so we will need to do some sort of resizing. Whilst simply resizing would work, it is likely to result in some very small objects, as we are essentially squeezing four images into the area of one; this becomes a problem in domains that already contain very small bounding boxes. Additionally, each of our mosaics is structurally quite similar, with an image in each quadrant. Recalling that our aim was to make the model more robust to position changes, this may not actually help that much, as the model is likely just to start looking in the middle of each quadrant. To overcome this, one approach is to simply take a random crop from our mosaic. This still provides variability in positioning whilst preserving the size and aspect ratio of the target objects. At this point, it is also a good opportunity to add other transforms such as scaling and rotation for even more variability.
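A minimal numpy sketch of the two steps just described: stitch four images into a 2×2 mosaic, then take a random crop at the original training resolution. The bounding-box bookkeeping (shifting and clipping the boxes) is omitted for brevity, and the assumption that each image has already been resized to the training resolution is a simplification.

```python
import numpy as np

def mosaic_then_crop(images, out_size=640):
    """images: list of 4 HxWx3 uint8 arrays, assumed already resized to out_size x out_size.
    Returns an (out_size, out_size, 3) random crop of a 2x2 mosaic."""
    s = out_size
    mosaic = np.zeros((2 * s, 2 * s, 3), dtype=images[0].dtype)
    mosaic[:s, :s] = images[0]       # top-left
    mosaic[:s, s:] = images[1]       # top-right
    mosaic[s:, :s] = images[2]       # bottom-left
    mosaic[s:, s:] = images[3]       # bottom-right

    # random crop back to the training resolution, so object scale is preserved
    y0 = np.random.randint(0, s + 1)
    x0 = np.random.randint(0, s + 1)
    return mosaic[y0:y0 + s, x0:x0 + s]

imgs = [np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8) for _ in range(4)]
crop = mosaic_then_crop(imgs)
print(crop.shape)  # (640, 640, 3)
```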

DropBlock regularization

Dropout randomly drops individual neurons, but the network can still learn much the same information from adjacent activation units.
DropBlock instead drops entire contiguous regions of the feature map, which forces the network to learn the remaining features more thoroughly to achieve correct classification and gives better generalization.
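A simplified sketch of DropBlock for a 4-D feature map; the published method also derives the sampling probability gamma from a target drop rate, whereas here gamma is passed in directly as an assumption.

```python
import torch
import torch.nn.functional as F

def dropblock(x, block_size=7, gamma=0.05, training=True):
    """x: (N, C, H, W). Zero out contiguous block_size x block_size regions,
    then rescale so the expected activation magnitude is preserved."""
    if not training or gamma == 0:
        return x
    # sample block centres, then grow each centre into a block with max-pooling
    centres = (torch.rand_like(x) < gamma).float()
    block_mask = F.max_pool2d(centres, kernel_size=block_size,
                              stride=1, padding=block_size // 2)
    keep_mask = 1.0 - block_mask.clamp(max=1.0)          # 0 inside dropped blocks
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)

x = torch.randn(2, 16, 32, 32)
y = dropblock(x, block_size=5, gamma=0.02)
```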

Class label smoothing

In multi-class tasks, the output is usually normalized with softmax and a one-hot label is then used to compute the cross-entropy loss for training. However, hard one-hot targets easily lead to overfitting, so label smoothing softens the one-hot label; this suppresses overfitting when computing the loss and improves the generalization ability of the model.
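A minimal sketch of the smoothed targets, assuming K classes and a smoothing factor eps (0.1 is a commonly used value):

```python
import torch

def smooth_one_hot(labels, num_classes, eps=0.1):
    """Replace the hard one-hot target with (1 - eps) on the true class
    and eps / num_classes spread uniformly over all classes."""
    one_hot = torch.zeros(labels.size(0), num_classes)
    one_hot.scatter_(1, labels.unsqueeze(1), 1.0)
    return one_hot * (1.0 - eps) + eps / num_classes

print(smooth_one_hot(torch.tensor([2]), num_classes=5))
# tensor([[0.0200, 0.0200, 0.9200, 0.0200, 0.0200]])
```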

Mish activation

Mish is a continuously differentiable, non-monotonic activation function. Compared with ReLU, Mish has a smoother gradient and allows a small negative gradient for negative inputs, which stabilizes the gradient flow through the network and gives better generalization.
$f(x) = x\tanh(\ln(1+e^{x}))$.
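Equivalently, in code (softplus(x) = ln(1 + e^x)):

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish: x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x)))."""
    return x * torch.tanh(F.softplus(x))

print(mish(torch.tensor([-2.0, 0.0, 2.0])))  # approx. [-0.2525, 0.0000, 1.9440]
```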

Multiinput weighted residual connections (MiWRC)

YOLOv4 borrows the architecture and methods of EfficientDet and uses multi-input weighted residual connections (MiWRC). The backbone of EfficientDet is EfficientNet and its neck is BiFPN. EfficientNet-B0 is built from multiple MBConv blocks; the MBConv block is the inverted residual block from MobileNetV2. MBConv first expands the channel dimension and then reduces it, which is the opposite of the ordinary residual block that first reduces and then expands; this design lets MobileNetV2 make better use of residual connections and improves accuracy. The idea of MiWRC comes from BiFPN: in FPN the features from each level are treated as equal, whereas MiWRC assumes that features from different levels have different importance and therefore assigns different weights to features of different scales.

loss

There are two problems with using the plain IoU loss: (1) when the predicted box and the target (ground-truth) box do not intersect, the IoU is 0, which does not reflect how far apart the two boxes are; in that region the loss provides no useful gradient, so the non-overlapping case cannot be optimized; (2) the IoU cannot reflect how the predicted box and the target box are aligned, since the same IoU value can correspond to very different overlaps.
Subsequent GIoU, DIoU and CIoU losses add a penalty term on top of the IoU loss:
GIoU loss (Generalized IoU loss): C is the smallest enclosing box of the ground-truth box and the predicted box.
$L_{GIoU} = 1 - IoU + \frac{|C \setminus (B \cup B^{gt})|}{|C|}$.
DIoU loss (Distance IoU loss) considers both the overlap area and the distance between the box centers, adding a penalty term that minimizes the distance between the centers of the two boxes. CIoU loss (Complete IoU loss) adds a further penalty term on top of DIoU that also takes the aspect ratio into account.
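A minimal sketch of the GIoU and DIoU losses for axis-aligned boxes in (x1, y1, x2, y2) format; the CIoU aspect-ratio term is omitted for brevity.

```python
import torch

def giou_diou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2). Returns (L_GIoU, L_DIoU)."""
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # smallest enclosing box C
    cx1, cy1 = torch.min(pred[:, 0], target[:, 0]), torch.min(pred[:, 1], target[:, 1])
    cx2, cy2 = torch.max(pred[:, 2], target[:, 2]), torch.max(pred[:, 3], target[:, 3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / (area_c + eps)

    # squared distance between centres, normalized by the squared diagonal of C
    centre_dist = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 + \
                  ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    diag_c = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    diou = iou - centre_dist / (diag_c + eps)

    return 1 - giou, 1 - diou

pred = torch.tensor([[0.0, 0.0, 2.0, 2.0]])
target = torch.tensor([[1.0, 1.0, 3.0, 3.0]])
print(giou_diou_loss(pred, target))
```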

CmBN (Cross mini-Batch Normalization)

BN normalizes over the current mini-batch, but the batch size is often very small, so the sampled statistics can be unrepresentative and the normalization becomes unreliable; for this reason there are several Batch Normalization variants designed for small batch sizes. The idea of CBN (Cross-Iteration Batch Normalization) is to include previous mini-batches in the calculation without keeping too many of them: the statistics of the current mini-batch and the previous three mini-batches are used for normalization. CmBN, newly introduced in YOLOv4, is a modification of CBN: it restricts the accumulation to the mini-batches within a single batch and updates the network parameters only after the whole batch is completed.

Self-Adversarial Training (SAT)

SAT is a data augmentation method introduced by the authors and is carried out in two stages. In the first stage, the training samples are forward-propagated, and during back-propagation the image pixels (rather than the network weights) are modified so as to degrade detection performance; in this way the neural network performs an adversarial attack on itself, creating the illusion that there is no object in the picture. This first stage effectively increases the difficulty of the training samples. In the second stage, the modified images are used to train the model.

Eliminate grid sensitivity

The authors observed from detection videos that predicted object centers tend to lie close to the center of a grid cell, and objects whose centers lie on the edge of a cell are hard to detect. They attribute this to the gradient of the sigmoid used for the center offset: reaching values near 0 or 1 (the cell border) requires extreme logits, where the sigmoid gradient is very small. They therefore modify the decoding by multiplying the sigmoid by a value greater than 1 and shifting it, i.e. using (1 + x) * Sigmoid - 0.5 * x, and taking into account the sensitivity of different grid sizes to this boundary effect: the higher the grid resolution, the larger the x that is used.
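A one-line illustration of the adjusted decoding, writing the document's (1 + x) * Sigmoid - 0.5 * x with scale = 1 + x; the scale values used per head in the published models are configuration details and the 1.1 below is only an example.

```python
import torch

def decode_center(t, scale=1.1):
    """Standard YOLO decoding uses sigmoid(t); the scaled version stretches the range
    slightly beyond [0, 1] so offsets at the cell border become reachable."""
    return scale * torch.sigmoid(t) - 0.5 * (scale - 1.0)

print(decode_center(torch.tensor([-10.0, 0.0, 10.0])))  # approx. [-0.05, 0.50, 1.05]
```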

Cosine annealing scheduler

Cosine annealing adjusts the learning rate with a cosine function: the learning rate decreases slowly at first, drops more quickly in the middle of the schedule, and then decreases slowly again towards the end.
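The commonly used form of this schedule (from SGDR) is $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$, where $t$ is the current step and $T$ is the length of the annealing period.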

Optimal hyperparameters

Genetic (evolutionary) algorithms are used to select hyperparameters. The method is to train with random combinations of hyperparameters, keep the best 10% of the combinations, recombine and mutate them for further rounds of training, and finally select the best model.
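A minimal sketch of such a search loop, assuming a hypothetical `train_and_evaluate` function that returns a fitness score (e.g. mAP) for a given hyperparameter dictionary; the search space, mutation range and population size below are arbitrary choices for illustration.

```python
import random

SEARCH_SPACE = {"lr": (1e-4, 1e-1), "momentum": (0.6, 0.98), "mosaic_prob": (0.0, 1.0)}

def random_hparams():
    return {k: random.uniform(lo, hi) for k, (lo, hi) in SEARCH_SPACE.items()}

def mutate(parent, rate=0.2):
    child = dict(parent)
    for k, (lo, hi) in SEARCH_SPACE.items():
        if random.random() < rate:
            child[k] = min(hi, max(lo, child[k] * random.uniform(0.7, 1.3)))
    return child

def evolve(train_and_evaluate, generations=5, population=20, keep_frac=0.1):
    pop = [random_hparams() for _ in range(population)]
    for _ in range(generations):
        scored = sorted(pop, key=train_and_evaluate, reverse=True)   # one run per candidate
        elites = scored[:max(1, int(population * keep_frac))]        # keep the best 10%
        pop = elites + [mutate(random.choice(elites)) for _ in range(population - len(elites))]
    return max(pop, key=train_and_evaluate)

# Usage (with a dummy objective standing in for an actual training run):
best = evolve(lambda h: -abs(h["lr"] - 0.01), generations=3, population=10)
```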

SAM-block (Spatial Attention Module)

SAM (Spatial Attention Module) is derived from the CBAM (Convolutional Block Attention Module) paper, which provides two attention mechanisms: channel attention and spatial attention.

DIoU-NMS

In classic NMS, the detection box with the highest confidence is compared against the other detection boxes by computing their IoU one by one, and any box whose IoU exceeds the threshold is filtered out. In practice, however, when two different objects are very close together their IoU is relatively large, so after NMS only one detection box may remain and an object is missed. DIoU-NMS considers not only the IoU value but also the distance between the center points of the two boxes: if the IoU between two boxes is relatively large but their centers are far apart, they are regarded as detections of different objects and neither is filtered out.
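A compact sketch of the DIoU-NMS idea, reusing the same DIoU expression as in the loss section; this is a simplification of the published algorithm, not a reference implementation.

```python
import torch

def diou_nms(boxes, scores, threshold=0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns indices of kept boxes.
    A box is suppressed only if IoU minus the centre-distance penalty exceeds threshold."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        b, others = boxes[i], boxes[rest]
        # IoU with the highest-scoring box
        ix1 = torch.max(b[0], others[:, 0]); iy1 = torch.max(b[1], others[:, 1])
        ix2 = torch.min(b[2], others[:, 2]); iy2 = torch.min(b[3], others[:, 3])
        inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
        union = (b[2] - b[0]) * (b[3] - b[1]) + \
                (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1]) - inter
        iou = inter / union.clamp(min=1e-7)
        # centre-distance penalty, normalized by the enclosing-box diagonal
        cx1 = torch.min(b[0], others[:, 0]); cy1 = torch.min(b[1], others[:, 1])
        cx2 = torch.max(b[2], others[:, 2]); cy2 = torch.max(b[3], others[:, 3])
        centre_dist = ((b[0] + b[2]) / 2 - (others[:, 0] + others[:, 2]) / 2) ** 2 + \
                      ((b[1] + b[3]) / 2 - (others[:, 1] + others[:, 3]) / 2) ** 2
        diag = ((cx2 - cx1) ** 2 + (cy2 - cy1) ** 2).clamp(min=1e-7)
        diou = iou - centre_dist / diag
        order = rest[diou < threshold]   # boxes with distant centres survive
    return keep
```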

YOLOv7

Anchor boxes

The YOLOv7 family is anchor-based. In these models, the general philosophy is to first create lots of potential bounding boxes, then select the most promising options to match to our target objects, slightly moving and resizing them as necessary to obtain the best possible fit. The basic idea is that we draw a grid on top of each image and, at each grid intersection (anchor point), generate candidate boxes (anchor boxes) based on a number of anchor sizes; that is, the same set of boxes is repeated at each anchor point. One issue with this approach is that our target (ground-truth) boxes can range in size from tiny to huge, so it is usually not possible to define a single set of anchor sizes that can be matched to all targets. For this reason, anchor-based architectures usually employ a Feature Pyramid Network (FPN) so that different anchor sizes can be matched at different scales.

Center Priors

If we place 3 anchor boxes at each anchor point of each grid, we end up with a very large number of boxes, and most of these predictions will not contain an object (they are classified as 'background'). To make the problem computationally cheaper, the YOLOv7 loss first finds the anchor boxes that are likely to match each target box and treats them differently; these are known as the center prior anchor boxes. This process is applied at each FPN head, for each target box, across all images in the batch at once.

model reparameterization

Model re-parameterization techniques merge multiple computational modules into one at the inference stage. Re-parameterization can be regarded as an ensemble technique and can be divided into two categories: module-level ensemble and model-level ensemble.
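A classic module-level example is folding a BatchNorm layer into the preceding convolution at inference time; the sketch below shows this fusion as a simplification of what re-parameterized blocks (e.g. RepConv-style modules) do, not YOLOv7's exact code.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d whose output equals bn(conv(x)) in eval mode."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(var + eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(8, 16, 3, padding=1, bias=False), nn.BatchNorm2d(16)
bn.train(); bn(conv(torch.randn(4, 8, 20, 20)))   # populate the running statistics
bn.eval()
x = torch.randn(1, 8, 20, 20)
fused = fuse_conv_bn(conv, bn)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-6))  # True
```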

Model scaling

Model scaling is a way to scale an already designed model up or down so that it fits different computing devices. Network architecture search (NAS) is one of the commonly used model scaling methods.

efficient layer aggregation networks (ELAN)
VovNet/OSANet

VoVNet (also referred to as OSANet) is a convolutional backbone proposed by Lee et al. in the 2019 paper "An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection". It is built by stacking One-Shot Aggregation (OSA) modules and is designed to be an energy- and GPU-computation-efficient backbone for real-time object detection.

The One-Shot Aggregation (OSA) module is designed to be more efficient than the Dense Block in DenseNet: instead of aggregating features at every layer, each layer's output is aggregated only once, in the last layer of the module. Cascading OSA modules forms the efficient object-detection backbone VoVNet. OSA has much lower memory access cost (MAC) than a dense block and also improves GPU computation efficiency: the input sizes of the intermediate layers of an OSA module are constant, so no additional 1×1 bottleneck convolutions are needed to reduce dimensions, which means the module consists of fewer layers.

CSPVOVNet

It combines CSPNet and VoVNet and considers the gradient path for improvement, so that the weights of different layers can learn more diverse features to improve accuracy.

Deep supervision

When training deep networks, auxiliary heads (auxiliary classifiers) are often added to intermediate layers of the network to improve stability and convergence speed and to avoid vanishing gradients; that is, an auxiliary loss is used to train the weights of the shallower layers. This technique is called deep supervision.

dynamic label assignment

A label assigner is a mechanism that considers the network's predictions together with the ground truth and then assigns soft labels. In the past, the target label was usually a hard label that follows the ground truth directly; in recent years, soft labels obtained by performing some optimization on the model's predictions together with the ground truth have also been used. This mechanism is called the label assigner in the paper. The authors discuss three ways of assigning soft labels to the auxiliary head and the lead head:
  • Independent: the auxiliary head and the lead head each perform label assignment against the ground truth separately; this is currently the most commonly used method.
  • Lead head guided label assigner: since the lead head has a stronger learning ability than the auxiliary head, the soft label obtained by optimizing the lead head's predictions against the ground truth better expresses the distribution of, and correlation between, the data and the ground truth. This soft label is then used as the training target for both the auxiliary head and the lead head, so that the shallower auxiliary head can directly learn the information the lead head has already learned, while the lead head concentrates on the residual information that has not yet been learned.
  • Coarse-to-fine lead head guided label assigner: the soft labels are again obtained by optimizing the lead head's predictions against the ground truth, but two different soft labels are generated: a coarse label and a fine label. The fine label is the same as the lead head's soft label, while the coarse label is used for the auxiliary head.

Optimal Transport Assignment

The simplest approach is to define an Intersection over Union (IoU) threshold and decide based on that. While this generally works, it becomes problematic when there are occlusions, ambiguity or when multiple objects are very close together. Optimal Transport Assignment (OTA) aims to solve some of these problems by considering label assignment as a global optimization problem for each image.YOLOv7 implements simOTA (introduced in the YOLOX paper), a simplified version of the OTA problem.

Model EMA

When training a model, it can be beneficial to set the values of the model weights by taking a moving average of the parameters observed across the entire training run, as opposed to using the parameters obtained after the last incremental update. This is often done by maintaining an exponential moving average (EMA) of the model parameters; in practice, this usually means maintaining another copy of the model to store the averaged weights. This technique has been employed in the training schemes of several popular models such as MNASNet, MobileNet-V3 and EfficientNet.

The approach to EMA taken by the YOLOv7 authors is slightly different to other implementations as, instead of using a fixed decay, the amount of decay changes based on the number of updates that have been made.
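A minimal sketch of an EMA wrapper whose decay ramps up with the number of updates, in the spirit described above; the ramp function and the constants (max_decay, tau) are assumptions resembling common implementations, not the exact YOLOv7 values.

```python
import copy
import math
import torch

class ModelEMA:
    """Keep an exponentially-averaged copy of a model's parameters and buffers."""
    def __init__(self, model, max_decay=0.9999, tau=2000.0):
        self.ema = copy.deepcopy(model).eval()          # separate copy holding the averages
        for p in self.ema.parameters():
            p.requires_grad_(False)
        self.updates = 0
        # decay starts small and approaches max_decay as training progresses
        self.decay_fn = lambda n: max_decay * (1.0 - math.exp(-n / tau))

    @torch.no_grad()
    def update(self, model):
        self.updates += 1
        d = self.decay_fn(self.updates)
        ema_state, model_state = self.ema.state_dict(), model.state_dict()
        for k, v in ema_state.items():
            if v.dtype.is_floating_point:
                v.mul_(d).add_(model_state[k].detach(), alpha=1.0 - d)

# Usage: after each optimizer step, call ema.update(model); evaluate with ema.ema.
```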

Loss algorithm

We can break down the algorithm used in the YOLOv7 loss calculation into the following steps:

  1. For each FPN head (or each FPN head and Aux FPN head pair if Aux heads are used):
    • Find the Center Prior anchor boxes.
    • Refine the candidate selection through the simOTA algorithm (always using the lead FPN heads for this).
    • Obtain the objectness loss using Binary Cross Entropy Loss between the predicted objectness probability and the Complete Intersection over Union (CIoU) with the matched target as ground truth. If there are no matches, this is 0.
    • If there are any selected anchor box candidates, also calculate (otherwise they are just 0):
      • The box (or regression) loss, defined as mean(1 - CIoU) between all candidate anchor boxes and their matched targets.
      • The classification loss, using Binary Cross Entropy Loss between the predicted class probabilities for each anchor box and a one-hot encoded vector of the true class of the matched target.
    • If the model uses auxiliary heads, add each component obtained from the aux head to the corresponding main loss component (i.e. x = x + aux_wt * aux_x). The contribution weight (aux_wt) is defined by a predefined hyperparameter.
    • Multiply the objectness loss by the corresponding FPN head weight (a predefined hyperparameter).
  2. Multiply each loss component (objectness, classification, regression) by its contribution weight (a predefined hyperparameter).
  3. Sum the already weighted loss components.
  4. Multiply the final loss value by the batch size.

using yolov7

GitHub repository: https://github.com/WongKinYiu/yolov7
Format converter: https://github.com/wy17646051/UA-DETRAC-Format-Converter

potential ideas

efficiency

To improve the real-time performance of a detector, researchers generally analyze efficiency in terms of the number of parameters, the amount of computation, computational density, memory access cost, the input/output channel ratio, element-wise operations, and so on. These analysis methods are similar to those used in ShuffleNetV2.

NAS(Neural Architecture Search)

NAS was an inspiring work out of Google that led to several follow-up works such as ENAS, PNAS, and DARTS. It involves training a recurrent neural network (RNN) controller with reinforcement learning (RL) to automatically generate architectures.

Vision Transformer

The core conclusion of the original ViT paper is that, when there is enough data for pre-training, ViT's performance exceeds that of CNNs: it overcomes the Transformer's lack of inductive bias and transfers better to downstream tasks. However, when the training dataset is not large enough, ViT usually performs worse than ResNets of comparable size, because the Transformer lacks the inductive biases of CNNs, i.e. prior knowledge or good assumptions built in beforehand.

improving anchor box selection

datasets

PASCAL VOC 2007, VOC 2012, Microsoft COCO (Common Objects in Context).
UA-DETRAC: https://detrac-db.rit.albany.edu/
https://www.kaggle.com/datasets/patrikskalos/ua-detrac-fix-masks-two-wheelers?resource=download
https://colab.research.google.com/github/hardik0/Multi-Object-Tracking-Google-Colab/blob/main/Towards-Realtime-MOT-Vehicle-Tracking.ipynb#scrollTo=y6KZeLt9ViDe
https://github.com/hardik0/Towards-Realtime-MOT/tree/master
https://github.com/wy17646051/UA-DETRAC-Format-Converter/tree/main
MIO-TCD:https://tcd.miovision.com/
KITTI:https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark
TRANCOS: https://gram.web.uah.es/data/datasets/trancos/index.html
STREETS:https://www.kaggle.com/datasets/ryankraus/traffic-camera-object-detection: single class
VERI-Wild: https://github.com/PKU-IMRE/VERI-Wild

https://universe.roboflow.com/7-class/11-11-2021-09.41
https://universe.roboflow.com/szabo/densitytrafficcontroller-1axlm
https://universe.roboflow.com/future-institute-of-technology-1wuwl/indian-vehicle-set-1
https://universe.roboflow.com/cv-2022-kyjj6/tesi
https://universe.roboflow.com/vehicleclassification-kxtkb/vehicle_classification-fvssn
https://universe.roboflow.com/urban-data/urban-data
https://www.kaggle.com/datasets/ashfakyeafi/road-vehicle-images-dataset
https://github.com/MaryamBoneh/Vehicle-Detection

References

https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173
https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-1-33220ebc1d09
https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-2-85ee99d114a1
https://medium.com/@chingi071/yolo%E6%BC%94%E9%80%B2-3-yolov4%E8%A9%B3%E7%B4%B0%E4%BB%8B%E7%B4%B9-5ab2490754ef
https://zhuanlan.zhihu.com/p/183261974
https://sh-tsang.medium.com/review-vovnet-osanet-an-energy-and-gpu-computation-efficient-backbone-network-for-real-time-3b26cd035887
https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-yolov7-%E8%AB%96%E6%96%87%E9%96%B1%E8%AE%80-97b0e914bdbe
https://towardsdatascience.com/yolov7-a-deep-dive-into-the-current-state-of-the-art-for-object-detection-ce3ffedeeaeb
https://towardsdatascience.com/neural-architecture-search-limitations-and-extensions-8141bec7681f
https://learnopencv.com/fine-tuning-yolov7-on-custom-dataset/#The-Training-Experiments-that-We-Will-Carry-Out
https://learnopencv.com/yolov7-object-detection-paper-explanation-and-inference/

speech embedding

Contrastive Predictive Coding

Contrastive Predictive Coding (CPC) learns self-supervised representations by predicting the future in latent space using powerful autoregressive models. The model uses a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful for predicting future samples. It is a form of unidirectional modeling in the feature space, where the model learns to predict near-future frames of an acoustic sequence while contrasting them with frames from other sequences or with frames from a more distant time.
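A minimal sketch of the contrastive (InfoNCE-style) scoring this describes: given a context vector, score the true future latent frame against negatives drawn from other sequences or distant times, and train with cross-entropy on which candidate is the true one. The shapes and the linear prediction head are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce_loss(context, future, negatives, predictor):
    """context: (B, D) aggregated past; future: (B, D) true future latents;
    negatives: (B, K, D) latents from other sequences or distant times."""
    pred = predictor(context)                                         # (B, D) predicted future latent
    candidates = torch.cat([future.unsqueeze(1), negatives], dim=1)   # (B, 1+K, D)
    logits = torch.einsum("bd,bkd->bk", pred, candidates)             # dot-product scores
    labels = torch.zeros(context.size(0), dtype=torch.long)           # the true future is at index 0
    return F.cross_entropy(logits, labels)

B, K, D = 8, 10, 256
predictor = nn.Linear(D, D)
loss = info_nce_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D), predictor)
```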

Autoregressive Predictive Coding

The APC approach uses an autoregressive model to encode the temporal information of the past acoustic sequence; the model then predicts future frames, like a recurrent-based LM, while conditioning on the past frames.

TERA

TERA, which stands for Transformer Encoder Representations from Alteration, is a self-supervised speech pre-training method.

experiment design

Measure how much labeled data is needed to perform well, comparing models trained with and without pre-training.