Deep Learning (3) - The PyTorch Computing Framework

PyTorch

Tensors

Tensors are a specialized data structure that is very similar to arrays and matrices. Tensors are similar to NumPy's ndarrays, except that tensors can run on GPUs or other hardware accelerators, and they are also optimized for automatic differentiation.

By default, tensors are created on the CPU. We need to explicitly move tensors to the GPU using the .to method (after checking that a GPU is available). Keep in mind that copying large tensors across devices can be expensive in terms of time and memory!
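
For example (a minimal sketch; the tensor values are arbitrary):

import torch

tensor = torch.ones(4, 4)          # created on the CPU by default
if torch.cuda.is_available():      # check availability before moving
    tensor = tensor.to("cuda")     # explicit copy to the (default) GPU
print(tensor.device)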

Operations that store the result into the operand are called in-place operations. They are denoted by a _ suffix: for example, x.copy_(y) and x.t_() will change x.

Data Loading

PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset, which let you use pre-loaded datasets as well as your own data.
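
A toy custom Dataset wrapped in a DataLoader might look like this (ToyDataset and its contents are purely illustrative):

import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __init__(self, n=100):
        self.x = torch.randn(n, 3)             # 100 random feature vectors
        self.y = torch.randint(0, 2, (n,))     # binary labels

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

loader = DataLoader(ToyDataset(), batch_size=16, shuffle=True)
for xb, yb in loader:
    print(xb.shape, yb.shape)   # torch.Size([16, 3]) torch.Size([16])
    break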

Neural Networks

The torch.nn namespace provides all the building blocks you need to build your own neural network. Every module in PyTorch subclasses nn.Module.
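
A minimal example of such a subclass (the layer sizes are arbitrary):

import torch
from torch import nn

class TinyNet(nn.Module):                      # every model is an nn.Module subclass
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.net(x.flatten(1))          # flatten images, then apply the layers

model = TinyNet()
print(model(torch.randn(2, 1, 28, 28)).shape)  # torch.Size([2, 10])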

Automatic Differentiation

When training neural networks, the most frequently used algorithm is backpropagation. In this algorithm, the parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.

To compute these gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradients for any computational graph. When we need to compute the gradients of the loss function with respect to certain variables, we set the requires_grad attribute of those tensors. The value of requires_grad can be set when creating a tensor, or later with the x.requires_grad_(True) method. To compute the derivatives, we call loss.backward() and then retrieve the values from w.grad and b.grad.
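
A small end-to-end sketch, close to the official tutorial's example:

import torch

x = torch.ones(5)                               # input tensor
y = torch.zeros(3)                              # expected output
w = torch.randn(5, 3, requires_grad=True)       # parameters we want gradients for
b = torch.randn(3, requires_grad=True)
z = x @ w + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()                                 # compute d(loss)/dw and d(loss)/db
print(w.grad.shape, b.grad.shape)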

To stop a tensor from tracking history, you can call .detach() to detach it from the computation history and prevent its future computations from being tracked. To prevent tracking history (and using memory), you can also wrap a code block in with torch.no_grad():.
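
For example:

import torch

w = torch.randn(3, requires_grad=True)
with torch.no_grad():          # computations in this block are not tracked
    y = w * 2
print(y.requires_grad)         # False

z = (w * 2).detach()           # detach an existing result from the graph
print(z.requires_grad)         # False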

Optimization

Inside the training loop, optimization happens in three steps:
Call optimizer.zero_grad() to reset the gradients of the model parameters. Gradients add up by default; to prevent double counting, we explicitly zero them at every iteration.

Backpropagate the prediction loss with a call to loss.backward(). PyTorch stores the gradient of the loss with respect to each parameter.

Once we have the gradients, we call optimizer.step() to adjust the parameters by the gradients collected in the backward pass.

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * batch_size + len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

Saving and Loading Models

Saving and Loading Model Weights

PyTorch models store the learned parameters in an internal state dictionary, called state_dict. These can be persisted with the torch.save method.

model = models.vgg16(weights='IMAGENET1K_V1')
torch.save(model.state_dict(), 'model_weights.pth')

To load model weights, you need to create an instance of the same model first, and then load the parameters using the load_state_dict() method.

model = models.vgg16() # we do not specify ``weights``, i.e. create untrained model
model.load_state_dict(torch.load('model_weights.pth', weights_only=True))
model.eval()

Note: be sure to call model.eval() before inference to set the dropout and batch normalization layers to evaluation mode. Failing to do so will yield inconsistent inference results.

Saving and Loading Models with Shapes

When loading weights, we had to instantiate the model class first, because the class defines the structure of the network. To save the class structure together with the parameters, we pass the model itself (rather than model.state_dict()) to torch.save:

torch.save(model, 'model.pth')
model = torch.load('model.pth', weights_only=False)

References

https://pytorch.ac.cn/tutorials/beginner/basics/tensor_tutorial.html

Deep Learning (2) - Computational Performance

Imperative and Symbolic Programming

Imperative programming and symbolic programming are two different programming paradigms. Imperative programming describes the concrete steps the computer should execute to reach the desired result; it focuses on "how to do it" rather than "what to do". Symbolic programming focuses on expressing the essence of the computation rather than the concrete execution steps: the program expresses the solution to the problem rather than the detailed operations, and the code usually performs the computation only after the whole procedure has been fully defined.
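
As one illustration in PyTorch (a sketch, not a full symbolic framework), torch.jit.script turns an imperative Python function into a graph that is compiled as a whole before execution:

import torch

# Imperative: each line runs immediately, step by step.
def fancy_func(a, b, c, d):
    e = a + b
    f = c + d
    return e * f

# Symbolic-style: build a graph of the whole computation first, then execute it as one unit.
scripted = torch.jit.script(fancy_func)

args = [torch.tensor(v) for v in (1.0, 2.0, 3.0, 4.0)]
print(fancy_func(*args))   # tensor(21.)
print(scripted(*args))     # same result, computed from the compiled graph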

Asynchronous Computation

Asynchronous computation (asynchronous computing) is a programming and computation model that allows a program to keep executing other tasks while waiting for certain operations to complete. Unlike synchronous computation, asynchronous computation can improve a program's responsiveness and performance, especially when handling I/O, network requests, or long-running tasks.
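
In PyTorch, GPU kernels are launched asynchronously: the Python call returns immediately while the device keeps computing. A small timing sketch (requires a GPU; the matrix size is arbitrary):

import time
import torch

if torch.cuda.is_available():
    x = torch.randn(4000, 4000, device="cuda")
    torch.cuda.synchronize()                 # make sure setup/warm-up is done
    start = time.time()
    y = x @ x                                # kernel is queued; Python returns immediately
    launch = time.time() - start
    torch.cuda.synchronize()                 # wait until the GPU has actually finished
    total = time.time() - start
    print(f"returned after {launch:.4f}s, finished after {total:.4f}s")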

GPU, TPU, NPU

GPU stands for Graphics Processing Unit. Note that a GPU cannot work independently; it must be controlled and invoked by a CPU, and its power consumption is generally high. TPU stands for Tensor Processing Unit, a chip dedicated to neural-network computation that Google introduced at its I/O developer conference in June 2015, designed to optimize its own TensorFlow framework. NPU stands for Neural-network Processing Unit; it adopts a "data-driven parallel computing" architecture and is particularly good at processing massive amounts of multimedia data such as video and images.

Parallel Computing

CUDA is a GPU parallel computing framework provided by NVIDIA; programming the GPU itself is done in the CUDA language. In our programs, calling .cuda() moves the model or data from the CPU to the GPU (GPU 0 by default), so that the computation runs on the GPU.

When the server has multiple GPUs, we should specify which GPU to use. If we do not, tensor.cuda() will by default place the tensor on the first GPU, which is equivalent to tensor.cuda(0).
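
For example, to pick a specific device explicitly (the device indices below are illustrative):

import torch

# Pick GPU 1 if there are at least two GPUs, otherwise GPU 0, otherwise the CPU.
if torch.cuda.device_count() > 1:
    device = torch.device("cuda:1")
elif torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")

x = torch.randn(3, 3).to(device)   # equivalent to x.cuda(1) when device is cuda:1
print(x.device)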

PyTorch provides two ways to train on multiple GPUs: DataParallel and DistributedDataParallel (abbreviated below as DP and DDP). Of the two, the official documentation recommends DDP because it performs better. DP can only be used on a single machine, and since DP is implemented as a single process with multiple threads, it is less efficient than DDP's multi-process approach.

Let us first look at single-machine multi-GPU DP, which usually applies a strategy called data parallelism: the computation is split into sub-tasks that run simultaneously on multiple GPUs. The main function involved is nn.DataParallel, as shown below.

model = Net()
model.cuda()  # move the model to CUDA explicitly

if torch.cuda.device_count() > 1:  # the machine has multiple GPUs
    model = nn.DataParallel(model, device_ids=[0, 1])  # single-machine multi-GPU DP training

However, distributed multi-GPU training with DP tends to cause load imbalance: the first GPU may use noticeably more memory, because by default all outputs are gathered onto it. To address this, PyTorch also provides torch.nn.parallel.DistributedDataParallel (DDP).

With DDP, one process is launched per GPU. These processes start out identical (the initial model parameters are the same, and each process has its own optimizer), and the gradients propagated during updates are identical as well, which guarantees that the model parameters on every GPU remain exactly the same. As a result, the memory imbalance seen with DataParallel does not occur.
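
A minimal DDP sketch, assuming the script is launched with torchrun (one process per GPU); Net is the model class from the DP example above, and the data-loading part is omitted:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU, set up by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = Net().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced across processes
    # ... build a DataLoader with torch.utils.data.distributed.DistributedSampler,
    # then run the usual training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()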

References

https://zh.d2l.ai/chapter_computational-performance/index.html

https://datawhalechina.github.io/thorough-pytorch/%E7%AC%AC%E4%BA%8C%E7%AB%A0/2.3%20%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97%E7%AE%80%E4%BB%8B.html

https://datawhalechina.github.io/thorough-pytorch/%E7%AC%AC%E4%BA%8C%E7%AB%A0/2.4%20AI%E7%A1%AC%E4%BB%B6%E5%8A%A0%E9%80%9F%E8%AE%BE%E5%A4%87.html

Neural Networks

training neural networks

suggestions

The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data.

Our next step is to set up a full training + evaluation skeleton and gain trust in its correctness via a series of experiments. At this stage it is best to pick some simple model that you couldn’t possibly have screwed up somehow.
Tips & tricks for this stage:
Always use a fixed random seed to guarantee that when you run the code twice you will get the same outcome (a minimal seeding sketch is shown after this list).
Make sure to disable any unnecessary fanciness. As an example, definitely turn off any data augmentation at this stage.
When plotting the test loss run the evaluation over the entire (large) test set. Do not just plot test losses over batches and then rely on smoothing them in Tensorboard. We are in pursuit of correctness and are very willing to give up time for staying sane.
Monitor metrics other than loss that are human interpretable and checkable (e.g. accuracy). Whenever possible evaluate your own (human) accuracy and compare to it.
Train an input-independent baseline, (e.g. easiest is to just set all your inputs to zero). This should perform worse than when you actually plug in your data without zeroing it out. Does it? i.e. does your model learn to extract any information out of the input at all?
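
As a concrete illustration of the fixed-seed tip above, a seeding helper might look like this (the name set_seed is ours):

import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # For full determinism you may additionally need:
    # torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.benchmark = False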

The approach I like to take to finding a good model has two stages: first get a model large enough that it can overfit (i.e. focus on training loss) and then regularize it appropriately (give up some training loss to improve the validation loss). The reason I like these two stages is that if we are not able to reach a low error rate with any model at all that may again indicate some issues, bugs, or misconfiguration.

In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate.

references

http://karpathy.github.io/2019/04/25/recipe/

transformer

Introduction

The Transformer is a deep learning architecture introduced in the paper “Attention is All You Need” by Vaswani et al., published in 2017.
The Transformer is based on the self-attention mechanism, which allows it to capture long-range dependencies in sequences more effectively than traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs). The key components of the Transformer are: the self-attention mechanism, the encoder-decoder architecture, multi-head attention, positional encoding, and feed-forward neural networks.

Self-Attention Mechanism

The self-attention mechanism allows the model to weigh the importance of different words in a sentence while encoding the sequence. It computes the attention scores for each word in the input sequence based on its relationships with other words. By attending to relevant words, the model can focus on the most informative parts of the sequence.

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.

The fifth step is to multiply each value vector by the softmax score.

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
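
Putting the six steps together, here is a minimal, unbatched single-head sketch in PyTorch; it assumes the Query/Key/Value projections have already been applied, so Q, K and V are plain matrices:

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices built from the input embeddings.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # steps 2-3: dot products, scaled
    weights = F.softmax(scores, dim=-1)                # step 4: softmax over each row
    return weights @ V                                 # steps 5-6: weight and sum the values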

Multi-Head Attention

To capture different types of dependencies and relationships, the Transformer uses multi-head attention. It performs self-attention multiple times with different learned projection matrices, allowing the model to attend to various aspects of the input.

With multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace. If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices. We concatenate the matrices and then multiply them by an additional weights matrix WO to condense these eight down into a single matrix.
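
As a short illustration (not the original Transformer code), recent PyTorch versions package this whole computation in nn.MultiheadAttention; the sizes below are arbitrary:

import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 512, 8, 10
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(1, seq_len, embed_dim)   # (batch, seq, embed)
out, attn_weights = mha(x, x, x)         # self-attention: query = key = value = x
print(out.shape)                         # torch.Size([1, 10, 512])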

sequence-to-sequence model

A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items. The model is composed of an encoder and a decoder. The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item. The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder both tend to be recurrent neural networks. By design, an RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence) and a hidden state. The word, however, needs to be represented by a vector. To transform a word into a vector, we turn to the class of methods called "word embedding" algorithms. The context vector turned out to be a bottleneck for these types of models: it made it challenging for the models to deal with long sentences. A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called "attention". Attention allows the model to focus on the relevant parts of the input sequence as needed.

attention

An attention model differs from a classic sequence-to-sequence model in two main ways. First, the encoder passes a lot more data to the decoder: instead of passing the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder. Second, an attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder does the following: look at the set of encoder hidden states it received, give each hidden state a score, and multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.

Encoder-Decoder Architecture

The Transformer architecture consists of two main components: the encoder and the decoder. The encoder takes an input sequence and processes it, while the decoder generates an output sequence based on the encoded representation.

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its "encoder-decoder attention" layer, which helps the decoder focus on appropriate places in the input sequence. The following steps repeat the process until a special symbol is reached, indicating that the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.

The decoder stack outputs a vector of floats. How do we turn that into a word? That's the job of the final Linear layer, which is followed by a Softmax layer. The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector. The softmax layer then turns those scores into probabilities (all positive, all adding up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

Layer Normalization

In traditional normalization techniques like Batch Normalization, the activations of a layer are normalized by computing the mean and variance over a batch of examples. This normalization helps stabilize and accelerate the training process, especially for deeper networks. However, it introduces a dependency on the batch size during training, which can be problematic in scenarios where batch sizes vary or during inference when processing individual samples.

Layer Normalization addresses this dependency by computing the mean and variance across all the units within a single layer for each training example. This means that normalization is done independently for each sample and does not rely on batch statistics.
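
A small sketch of this per-sample, per-position normalization, checked against PyTorch's nn.LayerNorm (the shapes are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(2, 5, 512)                 # (batch, seq, features)
ln = nn.LayerNorm(512)                     # normalizes over the last (feature) dimension
y = ln(x)

# Manual check: statistics are computed per sample and position, independent of the batch.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + ln.eps)
print(torch.allclose(y, manual, atol=1e-5))   # True (with default affine init)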

Positional Encoding

Since Transformers do not inherently have positional information like RNNs, positional encodings are added to the input embeddings. These positional encodings provide the model with information about the order of the elements in the input sequence, or the distance between different words in the sequence.
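
As a sketch of the sinusoidal encoding used in the original paper (the function name and the assumption of an even d_model are ours):

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # Assumes d_model is even; pe[pos, 2i] = sin(pos / 10000^(2i/d_model)), pe[pos, 2i+1] = cos(...).
    position = torch.arange(max_len).unsqueeze(1)                                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                                         # added to the input embeddings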

Training

loss function

cross entropy

The cross-entropy loss calculates the negative log-likelihood of the true class’s predicted probability.
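For instance, with one example and three classes (the numbers are arbitrary), PyTorch's F.cross_entropy computes exactly this negative log-likelihood from raw logits:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # unnormalized scores for 3 classes
target = torch.tensor([0])                  # the true class index
loss = F.cross_entropy(logits, target)      # = -log softmax(logits)[0, target]
print(loss.item())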

Kullback–Leibler divergence

Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure of how one probability distribution diverges from another. KL divergence measures the average amount of information lost when using Q to approximate P. It is not symmetric.
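
For discrete distributions $P$ and $Q$ this reads:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}, \qquad D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P) \text{ in general.}$$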

decoding

greedy decoding

In greedy decoding, at each step of sequence generation, the model selects the most likely output token based on its predicted probability distribution. It chooses the token with the highest probability without considering the impact on future decisions. This means that the model makes locally optimal choices at each step without considering the global context of the entire sequence. For example, in machine translation, a model using greedy decoding will predict each target word one at a time, selecting the word with the highest probability given the source sentence and previously generated words. The process continues iteratively until an end-of-sentence token is generated.
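
A minimal greedy-decoding sketch; step_fn, bos_id and eos_id are stand-ins for a real model and tokenizer:

import torch

def greedy_decode(step_fn, bos_id, eos_id, max_len=50):
    # step_fn(prefix) -> 1-D tensor of logits over the vocabulary for the next token.
    seq = [bos_id]
    for _ in range(max_len):
        next_token = int(torch.argmax(step_fn(seq)))  # locally optimal choice at each step
        seq.append(next_token)
        if next_token == eos_id:                      # stop at the end-of-sentence token
            break
    return seq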

Beam search

In beam search, instead of selecting only the most likely token at each step, the algorithm maintains a fixed-size list, known as the “beam,” containing the most promising candidate sequences. The beam size determines how many candidate sequences are considered at each decoding step.
At the beginning of the decoding process, the beam is initialized with a single token representing the start of the sequence. At each step, the model generates the probabilities for the next possible tokens and expands the beam with the top-k most likely candidate sequences based on their cumulative probabilities. The k represents the beam size, and higher values of k result in a more diverse exploration of possibilities.
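
A compact beam-search sketch under the same assumptions as the greedy example (step_fn returns next-token logits for a given prefix):

import torch
import torch.nn.functional as F

def beam_search(step_fn, bos_id, eos_id, beam_size=4, max_len=50):
    beams = [([bos_id], 0.0)]                 # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:             # this hypothesis is complete
                finished.append((seq, score))
                continue
            log_probs = F.log_softmax(step_fn(seq), dim=-1)
            top = torch.topk(log_probs, beam_size)
            for lp, tok in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((seq + [tok], score + lp))
        if not candidates:                    # every beam has finished
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]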

references

http://fancyerii.github.io/2019/03/09/transformer-illustrated/#%E6%A6%82%E8%BF%B0
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
http://jalammar.github.io/illustrated-transformer/
https://colah.github.io/posts/2015-09-Visual-Information/
https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained

Deep Learning (1) - Gradient Descent

Gradient Descent

The basic idea of gradient descent is to move along the negative direction of the function's gradient until a local minimum is reached.

Using a Taylor expansion, we obtain

$$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2).$$

If we take $\epsilon = -\eta f'(x)$, we have

$$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x)).$$

Since $\eta f'^2(x) > 0$, for a sufficiently small $\eta$ we have

$$f(x - \eta f'(x)) \lessapprox f(x).$$

Therefore the update $x \leftarrow x - \eta f'(x)$ decreases the value of $f(x)$.
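
A minimal 1-D sketch of this update rule, using $f(x) = x^2$ (so $f'(x) = 2x$) and an illustrative learning rate:

def gradient_descent(f_grad, x0, eta=0.2, steps=10):
    x = x0
    for _ in range(steps):
        x -= eta * f_grad(x)   # x <- x - eta * f'(x)
    return x

print(gradient_descent(lambda x: 2 * x, x0=10.0))  # approaches the minimum at x = 0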

Newton's Method

Newton's method is an optimization method based on a second-order Taylor expansion; it uses the Hessian matrix to accelerate convergence.

Using a Taylor expansion, we have

$$f(x + \epsilon) = f(x) + \epsilon f'(x) + \frac{1}{2}\epsilon^2 f''(x) + \mathcal{O}(\epsilon^3).$$

At the minimum of $f$ the derivative vanishes, so differentiating the expansion with respect to $\epsilon$ and setting it to zero gives $f'(x) + \epsilon f''(x) = 0$, hence

$$\epsilon = -\frac{f'(x)}{f''(x)}, \qquad x \leftarrow x - \frac{f'(x)}{f''(x)}.$$
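
A minimal 1-D sketch of the Newton update; for a quadratic such as $f(x) = (x-3)^2$, a single step already lands on the minimum:

def newton(f_grad, f_hess, x0, steps=5):
    x = x0
    for _ in range(steps):
        x -= f_grad(x) / f_hess(x)   # x <- x - f'(x) / f''(x)
    return x

print(newton(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0))  # 3.0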

Stochastic Gradient Descent

For gradient descent on a training set of n samples, we need to compute the gradient of the loss for every one of the n samples, and this computational burden grows as n grows. Stochastic gradient descent (SGD) therefore draws a single sample at random at each step and computes the gradient on that sample only.
Since

$$\mathbb{E}_i[\nabla f_i(x)] = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x) = \nabla f(x),$$

we can regard the stochastic gradient as, on average, a good estimate of the full gradient.

Minibatch Stochastic Gradient Descent

In minibatch stochastic gradient descent, just as in stochastic gradient descent, all the elements used are drawn at random from the training set, so the expectation of the gradient stays unchanged while its variance is significantly reduced.
In general, minibatch SGD is faster than both SGD and full-batch gradient descent, and it converges with less risk.

Momentum

Momentum is an optimization algorithm that improves on gradient descent by taking a weighted average of past gradients. The basic idea is that at each iteration we consider not only the current gradient direction but also the directions of past gradients, which smooths the update direction, helps accelerate convergence, and reduces oscillation.
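
In one common formulation (the one used, up to details, by torch.optim.SGD with momentum), with velocity $v_t$, momentum coefficient $\beta$, current gradient $g_t$ and learning rate $\eta$:

$$v_t = \beta v_{t-1} + g_t, \qquad x_t = x_{t-1} - \eta v_t.$$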

AdaGrad

The core of AdaGrad is to adapt the learning rate of each parameter individually. Each parameter's learning rate is adjusted according to the sum of the squares of its historical gradients: if a parameter's historical gradients have been large, its learning rate shrinks accordingly; conversely, if its historical gradients have been small, its learning rate remains relatively larger.

RMSProp

RMSProp (Root Mean Square Propagation) is an adaptive learning-rate optimization algorithm designed to overcome the overly fast learning-rate decay of methods such as AdaGrad. RMSProp uses an exponentially weighted moving average (EMA) of the squared gradients to adapt the learning rate of each parameter, which largely fixes AdaGrad's problem of the learning rate shrinking too early because of the accumulated sum of squared gradients.

Adadelta

Adadelta is a further refinement of AdaGrad/RMSProp: it replaces the global learning rate with an exponentially weighted moving average of past squared parameter updates, so that no learning rate needs to be set by hand.

Adam

Adam combines the advantages of AdaGrad and RMSProp while addressing their respective shortcomings. Adam adapts the learning rate of each parameter by using a first-moment estimate of the gradient (i.e. momentum) and a second-moment estimate (i.e. the exponentially weighted moving average of the squared gradients, as in RMSProp).

References

https://zh.d2l.ai/chapter_optimization/gd.html