Optimization

At a position where the gradient of the function is zero, the solution of an optimization problem could be a local minimum, a local maximum, or a saddle point:

When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all positive, we have a local minimum for the function.

When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all negative, we have a local maximum for the function.

When the eigenvalues of the function’s Hessian matrix at the zero-gradient position contain both negative and positive values, we have a saddle point for the function.
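
A minimal sketch (assuming NumPy) of this classification rule: f(x, y) = x^2 - y^2 has zero gradient at the origin, and the eigenvalues of its Hessian there reveal a saddle point. The function and point are illustrative, not from the source.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has zero gradient at the origin.
# Its Hessian there is constant: [[2, 0], [0, -2]].
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(hessian)      # eigenvalues of the symmetric Hessian

if np.all(eigvals > 0):
    print("local minimum", eigvals)
elif np.all(eigvals < 0):
    print("local maximum", eigvals)
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("saddle point", eigvals)         # this branch fires for the example above
else:
    print("test inconclusive (some eigenvalues are zero)", eigvals)
```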

references

https://d2l.ai/chapter_optimization/optimization-intro.html

Transformer

The Transformer is a deep learning model based on the self-attention mechanism, first proposed by Vaswani et al. in the 2017 paper "Attention Is All You Need". It has achieved remarkable results on natural language processing (NLP) tasks, particularly in machine translation, text generation, and question answering.

Computation Flow of the Transformer

The computation flow of the Transformer can be broken down into the following main steps:

  1. Input Embedding
  2. Positional Encoding
  3. Encoder
  4. Decoder
  5. Output Layer

1. Input Embedding

  • Word embedding: each input token is converted into a fixed-dimensional vector. For example, if the vocabulary size is V and the embedding dimension is d, the word embedding matrix has shape (V, d).
  • Formula:
    $$
    X = W_E \cdot I
    $$
    where $X$ is the embedded vector, $W_E$ is the word embedding matrix, and $I$ is the input token index.
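
A minimal PyTorch sketch of this embedding lookup; the vocabulary size, embedding dimension, and token indices below are made up for illustration.

```python
import torch
import torch.nn as nn

V, d = 10000, 512                    # vocabulary size and embedding dimension (illustrative)
embedding = nn.Embedding(V, d)       # the word embedding matrix W_E, shape (V, d)

token_ids = torch.tensor([[12, 47, 3051, 7]])   # one sequence of 4 token indices
X = embedding(token_ids)             # looks up rows of W_E, shape (1, 4, d)
print(X.shape)                       # torch.Size([1, 4, 512])
```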

2. Positional Encoding

  • Purpose: since the Transformer has no recurrent structure, it needs a way to inject information about token order. Positional encoding does this by adding a position-dependent vector to each embedding.
  • Formula:
    $$
    PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
    $$
    $$
    PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
    $$
    where $pos$ is the position, $i$ is the dimension index, and $d$ is the embedding dimension.

  • Operation: the positional encoding vector is added to the word embedding vector:
    $$
    X' = X + PE
    $$
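
A minimal NumPy sketch of the sinusoidal positional encoding above, added to a stand-in embedding matrix X.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal positional encoding of shape (max_len, d); d is assumed even."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2)
    angles = pos / np.power(10000, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cos
    return pe

seq_len, d = 4, 512
X = np.random.randn(seq_len, d)                  # stand-in for the word embeddings
X_prime = X + positional_encoding(seq_len, d)    # X' = X + PE
```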

3. Encoder

  • Structure: the encoder is a stack of identical layers, each containing two sub-layers:

    1. Multi-Head Self-Attention
    2. Feed-Forward Neural Network
  • Multi-head self-attention (a minimal code sketch follows this section)

    • Linear transformations: the input $X'$ is passed through three different linear transformations (weight matrices $W_Q$, $W_K$, $W_V$) to obtain the query vectors $Q$, key vectors $K$, and value vectors $V$.
    • Formula:
      $$
      Q = X' \cdot W_Q, \quad K = X' \cdot W_K, \quad V = X' \cdot W_V
      $$
    • Self-attention computation:
      $$
      \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \cdot V
      $$
      where $d_k$ is the dimension of the key vectors.
    • Multi-head mechanism: the attention computation is split into several heads; each head is computed independently, the results are concatenated, and a final linear transformation is applied:
      $$
      \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W_O
      $$
      where $h$ is the number of heads and $W_O$ is the final projection matrix.
  • Residual connection and layer normalization

    • Residual connection: the input is added directly to the sub-layer output to mitigate the vanishing-gradient problem.
    • Layer normalization: the output of each sub-layer is normalized to stabilize training.
    • Formula:
      $$
      \text{LayerNorm}(X + \text{Sublayer}(X))
      $$
  • Feed-forward neural network

    • Structure: two fully connected layers with an activation function (e.g., ReLU) in between.
    • Formula:
      $$
      \text{FFN}(X) = \text{Linear}(\text{ReLU}(\text{Linear}(X)))
      $$
    • Residual connection and layer normalization:
      $$
      \text{LayerNorm}(X + \text{FFN}(X))
      $$
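
A minimal PyTorch sketch of the scaled dot-product attention and multi-head self-attention described above; the dimensions are illustrative, and a real encoder layer additionally applies dropout, residual connections, and layer normalization.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # QK^T / sqrt(d_k)
    return F.softmax(scores, dim=-1) @ V

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def forward(self, x):                                 # x: (batch, seq, d_model)
        B, T, _ = x.shape
        def split(t):                                     # (B, T, d_model) -> (B, h, T, d_k)
            return t.view(B, T, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_Q(x)), split(self.W_K(x)), split(self.W_V(x))
        heads = scaled_dot_product_attention(Q, K, V)     # (B, h, T, d_k)
        concat = heads.transpose(1, 2).reshape(B, T, -1)  # Concat(head_1, ..., head_h)
        return self.W_O(concat)

x = torch.randn(2, 10, 512)                               # toy batch: 2 sequences of length 10
print(MultiHeadSelfAttention()(x).shape)                  # torch.Size([2, 10, 512])
```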

4. Decoder

  • Structure: the decoder is also a stack of identical layers, each containing three sub-layers:

    1. Masked Multi-Head Self-Attention
    2. Multi-Head Attention
    3. Feed-Forward Neural Network
  • Masked multi-head self-attention (a causal-mask sketch follows this section)

    • Purpose: prevents each position from seeing future positions while generating the output.
    • Mask: a mask matrix $M$ is used in the self-attention computation so that the current position cannot attend to future positions.
    • Formula:
      $$
      \text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T + M}{\sqrt{d_k}}\right) \cdot V
      $$
  • Multi-head attention

    • Purpose: allows the decoder to attend to the encoder's output.
    • Formula:
      $$
      \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \cdot V
      $$
      where $Q$ comes from the decoder, while $K$ and $V$ come from the encoder.
  • Feed-forward neural network

    • Structure: identical to the feed-forward network in the encoder.
    • Formula:
      $$
      \text{FFN}(X) = \text{Linear}(\text{ReLU}(\text{Linear}(X)))
      $$
  • Residual connection and layer normalization

    • Formula:
      $$
      \text{LayerNorm}(X + \text{Sublayer}(X))
      $$
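
A minimal PyTorch sketch of the causal (look-ahead) mask used in the decoder's masked self-attention: M places -inf above the diagonal so that, after the softmax, future positions receive zero attention weight. The head size and sequence length are illustrative.

```python
import math
import torch
import torch.nn.functional as F

T = 5                                                 # toy sequence length
Q = K = V = torch.randn(T, 16)                        # single head with d_k = 16

# M is 0 on and below the diagonal and -inf strictly above it.
M = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

scores = Q @ K.T / math.sqrt(Q.size(-1)) + M          # (QK^T) / sqrt(d_k) + M
weights = F.softmax(scores, dim=-1)                   # rows sum to 1; future positions get weight 0
out = weights @ V
print(weights[0])                                     # the first position attends only to itself
```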

5. Output Layer

  • Linear transformation: the decoder output is passed through a linear layer that maps it to the vocabulary size.
  • Formula:
    $$
    Y = \text{Linear}(X)
    $$
  • Softmax: the result of the linear transformation is passed through a softmax function to obtain a probability distribution over the vocabulary.
  • Formula:
    $$
    P = \text{softmax}(Y)
    $$
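
A minimal PyTorch sketch of the output layer: a linear projection to the vocabulary size followed by a softmax; the sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 10000                   # illustrative sizes
decoder_output = torch.randn(1, 4, d_model)        # (batch, seq_len, d_model)

proj = nn.Linear(d_model, vocab_size)              # Y = Linear(X)
logits = proj(decoder_output)                      # (1, 4, vocab_size)
P = F.softmax(logits, dim=-1)                      # probability distribution over the vocabulary
print(P.sum(dim=-1))                               # each position sums to 1
```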

Summary

The Transformer's computation flow consists of input embedding, positional encoding, the encoder, the decoder, and the output layer. Through multi-head self-attention and feed-forward networks, the Transformer effectively captures long-range dependencies and achieves strong performance across many NLP tasks.

references

https://zhuanlan.zhihu.com/p/77307258
https://zhuanlan.zhihu.com/p/47812375

Finetuning large language models

The 3 Conventional Feature-Based and Finetuning Approaches

Feature-Based Approach

In the feature-based approach, we load a pretrained LLM and apply it to our target dataset. Here, we are particularly interested in generating the output embeddings for the training set, which we can use as input features to train a classification model.
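
A minimal sketch of the feature-based approach, assuming the Hugging Face transformers library and scikit-learn; the checkpoint name, the pooling choice (first-token embedding), and the toy dataset are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")   # illustrative backbone
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

texts = ["great movie", "terrible plot"]           # toy training set
labels = [1, 0]

with torch.no_grad():                              # the LLM is only used as a feature extractor
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # (batch, seq_len, hidden_dim)
    features = hidden[:, 0, :].numpy()             # first-token embedding as the feature vector

clf = LogisticRegression().fit(features, labels)   # separate classifier trained on frozen embeddings
print(clf.predict(features))
```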

Finetuning I – Updating The Output Layers

A popular approach related to the feature-based approach described above is finetuning the output layers (we will refer to this approach as finetuning I). Similar to the feature-based approach, we keep the parameters of the pretrained LLM frozen. We only train the newly added output layers.
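
A minimal sketch of finetuning I, assuming the Hugging Face transformers library: the pretrained backbone is frozen and only the newly added classification head is trained. The checkpoint name and label count are illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2        # illustrative backbone and label count
)

# Freeze the pretrained backbone; only the newly added classification head stays trainable.
for p in model.base_model.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
print(sum(p.numel() for p in trainable), "trainable parameters")
```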

Finetuning II – Updating All Layers

When optimizing for modeling performance, the gold standard for using pretrained LLMs is to update all layers.
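
For contrast, a sketch of finetuning II with the same (illustrative) setup: nothing is frozen, and a smaller learning rate is typically used when all layers are updated.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2              # illustrative backbone and label count
)
assert all(p.requires_grad for p in model.parameters())  # no layer is frozen

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small LR is common for full finetuning
```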

parameter-efficient finetuning techniques (PEFT)

Parameter-efficient finetuning techniques (PEFT) aim to finetune an LLM with high modeling performance while requiring the training of only a small number of parameters. Techniques such as prefix tuning, adapters, and low-rank adaptation, all of which “modify” multiple layers, achieve much better predictive performance (at a low cost).
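
A minimal from-scratch sketch of the idea behind low-rank adaptation (LoRA), one of the PEFT techniques mentioned above: a frozen pretrained weight matrix is augmented with a small trainable low-rank update. This is an illustration of the concept, not the peft library's API; the rank and scaling values are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (B A x) * scaling."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} / {total}")  # only A and B are trained
```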

Reinforcement Learning with Human Feedback (RLHF)

In RLHF, human feedback is collected by having humans rank or rate different model outputs, providing a reward signal. The collected reward labels can then be used to train a reward model, which is in turn used to guide the LLM's adaptation to human preferences.

The reward model itself is learned via supervised learning (typically using a pretrained LLM as base model). Next, the reward model is used to update the pretrained LLM that is to be adapted to human preferences — the training uses a flavor of reinforcement learning called proximal policy optimization.
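
A minimal sketch of how a reward model can be trained on human preference pairs with a pairwise ranking loss; the network, representations, and data below are stand-ins, and the subsequent PPO step is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for "pretrained LLM + scalar head": maps a response representation to a scalar reward.
reward_model = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Toy preference pairs: representations of human-preferred and rejected responses.
chosen = torch.randn(4, 512)
rejected = torch.randn(4, 512)

# Pairwise ranking loss: push the reward of the chosen response above that of the rejected one.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
```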

prompt tuning

In a nutshell, prompt tuning (different from prompting) appends a tensor to the embedded inputs of a pretrained LLM. The tensor is then tuned to optimize a loss function for the finetuning task and data while all other parameters in the LLM remain frozen.

The main idea behind prompt tuning, and parameter-efficient finetuning methods in general, is to add a small number of new parameters to a pretrained LLM and only finetune the newly added parameters to make the LLM perform better on (a) a target dataset (for example, a domain-specific dataset like medical or legal documents) and (b) a target task (for example, sentiment classification).
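
A minimal PyTorch sketch of prompt tuning: a small trainable tensor of "soft prompt" vectors is prepended to the embedded inputs, and only that tensor would be updated during finetuning. The frozen encoder is a stand-in for a pretrained LLM, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

d_model, num_virtual_tokens = 512, 20

frozen_llm = nn.TransformerEncoder(                        # stand-in for a frozen pretrained LLM
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True), num_layers=2)
for p in frozen_llm.parameters():
    p.requires_grad = False

# The only trainable parameters: the soft prompt prepended to the embedded inputs.
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, d_model) * 0.02)
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

embedded_inputs = torch.randn(2, 16, d_model)              # (batch, seq_len, d_model), already embedded
prompt = soft_prompt.unsqueeze(0).expand(2, -1, -1)        # broadcast the prompt over the batch
output = frozen_llm(torch.cat([prompt, embedded_inputs], dim=1))
print(output.shape)                                        # torch.Size([2, 36, 512]) = 20 + 16 positions
```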

references

https://magazine.sebastianraschka.com/p/finetuning-large-language-models
https://magazine.sebastianraschka.com/p/understanding-parameter-efficient
https://magazine.sebastianraschka.com/p/finetuning-llms-with-adapters

Batch processing for sequences

padding

In natural language processing (NLP), padding refers to the practice of adding special tokens to sequences (such as sentences or texts) so that all sequences in a batch have the same length. Padding is essential when working with mini-batch processing in neural networks because it ensures that all sequences in a batch can be processed simultaneously, despite their varying lengths.

Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to.
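
A short sketch using the Hugging Face tokenizers API that shows both ideas: the shorter sequence is padded to the batch length, and the attention mask marks real tokens (1) versus padding (0). The checkpoint name is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # illustrative checkpoint

batch = tokenizer(
    ["I love this movie.", "Terrible."],
    padding=True,                     # pad the shorter sequence to the batch's max length
    return_tensors="pt",
)

print(batch["input_ids"])             # the padded row ends with pad token ids
print(batch["attention_mask"])        # 1 = attend to this token, 0 = padding, do not attend
```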

references

https://huggingface.co/learn/nlp-course/en/chapter2/5?fw=pt

Tokenizers

what is a tokenizer

A tokenizer is a crucial component in natural language processing (NLP) and text analysis that breaks down text into smaller, manageable units called tokens. These tokens can be words, phrases, symbols, or other meaningful elements depending on the specific requirements of the application.

how tokenizers work

There are different types of tokenization methods: whitespace tokenizers, punctuation-based tokenizers, word tokenizers, sentence tokenizers, character tokenizers, N-gram tokenizers, regular expression tokenizers, and subword tokenizers.

Word Tokenizers

Word tokenization, also known as lexical analysis, is the process of splitting a piece of text into individual words or tokens. Word tokenization typically involves breaking the text into words based on spaces and punctuation.
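
A small sketch of word tokenization with a regular expression that splits on whitespace and treats punctuation as separate tokens; NLP libraries provide more robust implementations.

```python
import re

def word_tokenize(text):
    # Words (optionally with an apostrophe, e.g. contractions) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(word_tokenize("Don't split contractions, but do split punctuation!"))
# ["Don't", 'split', 'contractions', ',', 'but', 'do', 'split', 'punctuation', '!']
```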

Subword Tokenizers

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. A subword tokenizer is a type of tokenizer used in natural language processing (NLP) that breaks down words into smaller units or subwords. This approach is particularly useful for handling rare or out-of-vocabulary words, reducing the vocabulary size, and improving the efficiency of language models.

Common Subword Tokenization Methods

Byte-Pair Encoding (BPE)

BPE is an iterative algorithm that merges the most frequent pairs of characters or subwords in a corpus until a desired vocabulary size is reached.
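
A minimal sketch of one BPE training step on a toy corpus: count adjacent symbol pairs weighted by word frequency and merge the most frequent pair; a real implementation repeats this until the target vocabulary size is reached.

```python
from collections import Counter

# Toy corpus: words split into symbols, with word frequencies.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w", "e", "s", "t"): 6}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)       # ('w', 'e') for this toy corpus
print(pair, merge(corpus, pair))        # 'w' and 'e' are merged into the subword 'we'
```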

WordPiece Tokenization

Similar to BPE, WordPiece builds a vocabulary of subwords based on frequency, optimizing for a balance between vocabulary size and the ability to handle rare words.

SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly designed for Neural Network-based text generation systems. It treats the input text as a sequence of Unicode characters and uses a subword model to create subwords.

references

https://huggingface.co/learn/nlp-course/en/chapter2/4

Natural Language Inference (Recognizing Textual Entailment)

definition

Natural language inference (NLI) is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.
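
A tiny illustration of the three labels on SNLI-style premise/hypothesis pairs (the examples are for illustration only).

```python
examples = [
    # (premise, hypothesis, label)
    ("A soccer game with multiple males playing.", "Some men are playing a sport.", "entailment"),
    ("A man inspects the uniform of a figure.",    "The man is sleeping.",          "contradiction"),
    ("An older and younger man smiling.",          "Two men are smiling at cats.",  "neutral"),
]
for premise, hypothesis, label in examples:
    print(f"{label:13s} | premise: {premise} | hypothesis: {hypothesis}")
```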

benchmarks

Benchmark datasets used for NLI include SNLI, MultiNLI, SciTail, SuperGLUE, RTE, WNLI.

Problems encountered when using OpenAI's API

GPT

gpt-3.5-turbo-instruct generates empty text after being called several times.

Tried adding a space or a newline, but it didn’t work.

gpt-3.5-turbo-1106 generates different results from the same prompt even though the temperature is set to 0.

Tried setting a seed, but it didn’t work. Switching to another model version mitigated the problem.

Sampling

top-p sampling

This method considers only the smallest set of tokens whose cumulative probability exceeds the threshold p and then redistributes the probability mass across those tokens so that their probabilities sum to 1.
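
A minimal NumPy sketch of top-p (nucleus) sampling over a toy next-token distribution: keep the smallest set of tokens whose cumulative probability exceeds p, renormalize, and sample from that set.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_p_sample(probs, p=0.8):
    order = np.argsort(probs)[::-1]                    # token ids sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1        # smallest prefix whose cumulative prob exceeds p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()       # renormalize so the kept probabilities sum to 1
    return rng.choice(kept, p=kept_probs)

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])        # toy next-token distribution
print(top_p_sample(probs, p=0.8))                      # samples only from tokens 0-2 (0.9 > 0.8)
```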

temperature

The temperature controls the relative weights in the probability distribution, i.e., the extent to which differences in probability play a role in sampling. At temperature t = 0 this sampling technique turns into what we call greedy search/argmax sampling, where the token with the highest probability is always selected.
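
A minimal NumPy sketch of temperature scaling: the logits are divided by the temperature before the softmax, and as t approaches 0 the distribution collapses onto the highest-probability token (greedy/argmax behavior).

```python
import numpy as np

def softmax_with_temperature(logits, t):
    scaled = logits / t
    scaled = scaled - scaled.max()         # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5])         # toy next-token logits
for t in (1.0, 0.7, 0.1):
    print(t, softmax_with_temperature(logits, t).round(3))
# As t -> 0, nearly all probability mass goes to the highest-logit token.
```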

reference

https://blog.ml6.eu/why-openais-api-models-cannot-be-forced-to-behave-fully-deterministically-4934a7e8f184