Word Embeddings

word2vec

The word2vec tool contains two models: the skip-gram model and the continuous bag-of-words model (CBOW). Skip-gram predicts the surrounding context words given a center word. CBOW predicts the center word given the context words that surround it in a piece of text.

Negative Sampling

In the standard softmax approach, the conditional probability of a context word $w_o$ given a center word $w_c$ is

$P(w_o \mid w_c) = \dfrac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}$

This involves computing the conditional probability over every word in the vocabulary, which is very expensive when the vocabulary is large.

The core idea of negative sampling is to approximate this conditional probability distribution by randomly selecting a subset of non-context words (negative samples). Specifically, for each center word and its context word, we not only update the vector representations of these two words, but also randomly select several words that are not in the context as "negative samples" and update their vector representations as well.

In negative sampling, the standard softmax is replaced by a simplified logistic-regression objective. For each training example, we want the model to correctly distinguish the true context word from the negative samples. We can therefore define an objective that, for the context word, pushes $\sigma(\mathbf{u}_o^\top \mathbf{v}_c + b)$ toward 1, and, for each negative sample $w_k$, pushes $\sigma(\mathbf{u}_k^\top \mathbf{v}_c + b)$ toward 0. In this way, negative sampling mimics the behavior of the original softmax while only considering a small subset of the vocabulary: one positive sample plus a few negative samples. In other words, negative sampling approximates the conditional probability with a series of independent binary classification tasks.
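
A minimal NumPy sketch of this objective for a single training pair (the bias term is omitted; the vectors and the number of negatives K are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, K = 8, 5
v_c = rng.normal(size=d)            # center word vector
u_o = rng.normal(size=d)            # true context word vector (positive sample)
u_neg = rng.normal(size=(K, d))     # K randomly drawn negative-sample vectors

# Push sigma(u_o . v_c) toward 1 and sigma(u_k . v_c) toward 0,
# i.e. minimize the summed binary logistic loss:
loss = -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-(u_neg @ v_c))))
print(loss)
```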

Hierarchical Softmax

The core idea of hierarchical softmax is to build a hierarchy over the vocabulary, usually a Huffman tree, and to turn the classification problem into a sequence of binary classifications along a path in the tree. This reduces the computational complexity of the original softmax layer from linear to logarithmic in the vocabulary size. Hierarchical softmax approximates the original conditional probability by computing the probability of the path from the root to the word's leaf node.
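
A toy sketch of the path-probability idea, assuming a hypothetical leaf reached through three inner nodes; the vectors and left/right directions are made up for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Each inner node n on the path has a vector u_n; the center word has a vector v_c.
# P(w | center) is the product over the path of sigma(direction * u_n . v_c),
# where direction is +1 for a left branch and -1 for a right branch.
def path_probability(inner_vectors, directions, v_c):
    p = 1.0
    for u_n, d in zip(inner_vectors, directions):
        score = sum(a * b for a, b in zip(u_n, v_c))
        p *= sigmoid(d * score)
    return p

v_c = [0.1, -0.2, 0.4]
inner_vectors = [[0.3, 0.1, -0.2], [-0.5, 0.2, 0.1], [0.2, 0.2, 0.2]]
directions = [+1, -1, +1]          # left, right, left
print(path_probability(inner_vectors, directions, v_c))
```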

GloVe

The main feature of the GloVe model is that it tries to capture word co-occurrence statistics, i.e., how likely one word is to appear when another word appears. In this way the model learns word semantics as well as relations between words, such as synonymy, antonymy, and analogy.

Training GloVe amounts to minimizing a weighted least-squares loss between the model's prediction for a word pair (the dot product of the two word vectors plus bias terms) and the logarithm of their observed co-occurrence count.
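
A minimal NumPy sketch of the GloVe loss term for one word pair (i, j), using the commonly cited weighting function; all numbers are illustrative:

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    # down-weights rare co-occurrences and caps the weight of very frequent ones
    return (x / x_max) ** alpha if x < x_max else 1.0

rng = np.random.default_rng(0)
d = 8
w_i, w_j = rng.normal(size=d), rng.normal(size=d)   # word vector and context vector
b_i, b_j = 0.0, 0.0                                  # bias terms
x_ij = 12.0                                          # observed co-occurrence count

loss_ij = weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2
print(loss_ij)   # the full objective sums this term over all co-occurring pairs
```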

Subword Embeddings

fastText

In the skip-gram and CBOW models, different inflected forms of the same word are represented directly by different vectors with no shared parameters. To make use of morphological information, the fastText model proposes a subword embedding approach in which a subword is a character n-gram. Rather than learning word-level vector representations, fastText can be viewed as a subword-level skip-gram model where each center word is represented by the sum of its subword vectors.
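
A minimal sketch of fastText-style character n-gram extraction and the sum-of-subwords word representation; the n-gram range and the random embedding table are illustrative:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    token = "<" + word + ">"                      # boundary symbols, as in fastText
    grams = [token]                               # the special "whole word" token
    for n in range(n_min, n_max + 1):
        grams += [token[i:i + n] for i in range(len(token) - n + 1)]
    return grams

rng = np.random.default_rng(0)
d = 8
subword_vectors = {}                              # lazily created embedding table

def word_vector(word):
    vecs = []
    for g in char_ngrams(word):
        if g not in subword_vectors:
            subword_vectors[g] = rng.normal(size=d)
        vecs.append(subword_vectors[g])
    return np.sum(vecs, axis=0)                   # word = sum of its subword vectors

print(char_ngrams("where", 3, 4))
print(word_vector("where").shape)                 # (8,)
```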

Byte Pair Encoding

Byte pair encoding (BPE) is a technique originally developed for text compression. In recent years it has been reintroduced to natural language processing, especially machine translation and language modeling, as an effective way to produce subword units. The basic idea of BPE is to repeatedly merge the most frequent pair of adjacent symbols until a predefined vocabulary size is reached.
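
A minimal sketch of BPE vocabulary learning in the style of Sennrich et al.'s reference implementation; the toy corpus and the number of merges are illustrative:

```python
import re
from collections import Counter

def get_stats(vocab):
    # count how often each adjacent symbol pair occurs across the corpus
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    # merge every occurrence of the chosen pair into a single symbol
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# words are pre-split into characters plus an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                               # stop after a fixed number of merges
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)
```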

ELMo

word2vec and GloVe assign the same pretrained vector to a word regardless of its context. Given the rich polysemy and complex semantics of natural language, such context-independent representations have obvious limitations: the same word should be able to receive different representations depending on its context.

ELMo (Embeddings from Language Models) is a context-sensitive word embedding method. It uses bidirectional LSTMs (long short-term memory networks) to build a deep language model that captures information from both the left and the right context of a sentence. Training such a model yields context-sensitive word embeddings.

GPT

The original GPT is based on the Transformer architecture and uses a unidirectional Transformer, meaning that when generating text it can only attend to earlier positions, not to positions after the current one. It is trained autoregressively: the model learns to predict the next token given the preceding text.

BERT

ELMo encodes context bidirectionally but uses task-specific architectures, whereas GPT is task-agnostic but encodes context only from left to right. BERT combines the advantages of both.

A BERT input sequence explicitly represents either a single text or a pair of texts. For a single text, the BERT input sequence is the concatenation of the special classification token "<cls>", the tokens of the text sequence, and the special separation token "<sep>". For a text pair, the BERT input sequence is the concatenation of "<cls>", the tokens of the first text sequence, "<sep>", the tokens of the second text sequence, and "<sep>".
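
A minimal sketch of how such an input sequence and its segment ids can be built (in the spirit of the d2l helper); the token lists are illustrative:

```python
def tokens_and_segments(tokens_a, tokens_b=None):
    tokens = ["<cls>"] + tokens_a + ["<sep>"]
    segments = [0] * (len(tokens_a) + 2)          # segment id 0 for the first text
    if tokens_b is not None:
        tokens += tokens_b + ["<sep>"]
        segments += [1] * (len(tokens_b) + 1)     # segment id 1 for the second text
    return tokens, segments

print(tokens_and_segments(["this", "movie", "is", "great"]))
print(tokens_and_segments(["this", "movie", "is", "great"], ["i", "like", "it"]))
```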

To encode bidirectional context when representing each token, BERT randomly masks tokens and, in a self-supervised fashion, predicts the masked tokens using tokens from both directions. This task is called masked language modeling.

Although masked language modeling can encode bidirectional context for representing words, it does not explicitly model the logical relationship between pairs of texts. To help understand the relationship between two text sequences, BERT adds a binary classification task to pretraining: next sentence prediction. When generating sentence pairs for pretraining, half of the time they are indeed consecutive sentences labeled "True"; for the other half, the second sentence is randomly sampled from the corpus and labeled "False".

When pretraining BERT, the final loss function is a linear combination of the masked language modeling loss and the next sentence prediction loss.
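
A minimal PyTorch sketch of combining the two losses; all tensors are illustrative dummies rather than outputs of a real BERT:

```python
import torch
import torch.nn as nn

vocab_size, num_masked = 30522, 3

mlm_logits = torch.randn(num_masked, vocab_size)   # predictions at the masked positions
mlm_labels = torch.randint(0, vocab_size, (num_masked,))
nsp_logits = torch.randn(1, 2)                     # "is next sentence" binary classification
nsp_labels = torch.tensor([1])

loss_fn = nn.CrossEntropyLoss()
# linear combination of the two pretraining losses (equal weights here)
total_loss = loss_fn(mlm_logits, mlm_labels) + loss_fn(nsp_logits, nsp_labels)
print(total_loss)
```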

references

https://zh.d2l.ai/chapter_natural-language-processing-pretraining/word2vec.html

https://zh.d2l.ai/chapter_natural-language-processing-pretraining/subword-embedding.html

Optimization

A solution at a position where the gradient of the function is zero could be a local minimum, a local maximum, or a saddle point:

When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all positive, we have a local minimum for the function.

When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all negative, we have a local maximum for the function.

When the eigenvalues of the function’s Hessian matrix at the zero-gradient position contain both negative and positive values, we have a saddle point for the function.
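
A quick numerical check of these three cases, sketched for the toy function f(x, y) = x^2 - y^2 at its stationary point (0, 0):

```python
import numpy as np

# For f(x, y) = x**2 - y**2 the Hessian at (0, 0) is [[2, 0], [0, -2]]:
# its eigenvalues have mixed signs, so (0, 0) is a saddle point.
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])
eigvals = np.linalg.eigvalsh(hessian)

if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
else:
    print("saddle point")
print(eigvals)   # [-2.  2.]
```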

references

https://d2l.ai/chapter_optimization/optimization-intro.html

Transformer

The Transformer is a deep learning model based on the self-attention mechanism, first proposed by Vaswani et al. in the 2017 paper "Attention Is All You Need". The Transformer has achieved remarkable results in natural language processing (NLP) tasks, especially machine translation, text generation, and question answering.

The Transformer computation pipeline

The Transformer computation pipeline can be divided into the following main steps:

  1. Input Embedding
  2. Positional Encoding
  3. Encoder
  4. Decoder
  5. Output Layer

1. Input Embedding

  • Word embedding: each input word is mapped to a fixed-dimensional vector. For example, if the vocabulary size is V and the embedding dimension is d, the word embedding matrix has shape (V, d).
  • Formula
    $$
    X = W_E \cdot I
    $$
    where $X$ is the embedded vector, $W_E$ is the word embedding matrix, and $I$ is the one-hot representation of the input word index.

2. Positional Encoding

  • Purpose: the Transformer has no recurrent structure, so it needs a way to inject order information. Positional encoding does this by adding a position-dependent vector to each embedding.
  • Formula
    $$
    PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
    $$
    $$
    PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
    $$
    where $pos$ is the position, $i$ is the dimension index, and $d$ is the embedding dimension.

  • Operation: the positional encoding vector is added to the word embedding vector (a sketch follows below):
    $$
    X' = X + PE
    $$
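
A minimal NumPy sketch of the sinusoidal positional encoding, assuming an even embedding dimension; the sizes are illustrative:

```python
import numpy as np

def positional_encoding(seq_len, d):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d // 2)[None, :]                    # (1, d/2)
    angles = pos / np.power(10000, 2 * i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cos
    return pe

x = np.random.randn(10, 16)                           # toy word embeddings X
x_prime = x + positional_encoding(10, 16)             # X' = X + PE
print(x_prime.shape)                                  # (10, 16)
```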

3. Encoder

  • Structure: the encoder is a stack of identical layers, each containing two sub-layers:

    1. Multi-Head Self-Attention
    2. Feed-Forward Neural Network
  • Multi-head self-attention

    • Linear projections: the input $X'$ is passed through three different linear transformations (weight matrices $W_Q$, $W_K$, $W_V$) to obtain the query vectors $Q$, key vectors $K$, and value vectors $V$.
    • Formula
      $$
      Q = X' \cdot W_Q, \quad K = X' \cdot W_K, \quad V = X' \cdot W_V
      $$
    • Self-attention computation (see the sketch after this section)
      $$
      \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \cdot V
      $$
      where $d_k$ is the dimension of the key vectors.
    • Multi-head mechanism: the attention computation is split into several heads, each computed independently; the results are concatenated and passed through a final linear transformation:
      $$
      \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W_O
      $$
      where $h$ is the number of heads and $W_O$ is the final projection matrix.
  • Residual connection and layer normalization

    • Residual connection: the input is added directly to the sub-layer output to mitigate vanishing gradients.
    • Layer normalization: the output of each sub-layer is normalized to stabilize training.
    • Formula
      $$
      \text{LayerNorm}(X + \text{Sublayer}(X))
      $$
  • Feed-forward neural network

    • Structure: two fully connected layers with an activation function (e.g., ReLU) in between.
    • Formula
      $$
      \text{FFN}(X) = \text{Linear}(\text{ReLU}(\text{Linear}(X)))
      $$
    • Residual connection and layer normalization
      $$
      \text{LayerNorm}(X + \text{FFN}(X))
      $$
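
A minimal NumPy sketch of single-head scaled dot-product self-attention as used in the encoder; the shapes and random weights are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len) attention scores
    return softmax(scores) @ V               # weighted sum of the values

seq_len, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))      # stands in for X' after positional encoding
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 8)
```

In the multi-head case this computation is repeated h times with separate projection matrices and the outputs are concatenated before the final projection $W_O$.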

4. Decoder

  • Structure: the decoder is also a stack of identical layers, each containing three sub-layers:

    1. Masked Multi-Head Self-Attention
    2. Multi-Head Attention (encoder-decoder attention)
    3. Feed-Forward Neural Network
  • Masked multi-head self-attention

    • Purpose: prevents the current position from seeing future positions when generating the output.
    • Mask: a mask matrix $M$ is used in the self-attention computation so that the current position cannot attend to future positions (see the sketch after this section).
    • Formula
      $$
      \text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T + M}{\sqrt{d_k}}\right) \cdot V
      $$
  • Multi-head attention

    • Purpose: allows the decoder to attend to the encoder's output.
    • Formula
      $$
      \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \cdot V
      $$
      where $Q$ comes from the decoder and $K$ and $V$ come from the encoder.
  • Feed-forward neural network

    • Structure: the same as the feed-forward network in the encoder.
    • Formula
      $$
      \text{FFN}(X) = \text{Linear}(\text{ReLU}(\text{Linear}(X)))
      $$
  • Residual connection and layer normalization

    • Formula
      $$
      \text{LayerNorm}(X + \text{Sublayer}(X))
      $$
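
A minimal NumPy sketch of the causal mask $M$ (0 on and below the diagonal, negative infinity above) and its effect on the attention weights; the sizes are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

# 0 on and below the diagonal, -inf above: position i cannot attend to j > i
M = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
weights = softmax((Q @ K.T + M) / np.sqrt(d_k))
print(np.round(weights, 2))          # entries above the diagonal are exactly 0
out = weights @ V
```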

5. Output Layer

  • Linear transformation: the decoder output is mapped to the vocabulary size by a linear layer.
  • Formula
    $$
    Y = \text{Linear}(X)
    $$
  • Softmax: the result of the linear transformation is passed through a softmax to obtain a probability distribution over the vocabulary.
  • Formula
    $$
    P = \text{softmax}(Y)
    $$

Summary

The Transformer computation pipeline consists of input embedding, positional encoding, the encoder, the decoder, and the output layer. Through multi-head self-attention and feed-forward networks, the Transformer captures long-range dependencies effectively and achieves strong performance across a wide range of NLP tasks.

references

https://zhuanlan.zhihu.com/p/77307258
https://zhuanlan.zhihu.com/p/47812375

finetuning large language models

The 3 Conventional Feature-Based and Finetuning Approaches

Feature-Based Approach

In the feature-based approach, we load a pretrained LLM and apply it to our target dataset. Here, we are particularly interested in generating the output embeddings for the training set, which we can use as input features to train a classification model.
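
A minimal sketch of the feature-based approach, assuming the Hugging Face transformers and scikit-learn libraries, distilbert-base-uncased as the pretrained model, and a toy two-example dataset (all illustrative choices, not prescribed by the text):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

texts = ["the movie was great", "the movie was terrible"]   # toy training set
labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)
    features = hidden[:, 0, :].numpy()               # first-token embedding as sentence feature

clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features))
```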

Finetuning I – Updating The Output Layers

A popular approach related to the feature-based approach described above is finetuning the output layers (we will refer to this approach as finetuning I). Similar to the feature-based approach, we keep the parameters of the pretrained LLM frozen. We only train the newly added output layers.
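
A minimal PyTorch sketch of finetuning I, assuming bert-base-uncased as the frozen backbone; the head architecture and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

# Load the pretrained backbone and freeze every pretrained parameter.
backbone = AutoModel.from_pretrained("bert-base-uncased")
for param in backbone.parameters():
    param.requires_grad = False

# Newly added output layers: the only parameters that will be trained.
num_classes = 2
head = nn.Sequential(
    nn.Linear(backbone.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, num_classes),
)

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
# In the training loop (not shown): features = backbone(**batch).last_hidden_state[:, 0, :]
# logits = head(features); backpropagation only updates the head.
```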

Finetuning II – Updating All Layers

When optimizing for modeling performance, the gold standard for using pretrained LLMs is to update all layers.

parameter-efficient finetuning techniques (PEFT)

Parameter-efficient finetuning techniques (PEFT) aim to finetune an LLM with high modeling performance while requiring the training of only a small number of parameters. Techniques such as prefix tuning, adapters, and low-rank adaptation, all of which "modify" multiple layers, achieve much better predictive performance (at a low cost).
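
As one concrete PEFT example, here is a minimal PyTorch sketch of a low-rank adaptation (LoRA) layer; the rank, scaling, and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update (LoRA sketch)."""
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():        # freeze the pretrained W and b
            p.requires_grad = False
        in_f, out_f = linear.in_features, linear.out_features
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # trainable
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))         # trainable, starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # W x + scale * B A x; the update is zero at initialization, so training
        # starts exactly at the pretrained model.
        return self.linear(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 10, 768))
print(out.shape)   # torch.Size([2, 10, 768])
```

Only lora_A and lora_B receive gradients, so the number of trainable parameters is tiny compared with the frozen weight matrix.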

Reinforcement Learning with Human Feedback (RLHF)

In RLHF, human feedback is collected by having humans rank or rate different model outputs, providing a reward signal. The collected reward labels can then be used to train a reward model that is in turn used to guide the LLM's adaptation to human preferences.

The reward model itself is learned via supervised learning (typically using a pretrained LLM as base model). Next, the reward model is used to update the pretrained LLM that is to be adapted to human preferences — the training uses a flavor of reinforcement learning called proximal policy optimization.

prompt tuning

In a nutshell, prompt tuning (different from prompting) appends a tensor to the embedded inputs of a pretrained LLM. The tensor is then tuned to optimize a loss function for the finetuning task and data while all other parameters in the LLM remain frozen.

The main idea behind prompt tuning, and parameter-efficient finetuning methods in general, is to add a small number of new parameters to a pretrained LLM and only finetune the newly added parameters to make the LLM perform better on (a) a target dataset (for example, a domain-specific dataset like medical or legal documents) and (b) a target task (for example, sentiment classification).
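
A minimal PyTorch sketch of the soft-prompt idea: a small trainable tensor is prepended to the frozen model's input embeddings; the prompt length and embedding dimension are illustrative:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt embeddings prepended to the (frozen) model's input embeddings."""
    def __init__(self, num_prompt_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds):              # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)   # (batch, prompt_len + seq_len, embed_dim)

soft_prompt = SoftPrompt(num_prompt_tokens=20, embed_dim=768)
dummy_embeds = torch.randn(4, 16, 768)            # stands in for the frozen LLM's token embeddings
print(soft_prompt(dummy_embeds).shape)            # torch.Size([4, 36, 768])
```

During finetuning only the prompt tensor is updated, while all of the LLM's own parameters stay frozen.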

references

https://magazine.sebastianraschka.com/p/finetuning-large-language-models
https://magazine.sebastianraschka.com/p/understanding-parameter-efficient
https://magazine.sebastianraschka.com/p/finetuning-llms-with-adapters

Batch processing for sequences

padding

In natural language processing (NLP), padding refers to the practice of adding special tokens to sequences (such as sentences or texts) so that all sequences in a batch have the same length. Padding is essential when working with mini-batch processing in neural networks because it ensures that all sequences in a batch can be processed simultaneously, despite their varying lengths.

Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to.
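
A minimal sketch showing how a batch of token-id sequences is padded to a common length and how the matching attention mask is built; the token ids and pad id are illustrative:

```python
sequences = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]   # toy token-id sequences
pad_id = 0                                                    # illustrative padding token id

max_len = max(len(seq) for seq in sequences)
input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]

print(input_ids)        # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(attention_mask)   # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```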

references

https://huggingface.co/learn/nlp-course/en/chapter2/5?fw=pt

Tokenizers

what is a tokenizer

A tokenizer is a crucial component in natural language processing (NLP) and text analysis that breaks down text into smaller, manageable units called tokens. These tokens can be words, phrases, symbols, or other meaningful elements depending on the specific requirements of the application.

how tokenizers work

There are different types of tokenization methods: whitespace tokenizers, punctuation-based tokenizers, word tokenizers, sentence tokenizers, character tokenizers, n-gram tokenizers, regular expression tokenizers, and subword tokenizers.

Word Tokenizers

Word tokenization, also known as lexical analysis, is the process of splitting a piece of text into individual words or tokens. Word tokenization typically involves breaking the text into words based on spaces and punctuation.

Subword Tokenizers

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. A subword tokenizer is a type of tokenizer used in natural language processing (NLP) that breaks down words into smaller units or subwords. This approach is particularly useful for handling rare or out-of-vocabulary words, reducing the vocabulary size, and improving the efficiency of language models.
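
A quick illustration, assuming the Hugging Face transformers library and the bert-base-uncased WordPiece tokenizer; the exact splits depend on the learned vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("tokenization"))   # e.g. ['token', '##ization']: a rarer word split into subwords
print(tokenizer.tokenize("the"))            # ['the']: a frequent word kept whole
```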

Common Subword Tokenization Methods

Byte-Pair Encoding (BPE)

BPE is an iterative algorithm that merges the most frequent pairs of characters or subwords in a corpus until a desired vocabulary size is reached.

WordPiece Tokenization

Similar to BPE, WordPiece builds a vocabulary of subwords based on frequency, optimizing for a balance between vocabulary size and the ability to handle rare words.

SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly designed for Neural Network-based text generation systems. It treats the input text as a sequence of Unicode characters and uses a subword model to create subwords.
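
A minimal usage sketch, assuming the sentencepiece Python package and a plain-text training file named corpus.txt (both assumptions, not given in the text); vocabulary size and model type are illustrative:

```python
import sentencepiece as spm

# Train a subword model directly from raw text (no pre-tokenization needed).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp", vocab_size=8000, model_type="unigram"
)

sp = spm.SentencePieceProcessor(model_file="sp.model")
print(sp.encode("This is a test.", out_type=str))   # list of subword pieces
```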

references

https://huggingface.co/learn/nlp-course/en/chapter2/4

Sampling

top-p sampling

This method keeps only the smallest set of most-probable tokens whose cumulative probability exceeds p and then renormalizes the probability mass over this set so that it sums to 1.
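
A minimal NumPy sketch of top-p filtering over a toy next-token distribution:

```python
import numpy as np

# In general the probabilities are first sorted in descending order;
# this toy distribution is already sorted for clarity.
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
p = 0.8

cumulative = np.cumsum(probs)
cutoff = np.searchsorted(cumulative, p) + 1      # smallest prefix whose mass exceeds p
kept = probs[:cutoff]
kept = kept / kept.sum()                         # renormalize so the kept mass sums to 1

print(cutoff, kept)                              # 3 tokens kept, renormalized probabilities
```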

temperature

The temperature controls the relative weights in the probability distribution, i.e., the extent to which differences in probability affect sampling. At temperature t=0 this sampling technique reduces to greedy search/argmax sampling, where the token with the highest probability is always selected.
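
A minimal NumPy sketch of temperature-scaled sampling, including the t = 0 greedy limit; the logits are illustrative:

```python
import numpy as np

def sample_with_temperature(logits, t, rng):
    if t == 0:                                  # the t -> 0 limit: greedy / argmax sampling
        return int(np.argmax(logits))
    scaled = logits / t
    probs = np.exp(scaled - scaled.max())       # softmax with temperature
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.5])
rng = np.random.default_rng(0)
print(sample_with_temperature(logits, 0.0, rng))   # always the argmax (index 0)
print(sample_with_temperature(logits, 1.0, rng))   # stochastic
print(sample_with_temperature(logits, 2.0, rng))   # flatter distribution, more diverse samples
```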

reference

https://blog.ml6.eu/why-openais-api-models-cannot-be-forced-to-behave-fully-deterministically-4934a7e8f184

Measuring sentence similarity

metrics

BLEU (Bilingual Evaluation Understudy)

BLEU computes a score based on the n-gram overlap between the generated text and the reference text, together with a brevity penalty to handle cases where the generated text is too short. The score ranges from 0 to 1, where 1 indicates a perfect match with the reference translations.
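
A minimal, self-contained sketch of a BLEU-style score (modified n-gram precisions up to 4-grams plus a brevity penalty); real evaluations would normally use an established implementation such as sacrebleu or NLTK:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum((cand & ref).values())            # clipped n-gram counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)   # tiny smoothing to avoid log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)          # brevity penalty
    return bp * geo_mean

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(round(bleu(cand, ref), 3))
```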

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE score measures the similarity between the machine-generated summary and the reference summaries using overlapping n-grams, word sequences that appear in both the machine-generated summary and the reference summaries. ROUGE score ranges from 0 to 1, with higher values indicating better summary quality.

ROUGE has several variants, including ROUGE-N, ROUGE-L, and ROUGE-S.
ROUGE-N measures the overlap of n-grams (contiguous sequences of n words) between the candidate text and the reference text. It computes precision, recall, and F1-score based on the n-gram overlap.
ROUGE-L measures the longest common subsequence (LCS) between the candidate text and the reference text. It computes precision, recall, and F1-score based on the length of the LCS.
ROUGE-S measures the skip-bigram overlap (pairs of words in their sentence order, allowing for gaps) between the candidate text and the reference text. It computes precision, recall, and F1-score based on the skip-bigram overlap.
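
A minimal, self-contained sketch of ROUGE-N (here N = 2) computing precision, recall, and F1 from bigram overlap; the example sentences are illustrative:

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum((cand & ref).values())              # matching n-grams (clipped counts)
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))
```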

references

https://medium.com/@sthanikamsanthosh1994/understanding-bleu-and-rouge-score-for-nlp-evaluation-1ab334ecadcb