Word embeddings

word2vec

The word2vec toolkit contains two models: the skip-gram model and the continuous bag-of-words model (CBOW). Skip-gram predicts the surrounding context words given a center word. CBOW predicts a center word given the context words that surround it in a piece of text.

Negative sampling

In the standard softmax approach, for a center word $w_c$ and a context word $w_o$ we have

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)},$$

where $\mathbf{u}_o$ and $\mathbf{v}_c$ are the context-word and center-word vectors and $\mathcal{V}$ is the vocabulary. This involves computing the conditional probability over every word in the vocabulary, which is very expensive when the vocabulary is large.

The core idea of negative sampling is to approximate this conditional distribution by randomly selecting a few non-context words (negative samples). Concretely, for each center word and its corresponding context word, we update not only the vectors of these two words but also the vectors of several randomly chosen words that do not appear in the context, the "negative samples". Negative sampling replaces the standard softmax with a simplified logistic-regression-style objective: for each training example, the model should distinguish the true context word from the negative samples. We therefore want $\sigma(\mathbf{u}_o^\top \mathbf{v}_c)$ to be close to 1 for the context word and $\sigma(\mathbf{u}_k^\top \mathbf{v}_c)$ to be close to 0 for each negative sample $w_k$. In this way negative sampling mimics the behavior of the original softmax while only looking at a small subset of the vocabulary, one positive sample plus a few negatives, so the conditional probability is approximated by a series of independent binary classification tasks.
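
As a rough illustration, the following PyTorch sketch computes this objective for one batch; the vocabulary size, embedding dimension, batch shapes, and the assumption that negatives have already been sampled are all made up for the example.

    import torch
    import torch.nn.functional as F

    # Toy setup: `in_embed` holds center-word vectors v, `out_embed` holds
    # context-word vectors u, and k negatives are pre-sampled per example.
    vocab_size, dim, k = 10000, 100, 5
    in_embed = torch.nn.Embedding(vocab_size, dim)    # v_c
    out_embed = torch.nn.Embedding(vocab_size, dim)   # u_o

    center = torch.randint(0, vocab_size, (32,))       # center word ids
    context = torch.randint(0, vocab_size, (32,))      # true context word ids
    negatives = torch.randint(0, vocab_size, (32, k))  # k negative samples each

    v = in_embed(center)                               # (32, dim)
    pos_score = (out_embed(context) * v).sum(-1)       # u_o . v_c
    neg_score = torch.bmm(out_embed(negatives), v.unsqueeze(-1)).squeeze(-1)

    # Push sigma(u_o . v_c) toward 1 for the true pair and sigma(u_k . v_c)
    # toward 0 (i.e., sigma(-u_k . v_c) toward 1) for each negative sample.
    loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1)).mean()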

Hierarchical softmax

The core idea of hierarchical softmax is to build a hierarchy over the vocabulary, typically a Huffman tree, and to turn the classification problem into a sequence of binary decisions along a path in the tree. This reduces the computational complexity of the original softmax layer from linear in the vocabulary size to logarithmic. Hierarchical softmax approximates the original conditional probability by the probability of the path from the root to the word's leaf node.
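
Concretely, in the notation of the softmax formula above, the path-based approximation is usually written as

$$P(w_o \mid w_c) = \prod_{j=1}^{L(w_o)-1} \sigma\left( [\![ n(w_o, j+1) = \text{leftChild}(n(w_o, j)) ]\!] \cdot \mathbf{u}_{n(w_o,j)}^\top \mathbf{v}_c \right),$$

where $L(w_o)$ is the number of nodes on the path from the root to the leaf of $w_o$, $n(w_o, j)$ is the $j$-th node on that path (with vector $\mathbf{u}_{n(w_o,j)}$), and $[\![x]\!]$ equals $1$ if $x$ is true and $-1$ otherwise. Each factor is a binary decision at an inner node, so only about $\log_2 |\mathcal{V}|$ terms are evaluated per prediction.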

GloVe

The main characteristic of GloVe is that it tries to capture word co-occurrence statistics, i.e., how likely one word is to appear in the company of another. In this way the model can learn word semantics and relations between words, such as synonymy, antonymy, and analogy relations.

Training GloVe amounts to minimizing a loss that measures the discrepancy between the model's predicted co-occurrence statistics and the actual co-occurrence counts observed in the corpus.
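
For reference, the objective from the original GloVe paper is commonly written as

$$J = \sum_{i,j=1}^{|\mathcal{V}|} f(X_{ij}) \left( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,$$

where $X_{ij}$ counts how often word $j$ occurs in the context of word $i$, $\mathbf{w}_i$ and $\tilde{\mathbf{w}}_j$ are the word and context vectors, the $b$ terms are biases, and $f$ is a weighting function that down-weights rare co-occurrences and caps the influence of very frequent ones.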

Subword embeddings

fastText

In the skip-gram and CBOW models, different inflected forms of the same word are represented by separate vectors with no shared parameters. To exploit morphological information, the fastText model proposes a subword embedding approach in which a subword is a character n-gram. fastText can be viewed as a subword-level skip-gram model: instead of learning word-level vectors directly, it represents each center word as the sum of its subword vectors.
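
A small sketch of the subword extraction step; the boundary markers and the 3-to-6 character n-gram range follow the commonly cited fastText defaults, but details may differ from any particular release.

    def char_ngrams(word, n_min=3, n_max=6):
        """Character n-grams of a word plus the whole word, fastText-style."""
        token = f"<{word}>"          # '<' and '>' mark the word boundaries
        grams = {token}              # the full word is kept as one subword
        for n in range(n_min, n_max + 1):
            for i in range(len(token) - n + 1):
                grams.add(token[i:i + n])
        return grams

    # The vector of "where" would then be the sum of the vectors of these subwords.
    print(sorted(char_ngrams("where")))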

Byte Pair Encoding

Byte Pair Encoding (BPE) is a technique originally used for data compression. In recent years BPE has been reintroduced into natural language processing, especially machine translation and language modeling, as an effective way to produce subword units. The basic idea of BPE is to repeatedly merge the most frequent pair of adjacent symbols until a predefined vocabulary size is reached.

ELMo

word2vec and GloVe assign the same pretrained vector to a word regardless of its context. Given the rich polysemy and complex semantics of natural language, such context-independent representations have obvious limitations: the same word should be able to receive different representations depending on its context.

ELMo (Embeddings from Language Models) is a context-sensitive word embedding method. It builds a deep language model with bidirectional LSTMs (long short-term memory networks), which can capture information from both the left and the right side of a sentence. Training such a model yields context-sensitive word embeddings.

GPT

The original GPT is based on the Transformer architecture and uses a unidirectional Transformer, which means that when generating text it can only attend to earlier positions, not to positions after the current one. It is trained autoregressively: the model learns to predict the next token given the preceding text.

BERT

ELMo encodes context bidirectionally but uses task-specific architectures, whereas GPT is task-agnostic but encodes context only from left to right. BERT combines the best of both.

A BERT input sequence explicitly represents either a single text or a text pair. For a single text, the BERT input sequence is the concatenation of the special classification token "<cls>", the tokens of the text, and the special separator token "<sep>". For a text pair, the BERT input sequence is the concatenation of "<cls>", the tokens of the first text, "<sep>", the tokens of the second text, and "<sep>".

To encode bidirectional context when representing each token, BERT randomly masks tokens and, in a self-supervised fashion, predicts the masked tokens using tokens from both directions. This task is called masked language modeling.

Although masked language modeling encodes bidirectional context to represent words, it does not explicitly model the logical relationship between pairs of texts. To help the model understand the relationship between two text sequences, BERT adds a binary classification task to pretraining: next sentence prediction. When generating sentence pairs for pretraining, half of the time the two sentences really are consecutive and are labeled "true"; the other half of the time the second sentence is randomly sampled from the corpus and labeled "false".

When pretraining BERT, the final loss function is a linear combination of the masked language modeling loss and the next sentence prediction loss.

references

https://zh.d2l.ai/chapter_natural-language-processing-pretraining/word2vec.html

https://zh.d2l.ai/chapter_natural-language-processing-pretraining/subword-embedding.html

finetuning large language models

The 3 Conventional Feature-Based and Finetuning Approaches

Feature-Based Approach

In the feature-based approach, we load a pretrained LLM and apply it to our target dataset. Here, we are particularly interested in generating the output embeddings for the training set, which we can use as input features to train a classification model.
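
A minimal sketch of this recipe, assuming a Hugging Face checkpoint ("bert-base-uncased" is just an illustrative choice) and scikit-learn for the downstream classifier:

    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    encoder.eval()

    texts = ["great movie", "terrible plot"]   # toy training set
    labels = [1, 0]

    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state    # (batch, seq_len, hidden)
        features = hidden[:, 0, :].numpy()             # first-token embedding per text

    # The frozen LLM only supplies features; the classifier is trained separately.
    clf = LogisticRegression().fit(features, labels)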

Finetuning I – Updating The Output Layers

A popular approach related to the feature-based approach described above is finetuning the output layers (we will refer to this approach as finetuning I). Similar to the feature-based approach, we keep the parameters of the pretrained LLM frozen. We only train the newly added output layers.
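
A sketch of what "freeze the backbone, train a new output layer" can look like in PyTorch; the checkpoint, head size, and optimizer settings are illustrative:

    import torch
    from transformers import AutoModel

    backbone = AutoModel.from_pretrained("bert-base-uncased")
    for param in backbone.parameters():
        param.requires_grad = False            # keep the pretrained weights fixed

    num_classes = 2
    head = torch.nn.Linear(backbone.config.hidden_size, num_classes)  # new output layer

    # Only the new head's parameters are handed to the optimizer.
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)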

Finetuning II – Updating All Layers

When optimizing for modeling performance, the gold standard for using pretrained LLMs is to update all layers.

parameter-efficient finetuning techniques (PEFT)

Parameter-efficient finetuning aims to finetune an LLM with high modeling performance while requiring the training of only a small number of parameters. These methods are usually referred to as parameter-efficient finetuning techniques (PEFT). Techniques such as prefix tuning, adapters, and low-rank adaptation, all of which "modify" multiple layers, achieve much better predictive performance (at a low cost).
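
As a conceptual sketch (not a reference implementation), low-rank adaptation can be illustrated by wrapping a frozen linear layer with a trainable low-rank update; the rank and scaling values below are arbitrary:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.linear = linear
            self.linear.weight.requires_grad = False      # frozen pretrained weight
            if self.linear.bias is not None:
                self.linear.bias.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(linear.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # Original output plus the trainable low-rank correction.
            return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(768, 768))
    out = layer(torch.randn(4, 768))   # only A and B receive gradients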

Reinforcement Learning with Human Feedback (RLHF)

In RLHF, human feedback is collected by having humans rank or rate different model outputs, providing a reward signal. The collected reward labels can then be used to train a reward model that is in turn used to guide the LLM's adaptation to human preferences.

The reward model itself is learned via supervised learning (typically using a pretrained LLM as base model). Next, the reward model is used to update the pretrained LLM that is to be adapted to human preferences — the training uses a flavor of reinforcement learning called proximal policy optimization.

prompt tuning

In a nutshell, prompt tuning (different from prompting) appends a tensor to the embedded inputs of a pretrained LLM. The tensor is then tuned to optimize a loss function for the finetuning task and data while all other parameters in the LLM remain frozen.

The main idea behind prompt tuning, and parameter-efficient finetuning methods in general, is to add a small number of new parameters to a pretrained LLM and only finetune the newly added parameters to make the LLM perform better on (a) a target dataset (for example, a domain-specific dataset like medical or legal documents) and (b) a target task (for example, sentiment classification).
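
A minimal sketch of that idea with made-up sizes: a trainable block of "virtual token" embeddings is concatenated in front of the input embeddings, and only this block is optimized.

    import torch
    import torch.nn as nn

    embed_dim, num_virtual_tokens = 768, 20
    soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def prepend_soft_prompt(input_embeds):               # (batch, seq, dim)
        batch_size = input_embeds.size(0)
        prompt = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)  # (batch, 20 + seq, dim)

    # During finetuning, only `soft_prompt` is updated; the LLM weights stay frozen.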

references

https://magazine.sebastianraschka.com/p/finetuning-large-language-models
https://magazine.sebastianraschka.com/p/understanding-parameter-efficient
https://magazine.sebastianraschka.com/p/finetuning-llms-with-adapters

Batch processing for sequences

padding

In natural language processing (NLP), padding refers to the practice of adding special tokens to sequences (such as sentences or texts) so that all sequences in a batch have the same length. Padding is essential when working with mini-batch processing in neural networks because it ensures that all sequences in a batch can be processed simultaneously, despite their varying lengths.
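
A hand-rolled illustration with made-up token ids (the choice of 0 as the padding id is arbitrary here):

    pad_id = 0
    batch = [[101, 7592, 2088, 102], [101, 2023, 102]]   # sequences of unequal length

    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    print(padded)   # every sequence now has length max_len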

Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to.
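
With the Hugging Face tokenizers covered in the reference below, padding and the corresponding attention mask are produced together; the checkpoint name here is just an example:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    batch = tokenizer(["Hello world", "A slightly longer second sentence"],
                      padding=True, return_tensors="pt")

    print(batch["input_ids"])       # the shorter sequence is padded with the [PAD] id
    print(batch["attention_mask"])  # 1 for real tokens, 0 for padding positions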

references

https://huggingface.co/learn/nlp-course/en/chapter2/5?fw=pt

Tokenizers

what is a tokenizer

A tokenizer is a crucial component in natural language processing (NLP) and text analysis that breaks down text into smaller, manageable units called tokens. These tokens can be words, phrases, symbols, or other meaningful elements depending on the specific requirements of the application.

how a tokenizer works

There are different types of tokenization methods: Whitespace Tokenizers, Punctuation-Based Tokenizers, Word Tokenizers, Sentence Tokenizers, Character Tokenizers, N-gram Tokenizers, Regular Expression Tokenizers, and Subword Tokenizers.

Word Tokenizers

Word tokenization, also known as lexical analysis, is the process of splitting a piece of text into individual words or tokens. Word tokenization typically involves breaking the text into words based on spaces and punctuation.

Subword Tokenizers

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. A subword tokenizer is a type of tokenizer used in natural language processing (NLP) that breaks down words into smaller units or subwords. This approach is particularly useful for handling rare or out-of-vocabulary words, reducing the vocabulary size, and improving the efficiency of language models.

Common Subword Tokenization Methods

Byte-Pair Encoding (BPE)

BPE is an iterative algorithm that merges the most frequent pairs of characters or subwords in a corpus until a desired vocabulary size is reached.
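
The learning loop can be sketched in a few lines of Python; the toy word-frequency table and the number of merges are made up, and "</w>" marks the end of a word:

    import re, collections

    def get_stats(vocab):
        """Count how often each adjacent symbol pair occurs in the vocabulary."""
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[symbols[i], symbols[i + 1]] += freq
        return pairs

    def merge_vocab(pair, vocab):
        """Merge every occurrence of the given symbol pair into one symbol."""
        bigram = re.escape(" ".join(pair))
        pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(10):                       # 10 merges for the toy example
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        vocab = merge_vocab(best, vocab)
    print(vocab)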

WordPiece Tokenization

Similar to BPE, WordPiece builds a vocabulary of subwords based on frequency, optimizing for a balance between vocabulary size and the ability to handle rare words.

SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly designed for Neural Network-based text generation systems. It treats the input text as a sequence of Unicode characters and uses a subword model to create subwords.

references

https://huggingface.co/learn/nlp-course/en/chapter2/4

Natural Language Inference (Recognizing Textual Entailment)

definition

Natural language inference (NLI) is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.
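
As an illustration, a pretrained NLI classifier can be queried with a premise/hypothesis pair; the publicly available `roberta-large-mnli` checkpoint and the example sentences are used purely for demonstration:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
    model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    inputs = tok("A soccer game with multiple males playing.",   # premise
                 "Some men are playing a sport.",                # hypothesis
                 return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]

    # id2label maps each index to the model's entailment/neutral/contradiction labels.
    print({model.config.id2label[i]: round(p.item(), 3) for i, p in enumerate(probs)})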

benchmarks

Benchmark datasets used for NLI include SNLI, MultiNLI, SciTail, SuperGLUE, RTE, WNLI.

Problems record of using OpenAI's API

GPT

gpt-3.5-turbo-instruct generates empty text after being called several times.

Tried adding a space or a newline, but it didn't work.

gpt-3.5-turbo-1106 generates different results for the same prompt even though the temperature is set to 0.

Tried setting a seed, but it didn't work. Switching to another version mitigated this problem.

Sampling

top-p sampling

This method keeps the smallest set of most-probable tokens whose cumulative probability exceeds the threshold p, and then redistributes the probability mass across those tokens so that their probabilities sum to 1; everything outside this set is excluded from sampling.
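
A minimal NumPy sketch of this procedure (the threshold and the toy distribution are arbitrary):

    import numpy as np

    def top_p_sample(probs, p=0.9, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        order = np.argsort(probs)[::-1]                   # most to least probable
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, p)) + 1  # smallest prefix with mass >= p
        nucleus = order[:cutoff]
        nucleus_probs = probs[nucleus] / probs[nucleus].sum()   # renormalize to sum to 1
        return rng.choice(nucleus, p=nucleus_probs)

    probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
    print(top_p_sample(probs, p=0.9))   # samples only from the high-probability nucleus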

temperature

The temperature controls the relative weights in the probability distribution: it determines the extent to which differences in probability play a role in sampling. At temperature t = 0 this sampling technique turns into what we call greedy search/argmax sampling, where the token with the highest probability is always selected.
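
A small sketch of temperature-scaled softmax over made-up logits, showing how a low temperature concentrates almost all of the probability mass on the top token:

    import numpy as np

    def softmax_with_temperature(logits, t):
        scaled = np.asarray(logits, dtype=float) / t
        scaled -= scaled.max()                   # subtract max for numerical stability
        exp = np.exp(scaled)
        return exp / exp.sum()

    logits = [2.0, 1.0, 0.5]
    print(softmax_with_temperature(logits, 1.0))   # relatively spread out
    print(softmax_with_temperature(logits, 0.1))   # nearly all mass on the first token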

reference

https://blog.ml6.eu/why-openais-api-models-cannot-be-forced-to-behave-fully-deterministically-4934a7e8f184

Measuring sentence similarity

metrics

BLEU (Bilingual Evaluation Understudy)

BLEU computes a score based on the n-gram overlap between the generated text and the reference text, as well as the brevity penalty to handle cases where the generated text is too short. The score ranges from 0 to 1, where 1 indicates a perfect match with the reference translations.
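
For reference, the corpus-level score is usually written as a brevity penalty times a geometric mean of modified n-gram precisions $p_n$ (with uniform weights $w_n = 1/N$ and typically $N = 4$):

$$\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right), \qquad \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

where $c$ is the length of the candidate text and $r$ is the effective reference length.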

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE score measures the similarity between the machine-generated summary and the reference summaries using overlapping n-grams, word sequences that appear in both the machine-generated summary and the reference summaries. ROUGE score ranges from 0 to 1, with higher values indicating better summary quality.

ROUGE scores are branched into ROUGE-N, ROUGE-L, and ROUGE-S.
ROUGE-N measures the overlap of n-grams (contiguous sequences of n words) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the n-gram overlap.
ROUGE-L measures the longest common subsequence (LCS) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the length of the LCS.
ROUGE-S measures the skip-bigram (bi-gram with at most one intervening word) overlap between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the skip-bigram overlap.
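
A minimal sketch of ROUGE-L on whitespace tokens; for simplicity it reports a plain F1 rather than the weighted F-measure used in the original ROUGE definition:

    def lcs_length(a, b):
        """Length of the longest common subsequence of two token lists."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
        return dp[-1][-1]

    def rouge_l(candidate, reference):
        cand, ref = candidate.split(), reference.split()
        lcs = lcs_length(cand, ref)
        precision, recall = lcs / len(cand), lcs / len(ref)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    print(rouge_l("the cat sat on the mat", "the cat is on the mat"))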

references

https://medium.com/@sthanikamsanthosh1994/understanding-bleu-and-rouge-score-for-nlp-evaluation-1ab334ecadcb