finetuning large language models

The 3 Conventional Feature-Based and Finetuning Approaches

Feature-Based Approach

In the feature-based approach, we load a pretrained LLM and apply it to our target dataset. Here, we are particularly interested in generating the output embeddings for the training set, which we can use as input features to train a classification model.
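
For example, a minimal sketch of the feature-based approach using the transformers and scikit-learn libraries (the checkpoint name distilbert-base-uncased and the toy dataset are purely illustrative choices):

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

texts = ["great movie", "terrible plot"]   # toy training set
labels = [1, 0]

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # [batch, seq_len, dim]
    features = hidden.mean(dim=1).numpy()          # mean-pooled output embeddings

clf = LogisticRegression().fit(features, labels)   # classifier trained on frozen features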

Finetuning I – Updating The Output Layers

A popular approach related to the feature-based approach described above is finetuning the output layers (we will refer to this approach as finetuning I). Similar to the feature-based approach, we keep the parameters of the pretrained LLM frozen. We only train the newly added output layers.
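
A minimal sketch of finetuning I, assuming a transformers sequence-classification model whose output head is exposed as classifier (the attribute name varies by architecture):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# freeze all pretrained parameters ...
for param in model.parameters():
    param.requires_grad = False
# ... then unfreeze only the newly added output layer
for param in model.classifier.parameters():
    param.requires_grad = True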

Finetuning II – Updating All Layers

When optimizing for modeling performance, the gold standard for using pretrained LLMs is to update all layers.

parameter-efficient finetuning techniques (PEFT)

Parameter-efficient finetuning techniques (PEFT) aim to finetune an LLM with high modeling performance while requiring the training of only a small number of parameters. Techniques such as prefix tuning, adapters, and low-rank adaptation, all of which “modify” multiple layers, achieve much better predictive performance (at a low cost).
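
As an illustration of the idea behind low-rank adaptation, here is a from-scratch PyTorch sketch (not the peft library's implementation): the pretrained weight matrix stays frozen and only the two small low-rank matrices are trained.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear                       # pretrained layer, kept frozen
        for p in self.linear.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(linear.in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, linear.out_features))
        self.scaling = alpha / rank

    def forward(self, x):
        # frozen path + trainable low-rank update
        return self.linear(x) + (x @ self.lora_A @ self.lora_B) * self.scaling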

Reinforcement Learning with Human Feedback (RLHF)

In RLHF, human feedback is collected by having humans rank or rate different model outputs, providing a reward signal. The collected reward labels can then be used to train a reward model that is in turn used to guide the LLM's adaptation to human preferences.

The reward model itself is learned via supervised learning (typically using a pretrained LLM as base model). Next, the reward model is used to update the pretrained LLM that is to be adapted to human preferences — the training uses a flavor of reinforcement learning called proximal policy optimization.

prompt tuning

In a nutshell, prompt tuning (different from prompting) appends a tensor to the embedded inputs of a pretrained LLM. The tensor is then tuned to optimize a loss function for the finetuning task and data while all other parameters in the LLM remain frozen.

The main idea behind prompt tuning, and parameter-efficient finetuning methods in general, is to add a small number of new parameters to a pretrained LLM and only finetune the newly added parameters to make the LLM perform better on (a) a target dataset (for example, a domain-specific dataset like medical or legal documents) and (b) a target task (for example, sentiment classification).
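
A minimal PyTorch sketch of the prompt-tuning idea (the class and parameter names here are made up for illustration): a small number of trainable "soft prompt" vectors are prepended to the embedded inputs while the LLM itself stays frozen.

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_virtual_tokens: int, embed_dim: int):
        super().__init__()
        # the only trainable parameters: one embedding vector per virtual token
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, embed_dim) * 0.01)

    def forward(self, input_embeds):               # input_embeds: [batch, seq, dim]
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)   # prepend the soft prompt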

references

https://magazine.sebastianraschka.com/p/finetuning-large-language-models
https://magazine.sebastianraschka.com/p/understanding-parameter-efficient
https://magazine.sebastianraschka.com/p/finetuning-llms-with-adapters

Batch processing for sequences

padding

In natural language processing (NLP), padding refers to the practice of adding special tokens to sequences (such as sentences or texts) so that all sequences in a batch have the same length. Padding is essential when working with mini-batch processing in neural networks because it ensures that all sequences in a batch can be processed simultaneously, despite their varying lengths.

Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to.
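
For illustration, with a Hugging Face tokenizer (bert-base-uncased used as an example checkpoint), padding and the attention mask are produced together:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Hello world", "A much longer example sentence"], padding=True, return_tensors="pt")
print(batch["input_ids"])        # shorter sequence padded to the same length with the pad token id
print(batch["attention_mask"])   # 1 for real tokens, 0 for padding tokens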

references

https://huggingface.co/learn/nlp-course/en/chapter2/5?fw=pt

Tokenizers

what is tokenizer

A tokenizer is a crucial component in natural language processing (NLP) and text analysis that breaks down text into smaller, manageable units called tokens. These tokens can be words, phrases, symbols, or other meaningful elements depending on the specific requirements of the application.

how tokenizer works

There are different types of tokenization methods: Whitespace Tokenizers, Punctuation-Based Tokenizers, Word Tokenizers, Sentence Tokenizers, Character Tokenizers, N-gram Tokenizers, Regular Expression Tokenizers, and Subword Tokenizers.

Word Tokenizers

Word tokenization, also known as lexical analysis, is the process of splitting a piece of text into individual words or tokens. Word tokenization typically involves breaking the text into words based on spaces and punctuation.

Subword Tokenizers

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. A subword tokenizer is a type of tokenizer used in natural language processing (NLP) that breaks down words into smaller units or subwords. This approach is particularly useful for handling rare or out-of-vocabulary words, reducing the vocabulary size, and improving the efficiency of language models.

Common Subword Tokenization Methods

Byte-Pair Encoding (BPE)

BPE is an iterative algorithm that merges the most frequent pairs of characters or subwords in a corpus until a desired vocabulary size is reached.
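
A toy sketch of a single BPE merge step (simplified: real implementations guard symbol boundaries more carefully when merging):

from collections import Counter

# toy corpus: each word as space-separated symbols, with its frequency
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]      # the most frequent adjacent pair

def merge_pair(pair, vocab):
    merged = " ".join(pair)
    return {word.replace(merged, "".join(pair)): freq for word, freq in vocab.items()}

pair = most_frequent_pair(vocab)
vocab = merge_pair(pair, vocab)            # e.g. merging ('e', 's') turns "n e w e s t" into "n e w es t"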

WordPiece Tokenization

Similar to BPE, WordPiece builds a vocabulary of subwords based on frequency, optimizing for a balance between vocabulary size and the ability to handle rare words.

SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly designed for Neural Network-based text generation systems. It treats the input text as a sequence of Unicode characters and uses a subword model to create subwords.

references

https://huggingface.co/learn/nlp-course/en/chapter2/4

Natural Language Inference(Recognizing Textual Entailment)

definition

Natural language inference (NLI) is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.
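
A hedged example of running NLI with an off-the-shelf MNLI model via transformers (the checkpoint name roberta-large-mnli is an assumption; any NLI-finetuned model would work):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])   # entailment / neutral / contradiction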

benchmarks

Benchmark datasets used for NLI include SNLI, MultiNLI, SciTail, SuperGLUE, RTE, WNLI.

Problems record of using OpenAI's API

GPT

gpt-3.5-turbo-instruct generates empty text after several calls.

Tried adding a space or a newline, but it didn't work.

gpt-3.5-turbo-1106 generates different results from the same prompt even though the temperature is set to 0.

Tried setting the seed, but it didn't work. Switching to another model version mitigated the problem.

Sampling

top-p sampling

This method considers only the smallest set of tokens whose cumulative probability exceeds the probability p, and then redistributes the probability mass across this set so that the probabilities sum to 1.

temperature

The temperature controls the relative weights in the probability distribution, i.e., the extent to which differences in probability play a role in sampling. At temperature t=0, this sampling technique turns into what we call greedy search/argmax sampling, where the token with the highest probability is always selected.
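
A quick numerical illustration of temperature scaling applied to token logits (NumPy, toy values):

import numpy as np

def softmax_with_temperature(logits, t):
    scaled = np.array(logits) / t
    exp = np.exp(scaled - scaled.max())    # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))   # moderately peaked distribution
print(softmax_with_temperature(logits, 0.1))   # nearly one-hot (close to greedy)
print(softmax_with_temperature(logits, 5.0))   # close to uniform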

reference

https://blog.ml6.eu/why-openais-api-models-cannot-be-forced-to-behave-fully-deterministically-4934a7e8f184

Measuring sentence similarity

metrics

BLEU (Bilingual Evaluation Understudy)

BLEU computes a score based on the n-gram overlap between the generated text and the reference text, as well as the brevity penalty to handle cases where the generated text is too short. The score ranges from 0 to 1, where 1 indicates a perfect match with the reference translations.
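
For example, sentence-level BLEU can be computed with NLTK (one common option; sacrebleu is another):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]     # list of tokenized reference translations
candidate = ["the", "cat", "sits", "on", "the", "mat"]     # tokenized candidate translation
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(score)   # between 0 and 1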

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE score measures the similarity between the machine-generated summary and the reference summaries using overlapping n-grams, word sequences that appear in both the machine-generated summary and the reference summaries. ROUGE score ranges from 0 to 1, with higher values indicating better summary quality.

ROUGE scores are branched into ROUGE-N, ROUGE-L, and ROUGE-S.
ROUGE-N measures the overlap of n-grams (contiguous sequences of n words) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the n-gram overlap.
ROUGE-L measures the longest common subsequence (LCS) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the length of the LCS.
ROUGE-S measures the skip-bigram (bi-gram with at most one intervening word) overlap between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the skip-bigram overlap.
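
For example, ROUGE can be computed with the rouge-score package (one common implementation):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("the cat sat on the mat",          # reference
                      "the cat is sitting on the mat")   # candidate
print(scores["rougeL"])   # Score(precision=..., recall=..., fmeasure=...)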

references

https://medium.com/@sthanikamsanthosh1994/understanding-bleu-and-rouge-score-for-nlp-evaluation-1ab334ecadcb

Regular Expressions

Regular expressions

A formal language for specifying text strings

rules

Disjunctions:
Letters inside square brackets []: [A-Z]
pipe |: a|b|c
Negation in disjunction: [^Ss] (a caret inside the brackets means "not")

?: When placed after a character or a group, the question mark makes it optional, meaning that the character or group can occur zero or one time. When placed after a quantifier such as *, +, or ?, it makes the quantifier non-greedy (lazy): a non-greedy quantifier matches as few characters as possible, while a greedy quantifier matches as many characters as possible.
*: 0 or more of the previous char
+: 1 or more of the previous char
.: any char

Anchors:
^: The beginning. $: The end.
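
Quick illustrations of these rules with Python's re module:

import re

print(re.findall(r"[A-Z]", "Hello World"))        # disjunction of letters -> ['H', 'W']
print(re.findall(r"cat|dog", "cat and dog"))      # pipe disjunction
print(re.findall(r"[^Ss]", "Stress"))             # negation: anything except S or s
print(re.findall(r"colou?r", "color colour"))     # ? -> previous char is optional
print(re.findall(r"ba+", "b ba baaa"))            # + -> one or more of the previous char
print(re.findall(r"^The", "The end"))             # ^ anchors the match at the beginning
print(re.findall(r"end$", "The end"))             # $ anchors the match at the end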

Large Language Model

basic ideas

Zero-Shot Learning

Zero-shot learning is a setting in which your model learns how to classify classes that it hasn't seen before.

Contrastive Language-Image Pretraining (CLIP)

Just like traditional supervised models, CLIP has two stages: the training stage (learning) and the inference stage (making predictions).
In the training stage, CLIP learns about images by “reading” auxiliary text (i.e., sentences) corresponding to each image. CLIP aims to minimize the difference between the encodings of an image and its corresponding text.
In the inference stage, we set up the typical classification task by first obtaining a list of all possible labels. Each label is then encoded by the pretrained text encoder from the training stage. With the label encodings T₁ to Tₙ, we take the image we want to classify, feed it through the pretrained image encoder, and compute how similar the image encoding is to each text-label encoding using a distance metric called cosine similarity.
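
A schematic sketch of the inference stage; image_encoder and text_encoder are hypothetical stand-ins for the pretrained CLIP encoders, and the prompt template follows the "a photo of a {label}" convention from the CLIP paper:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(image, labels, image_encoder, text_encoder):
    image_embedding = image_encoder(image)                      # I₁
    prompts = [f"a photo of a {label}" for label in labels]
    text_embeddings = [text_encoder(p) for p in prompts]        # T₁ ... Tₙ
    scores = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    return labels[int(np.argmax(scores))]                       # label with the most similar encoding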

contrastive learning

Contrastive learning is a machine learning technique used to learn the general features of a dataset without labels by teaching the model which data points are similar or different. It looks at which pairs of data points are “similar” and “different” in order to learn higher-level features about the data, before even having a task such as classification or segmentation.

SimCLRv2

The entire process can be described concisely in three basic steps:

For each image in our dataset, we can perform two augmentation combinations (i.e. crop + resize + recolor, resize + recolor, crop + recolor, etc.). We want the model to learn that these two images are “similar” since they are essentially different versions of the same image.

To do so, we can feed these two images into our deep learning model (Big-CNN such as ResNet) to create vector representations for each image. The goal is to train the model to output similar representations for similar images.

Lastly, we try to maximize the similarity of the two vector representations by minimizing a contrastive loss function.
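
A compact sketch of an NT-Xent-style contrastive loss as used in SimCLR-style training (a simplified version, not the official implementation); z1 and z2 are the representations of the two augmented views of the same batch:

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                       # [2N, dim]
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))  # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)           # matching view is the positive
    return F.cross_entropy(sim, targets)

loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))   # toy usage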

Meta-learning

The idea of meta-learning is to learn the learning process.

In-context Learning

During in-context learning, we give the LM a prompt that consists of a list of input-output pairs that demonstrate a task. At the end of the prompt, we append a test input and allow the LM to make a prediction just by conditioning on the prompt and predicting the next tokens.
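
A toy example of such a prompt (the demonstrations are made up for illustration):

prompt = """Review: The movie was fantastic. Sentiment: positive
Review: I wasted two hours of my life. Sentiment: negative
Review: The plot kept me on the edge of my seat. Sentiment:"""
# the LM is expected to continue with " positive" based only on the in-context examples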

Instruction learning

Instruction learning is an idea proposed by the team led by Quoc V. Le at Google Research in the 2021 paper ‘Finetuned Language Models Are Zero-Shot Learners’. Both instruction learning and prompt learning aim to exploit the knowledge already present in language models. The difference is that prompts aim to stimulate the completion ability of the language model, such as generating the second half of a sentence based on the first half or filling in blanks, whereas instructions aim to stimulate the understanding ability of the model by providing more explicit directions, enabling it to take the correct action. The advantage of instruction learning is that, after finetuning through multitask learning, the model can also perform zero-shot learning on other tasks, while prompt learning is specific to one task, and its generalization ability is not as strong as that of instruction learning.

Diffusion Model

In machine learning, a diffusion model is a class of generative models that learns to create data by reversing a gradual noising process: during training, noise is added to the data step by step (the forward diffusion process), and the model learns to undo this corruption step by step, so that new samples can be generated by starting from pure noise and denoising it iteratively.

Stable Diffusion

Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions.

Prompt engineering

Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering focuses on crafting the optimal textual input by selecting the appropriate words, phrases, sentence structures, and punctuation.

RLHF (Reinforcement Learning from Human Feedback)

generation

Auto-regressive language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions. The length T of the word sequence is usually determined on the fly and corresponds to the timestep t=T at which the EOS token is generated from the conditional distribution.
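
Written out (in the notation of the Hugging Face generation post cited in the references), the decomposition is:

P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0), \quad \text{with } w_{1:0} = \emptyset,

where W_0 is the initial context (e.g., the prompt).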

decoding methods

Greedy search

Greedy search is the simplest decoding method. It selects the word with the highest probability as its next word.
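
With the transformers generate API (same model and model_inputs as in the examples below), greedy decoding is what you get by default when sampling and beam search are left disabled:
greedy_output = model.generate(**model_inputs, max_new_tokens=40)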

Beam search

Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Beam search will always find an output sequence with a higher probability than greedy search, but it is not guaranteed to find the most likely output.
The most common n-grams penalty makes sure that no n-gram appears twice by manually setting the probability of next words that could create an already seen n-gram to 0. Nevertheless, n-gram penalties have to be used with care: an article generated about the city New York should not use a 2-gram penalty, otherwise the name of the city would appear only once in the whole text!
When using the transformers library:
beam_output = model.generate(**model_inputs, max_new_tokens=40, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

Another important feature about beam search is that we can compare the top beams after generation and choose the generated beam that fits our purpose best.
In transformers, we simply set the parameter num_return_sequences to the number of highest scoring beams that should be returned. Make sure though that num_return_sequences <= num_beams!

sampling

In its most basic form, sampling means randomly picking the next word according to its conditional probability distribution.

temperature

A temperature parameter adjusts the probability distribution of the output. The larger the value, the flatter the distribution, i.e., the gap between high and low probabilities narrows (the model is less sure about the output); the smaller the value, the more pronounced the gap between high and low probabilities (the model is more sure about the output). As the temperature tends to 0, sampling becomes the same as greedy search.

Top-K

In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words. GPT-2 adopted this sampling scheme, which was one of the reasons for its success in story generation.

Top-P

Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words.

While in theory, Top-p seems more elegant than Top-K, both methods work well in practice. Top-p can also be used in combination with Top-K, which can avoid very low ranked words while allowing for some dynamic selection.

sample_outputs = model.generate(**model_inputs, max_new_tokens=40, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=3)

models:

LLaMA

LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.

FastChat

other models

https://github.com/baichuan-inc/baichuan-7B

LLM benchmarks

MMLU

The MMLU benchmark covers 57 general knowledge areas such as “Humanities”, “Social Sciences”, and “STEM”. Each question contains four possible options, of which only one is correct.
There are two main ways to get information from a model in order to evaluate it:
Get the output probabilities for a particular set of tokens and compare them to the alternatives in the sample;
Take the text generated by the model (produced iteratively token by token, as described above) and compare it with the alternatives in the sample.

C-Eval

A Chinese knowledge and reasoning test set covering four major fields: humanities, social sciences, natural sciences, and other disciplines. It consists of 52 subjects, including calculus, linear algebra, and more, covering topics from secondary school to university-level studies, graduate studies, and professional examinations. The test set comprises a total of 13,948 questions.

code generation benchmarks

HumanEval

HumanEval is proposed to evaluate functional correctness on a set of 164 handwritten programming problems with unit tests.
Functional correctness is measured for synthesizing programs from docstrings. Each problem includes a function signature, docstring, body, and several unit tests. The pass@k metric is used: k code samples are generated per problem, and the problem is counted as solved if any sample passes the unit tests.
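
The unbiased pass@k estimator from the HumanEval paper can be computed, for n generated samples of which c are correct, roughly as follows:

import numpy as np

def pass_at_k(n, c, k):
    # pass@k = 1 - C(n - c, k) / C(n, k), evaluated in a numerically stable way
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))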

MBPP (Mostly Basic Python Programming)

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases.

APPS(Automated Programming Progress Standard)

The APPS dataset consists of 5,000 training and 5,000 test examples of coding problems. Most of the APPS test problems are not formulated as single-function synthesis tasks, but rather as full-program synthesis. The APPS benchmark attempts to mirror how human programmers are evaluated by posing coding problems in unrestricted natural language and evaluating the correctness of solutions.

MultiPL-E

MultiPL-E is a multi-programming-language benchmark for evaluating the code generation performance of large language models (LLMs) of code.

DS-1000

A code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas.

references

https://towardsdatascience.com/understanding-zero-shot-learning-making-ml-more-human-4653ac35ccab
https://towardsdatascience.com/understanding-contrastive-learning-d5b19fd96607
http://ai.stanford.edu/blog/understanding-incontext/
https://www.8btc.com/article/6813626
https://en.wikipedia.org/wiki/Stable_Diffusion
LLaMA: Open and Efficient Foundation Language Models
https://www.promptingguide.ai/
https://medium.com/geekculture/list-of-open-sourced-fine-tuned-large-language-models-llm-8d95a2e0dc76
https://nl2code.github.io/
https://yaofu.notion.site/C-Eval-6b79edd91b454e3d8ea41c59ea2af873
https://huggingface.co/blog/zh/evaluating-mmlu-leaderboard
https://github.com/datawhalechina/hugging-llm/blob/main/content/ChatGPT%E5%9F%BA%E7%A1%80%E7%A7%91%E6%99%AE%E2%80%94%E2%80%94%E7%9F%A5%E5%85%B6%E4%B8%80%E7%82%B9%E6%89%80%E4%BB%A5%E7%84%B6.md
https://huggingface.co/blog/how-to-generate