finetuning large language models

The 3 Conventional Feature-Based and Finetuning Approaches

Feature-Based Approach

In the feature-based approach, we load a pretrained LLM and apply it to our target dataset. Here, we are particularly interested in generating the output embeddings for the training set, which we can use as input features to train a classification model.

Finetuning I – Updating The Output Layers

A popular approach related to the feature-based approach described above is finetuning the output layers (we will refer to this approach as finetuning I). Similar to the feature-based approach, we keep the parameters of the pretrained LLM frozen. We only train the newly added output layers.

Finetuning II – Updating All Layers

When optimizing for modeling performance, the gold standard for using pretrained LLMs is to update all layers.

parameter-efficient finetuning techniques (PEFT)

Parameter-efficient finetuning techniques (PEFT) aim to finetune an LLM with high modeling performance while requiring the training of only a small number of parameters. Techniques such as prefix tuning, adapters, and low-rank adaptation, all of which “modify” multiple layers, achieve much better predictive performance at low cost.

Reinforcement Learning with Human Feedback (RLHF)

In RLHF, human feedback is collected by having humans rank or rate different model outputs, providing a reward signal. The collected reward labels can then be used to train a reward model that is in turn used to guide the LLM's adaptation to human preferences.

The reward model itself is learned via supervised learning (typically using a pretrained LLM as base model). Next, the reward model is used to update the pretrained LLM that is to be adapted to human preferences — the training uses a flavor of reinforcement learning called proximal policy optimization.

prompt tuning

In a nutshell, prompt tuning (different from prompting) appends a tensor to the embedded inputs of a pretrained LLM. The tensor is then tuned to optimize a loss function for the finetuning task and data while all other parameters in the LLM remain frozen.

The main idea behind prompt tuning, and parameter-efficient finetuning methods in general, is to add a small number of new parameters to a pretrained LLM and only finetune the newly added parameters to make the LLM perform better on (a) a target dataset (for example, a domain-specific dataset like medical or legal documents) and (b) a target task (for example, sentiment classification).

references

https://magazine.sebastianraschka.com/p/finetuning-large-language-models
https://magazine.sebastianraschka.com/p/understanding-parameter-efficient
https://magazine.sebastianraschka.com/p/finetuning-llms-with-adapters

Batch processing for sequences

padding

In natural language processing (NLP), padding refers to the practice of adding special tokens to sequences (such as sentences or texts) so that all sequences in a batch have the same length. Padding is essential when working with mini-batch processing in neural networks because it ensures that all sequences in a batch can be processed simultaneously, despite their varying lengths.

Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to.
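
For example, a tokenizer from the Hugging Face transformers library produces both tensors automatically (a minimal sketch; the checkpoint name and sentences are just illustrative):

```python
# Minimal sketch using the Hugging Face `transformers` tokenizer API.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
sentences = ["I love this movie!", "Meh."]

# padding=True pads every sequence in the batch to the length of the longest one;
# the returned attention_mask marks real tokens with 1 and padding tokens with 0.
batch = tokenizer(sentences, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)   # (2, max_seq_len_in_batch)
print(batch["attention_mask"])    # 1s for real tokens, 0s for padding
```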

references

https://huggingface.co/learn/nlp-course/en/chapter2/5?fw=pt

Tokenizers

what is tokenizer

A tokenizer is a crucial component in natural language processing (NLP) and text analysis that breaks down text into smaller, manageable units called tokens. These tokens can be words, phrases, symbols, or other meaningful elements depending on the specific requirements of the application.

how tokenizer works

There are different types of tokenization methods: whitespace tokenizers, punctuation-based tokenizers, word tokenizers, sentence tokenizers, character tokenizers, n-gram tokenizers, regular expression tokenizers, and subword tokenizers.

Word Tokenizers

Word tokenization, also known as lexical analysis, is the process of splitting a piece of text into individual words or tokens. Word tokenization typically involves breaking the text into words based on spaces and punctuation.

Subword Tokenizers

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. A subword tokenizer is a type of tokenizer used in natural language processing (NLP) that breaks down words into smaller units or subwords. This approach is particularly useful for handling rare or out-of-vocabulary words, reducing the vocabulary size, and improving the efficiency of language models.

Common Subword Tokenization Methods

Byte-Pair Encoding (BPE)

BPE is an iterative algorithm that merges the most frequent pairs of characters or subwords in a corpus until a desired vocabulary size is reached.
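
A minimal sketch of one BPE merge step (illustrative toy corpus only; real implementations start from characters or bytes and repeat the merge until the target vocabulary size is reached):

```python
from collections import Counter

# Corpus represented as words split into symbols, with word frequencies (toy example).
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)   # e.g. ('l', 'o')
corpus = merge_pair(corpus, pair)   # 'l','o' -> 'lo' everywhere
```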

WordPiece Tokenization

Similar to BPE, WordPiece builds a vocabulary of subwords based on frequency, optimizing for a balance between vocabulary size and the ability to handle rare words.

SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly designed for Neural Network-based text generation systems. It treats the input text as a sequence of Unicode characters and uses a subword model to create subwords.

references

https://huggingface.co/learn/nlp-course/en/chapter2/4

Sampling

Sampling

top-p sampling

This method considers only the smallest set of tokens whose cumulative probability exceeds the threshold p and then redistributes the probability mass across those tokens so that their probabilities sum to 1.

temperature

The temperature controls the relative weights in the probability distribution, i.e., the extent to which differences in probability play a role in the sampling. As the temperature t approaches 0, this sampling technique turns into greedy search/argmax sampling, where the token with the highest probability is always selected.
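
A minimal numpy sketch of temperature scaling followed by top-p filtering (the logits, temperature, and p value are illustrative):

```python
import numpy as np

def sample_token(logits, temperature=0.8, top_p=0.9, seed=0):
    rng = np.random.default_rng(seed)

    # Temperature scaling: divide logits by t before the softmax.
    # Smaller t sharpens the distribution (t -> 0 approaches argmax).
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p (nucleus) filtering: keep the smallest set of tokens whose
    # cumulative probability exceeds top_p, then renormalize.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()

    return rng.choice(kept, p=kept_probs)

token_id = sample_token(logits=[2.0, 1.0, 0.5, -1.0])
```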

reference

https://blog.ml6.eu/why-openais-api-models-cannot-be-forced-to-behave-fully-deterministically-4934a7e8f184

Measuring sentence similarity

metrics

BLEU (Bilingual Evaluation Understudy)

BLEU computes a score based on the n-gram overlap between the generated text and the reference text, as well as the brevity penalty to handle cases where the generated text is too short. The score ranges from 0 to 1, where 1 indicates a perfect match with the reference translations.
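
A quick way to compute a sentence-level BLEU score, assuming NLTK is installed (the sentences are illustrative):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of tokenized references
candidate = ["the", "cat", "sat", "on", "the", "mat"]    # tokenized hypothesis

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)
```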

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE score measures the similarity between the machine-generated summary and the reference summaries using overlapping n-grams, word sequences that appear in both the machine-generated summary and the reference summaries. ROUGE score ranges from 0 to 1, with higher values indicating better summary quality.

ROUGE scores come in several variants, including ROUGE-N, ROUGE-L, and ROUGE-S.
ROUGE-N measures the overlap of n-grams (contiguous sequences of n words) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the n-gram overlap.
ROUGE-L measures the longest common subsequence (LCS) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the length of the LCS.
ROUGE-S measures the skip-bigram (bi-gram with at most one intervening word) overlap between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the skip-bigram overlap.
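
A sketch using the rouge_score package (assuming it is installed; the reference and candidate strings are illustrative):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("the cat was found under the bed",   # reference
                      "the cat was under the bed")         # candidate summary
print(scores["rougeL"].fmeasure)  # each entry has precision, recall, fmeasure
```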

references

https://medium.com/@sthanikamsanthosh1994/understanding-bleu-and-rouge-score-for-nlp-evaluation-1ab334ecadcb

pai - review notes

fundamentals

probability

sample space, event space, σ-algebra and probability space

A probability space is a mathematical construct that consists of three elements: the sample space (S), the event space (E), and a probability measure (P).

The sample space is the set of all possible outcomes of an experiment. The event space is a collection of subsets of the sample space; formally, it is required to be a σ-algebra. A σ-algebra is a collection of subsets of the sample space that contains the sample space itself, is closed under complementation, and is closed under countable unions. The probability measure is a function that assigns probabilities to events.

probability mass function (PMF), cumulative distribution function (CDF), probability density function (PDF)

The Probability Mass Function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to a certain value. The Cumulative Distribution Function (CDF) of a random variable X gives the probability that X takes on a value less than or equal to x.
The Probability Density Function (PDF) is applicable to continuous random variables. The total area under the PDF curve is equal to 1.
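
A small illustration with scipy.stats (the distribution parameters are arbitrary):

```python
from scipy.stats import binom, norm

# PMF of a discrete distribution: P(X = 3) for X ~ Binomial(n=10, p=0.3)
print(binom.pmf(3, n=10, p=0.3))

# CDF: P(X <= 3)
print(binom.cdf(3, n=10, p=0.3))

# PDF and CDF of a continuous distribution: X ~ N(0, 1)
print(norm.pdf(0.0), norm.cdf(0.0))   # density at 0, and P(X <= 0) = 0.5
```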

Continuous Distributions

Normal distribution (Gaussian): the Gaussian CDF cannot be expressed in closed form. Note that the mean of a Gaussian distribution coincides with the maximizer of its PDF, also called the mode of the distribution.

Joint Probability, Conditional Probability, Sum rule, Product rule (chain rule), the law of total probability

Joint probability refers to the probability of the occurrence of two or more events simultaneously. Conditional probability is the probability of one event occurring given that another event has already occurred. The sum rule, or addition rule, gives the probability of the union of two events. The product rule, also known as the chain rule, provides a way to express the joint probability of multiple events. The law of total probability is a way to express the probability of an event B by considering all possible ways in which
B can occur.

independence, conditional independence

Conditional independence does not necessarily imply unconditional independence, and vice versa.

Directed Graphical Models(Bayesian networks)

Directed graphical models (also called Bayesian networks) are often used to visually denote the (conditional) independence relationships of a large number of random variables.

Expectation, Covariance and Variance, Standard deviation, Law of total variance

Change of variables formula

Probabilistic inference

Bayes’ rule, Conjugate Priors

Gaussian, Gaussian random vector

Any affine transformation of a Gaussian random vector is a Gaussian random vector.
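
Concretely (stating the standard fact):

$$X \sim \mathcal{N}(\mu, \Sigma) \;\Rightarrow\; AX + b \sim \mathcal{N}\!\left(A\mu + b,\; A\Sigma A^\top\right)$$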

Supervised Learning and Point Estimates

Maximum Likelihood Estimation, Maximum a Posteriori Estimation

The MLE and MAP estimate can be seen as a naïve approximation of probabilistic inference, represented by a point density which “collapses” all probability mass at the mode of the posterior distribution.

exercise

affine transformation, Jacobian matrix

PML

linear regression

linear regression (MLE), ridge regression (MAP)

ridge, lasso

least absolute shrinkage and selection operator (lasso): Laplace prior, L1 regularization.
Ridge: Gaussian prior, L2 regularization.

The primary difference lies in the penalty terms: L1 regularization uses the sum of absolute values, and L2 regularization uses the sum of squared values.
L1 regularization tends to result in exact zeros, leading to sparse solutions, whereas L2 regularization generally leads to smaller, non-zero coefficients.
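
A quick comparison with scikit-learn (the data is synthetic and the regularization strengths are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)       # only two informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives many coefficients to exactly zero

print(np.round(ridge.coef_, 2))      # small but mostly non-zero coefficients
print(np.round(lasso.coef_, 2))      # sparse solution with exact zeros
```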

Bayesian linear regression (BLR)

Aleatoric and Epistemic Uncertainty

Epistemic uncertainty corresponds to the uncertainty about our model due to the lack of data. Aleatoric uncertainty is the “irreducible noise” that cannot be explained by the inputs and any model from the model class.

This decomposition follows from the law of total variance.
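
Written out for BLR at a test point x⋆ (a sketch of the standard decomposition; σₙ² is the observation noise variance and Σ the posterior covariance of the weights):

$$\mathrm{Var}[y^\star \mid x^\star, x_{1:n}, y_{1:n}]
= \underbrace{\mathbb{E}_\theta\!\left[\mathrm{Var}[y^\star \mid x^\star, \theta]\right]}_{\text{aleatoric}}
+ \underbrace{\mathrm{Var}_\theta\!\left[\mathbb{E}[y^\star \mid x^\star, \theta]\right]}_{\text{epistemic}}
= \sigma_n^2 + x^{\star\top} \Sigma\, x^\star$$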

kernel

Filtering

The process of keeping track of the state using noisy observations is also known as Bayesian filtering or recursive Bayesian estimation.

Kalman filter

A Kalman filter is simply a Bayes filter using a Gaussian distribution over the states and conditional linear Gaussians to describe the evolution of states and observations.

Gaussian Process

A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.

kernel function, feature space, RKHS, Stationarity and isotropy

A Gaussian process with a linear kernel is equivalent to Bayesian linear regression.

For ν = 1/2, the Matérn kernel is equivalent to the Laplace kernel. For ν → ∞, the Matérn kernel is equivalent to the Gaussian kernel.

Note that stationarity is a necessary condition for isotropy. In other words, isotropy implies stationarity.
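
A minimal numpy sketch of GP posterior inference with an RBF kernel (the lengthscale, noise variance, and training data are arbitrary choices):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential (Gaussian) kernel between two sets of 1-D inputs."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.1):
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    K_inv_y = np.linalg.solve(K, y_train)
    mean = K_s.T @ K_inv_y                          # posterior mean at test points
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)    # posterior covariance
    return mean, cov

X = np.array([-2.0, -1.0, 0.0, 1.5])
y = np.sin(X)
mean, cov = gp_posterior(X, y, X_test=np.linspace(-3, 3, 50))
```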

skip 4.3.4

model selection

Maximizing the Marginal Likelihood

Marginal likelihood maximization is an empirical Bayes method. Often it is simply referred to as empirical Bayes.
This approach typically avoids overfitting even though we do not use a separate training and validation set. Maximizing the marginal likelihood naturally encourages a trade-off between a large likelihood and a large prior.

Approximations

random Fourier features, Bochner’s theorem, Uniform convergence of Fourier features

inducing points method, subset of regressors (SoR) approximation, fully independent training conditional (FITC) approximation

Variational Inference

Laplace Approximation, Bayesian Logistic Regression

The Laplace approximation matches the shape of the true posterior around its mode but may not represent it accurately elsewhere — often leading to extremely overconfident predictions.

Variational family, mean-field distribution

Information Theory, Surprise, entropy, Jensen’s Inequality, Cross-entropy, KL-divergence, Forward and Reverse KL-divergence

The uniform distribution has the maximum entropy among all discrete distributions supported on {1, . . . , n}.

In words, KL(p∥q) measures the additional expected surprise when observing samples from p that is due to assuming the (wrong) distribution q.
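
For reference, the definition (discrete case), relating it to cross-entropy and entropy:

$$\mathrm{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right] = \mathrm{H}[p, q] - \mathrm{H}[p]$$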

It can be seen that the reverse KL-divergence tends to greedily select the mode and underestimate the variance, which, in this case, leads to an overconfident prediction. The forward KL-divergence, in contrast, is more conservative.

Note, however, that reverse-KL is not greedy in the same sense as Laplace approximation, as it does still take the variance into account and does not purely match the mode of p.

Minimizing Forward-KL as Maximum Likelihood Estimation, Minimizing Forward-KL as Moment Matching

First, we observe that minimizing the forward KL-divergence is equivalent to maximum likelihood estimation on an infinitely large sample size.

A Gaussian qλ minimizing KL(p∥qλ) has the same first and second moment as p.

Evidence Lower Bound, Gaussian VI vs Laplace approximation, Gradient of Evidence Lower Bound (score gradients and reparameterization trick)

Maximizing the ELBO coincides with minimizing the reverse KL-divergence.

Maximizing the ELBO selects a variational distribution q that is close to the prior distribution p(·) while also maximizing the average likelihood of the data p(y1:n | x1:n, θ) for θ ∼ q.

Note that for a noninformative prior p(·) ∝ 1, maximizing the ELBO is equivalent to maximum likelihood estimation.
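
For reference, the ELBO in the notation used here (a sketch, with prior p(·) and variational distribution q):

$$L(q) = \mathbb{E}_{\theta \sim q}\!\left[\log p(y_{1:n} \mid x_{1:n}, \theta)\right] - \mathrm{KL}\!\left(q \,\|\, p(\cdot)\right)$$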

skip 5.5.2

Markov Chain Monte Carlo Methods

The key idea of Markov chain Monte Carlo methods is to construct a Markov chain, which is efficient to simulate and has the stationary distribution p.

Markov Chains (Stationarity, convergence), Markov property, time-homogeneous Markov chains

Intuitively, the Markov property states that future behavior is independent of past states given the present state.

Stationary distribution, aperiodic, Ergodicity, Fundamental theorem of ergodic Markov chains

After entering a stationary distribution π, a Markov chain will always remain in the stationary distribution.

It can be shown that there exists a unique stationary distribution π if the Markov chain is irreducible, that is, if every state is reachable from every other state with a positive probability when the Markov chain is run for enough steps.

Even if a Markov chain has a unique stationary distribution, it need not converge to it.

In words, a Markov chain is aperiodic iff for every state x, the transition graph has a closed path from x to x with length k for all k ∈ N greater than some k0 ∈ N.

A Markov chain is ergodic iff there exists a t ∈ N0 such that for any x, x′ ∈ S we have p(t)(x′ | x) > 0.

A commonly used strategy to ensure that a Markov chain is ergodic is to add “self-loops” to every vertex in the transition graph.

An ergodic Markov chain has a unique stationary distribution π (with full support) irrespectively of the initial distribution q0. This naturally suggests constructing an ergodic Markov chain such that its stationary distribution coincides with the posterior distribution. If we then sample “sufficiently long”, Xt is drawn from a distribution that is “very close” to the posterior distribution.

How quickly does a Markov chain converge? (Total variation distance and Mixing time)

Detailed Balance Equation, Ergodic Theorem

A Markov chain that satisfies the detailed balance equation with respect to π is called reversible with respect to π.

Ergodic Theorem is a way to generalize the (strong) law of large numbers to Markov chains.

Elementary Sampling Methods, Metropolis-Hastings Algorithm, Metropolis-Hastings theorem, Gibbs Sampling

A popular example of a Metropolis-Hastings algorithm is Gibbs sampling.
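
A minimal sketch of Metropolis-Hastings with a symmetric Gaussian proposal, targeting an unnormalized density q (all parameters and the example density are illustrative):

```python
import numpy as np

def metropolis_hastings(log_q, x0, n_steps=10_000, proposal_std=1.0, seed=0):
    """Sample from p(x) proportional to exp(log_q(x)) with a Gaussian random-walk proposal."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x_prop = x + proposal_std * rng.normal()
        # Symmetric proposal, so the acceptance probability reduces to the density ratio.
        log_alpha = min(0.0, log_q(x_prop) - log_q(x))
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop
        samples.append(x)
    return np.array(samples)

# Example: unnormalized log-density of a standard Gaussian (the normalizer Z is not needed).
samples = metropolis_hastings(log_q=lambda x: -0.5 * x**2, x0=0.0)
print(samples.mean(), samples.std())
```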

Sampling as Optimization, Gibbs distribution, Langevin Dynamics, Metropolis adjusted Langevin algorithm (MALA) or Langevin Monte Carlo (LMC), unadjusted Langevin algorithm (ULA), Stochastic Gradient Langevin Dynamics, Hamiltonian Monte Carlo

A useful property is that Gibbs distributions always have full support. Observe that the posterior distribution can always be interpreted as a Gibbs distribution as long as prior and likelihood have full support.

Langevin dynamics adapts the Gaussian proposals of the Metropolis-Hastings algorithm to search the state space in an “informed” direction. The simple idea is to bias the sampling towards states with lower energy, thereby making it more likely that a proposal is accepted. A natural idea is to shift the proposal distribution in the direction of the negative gradient of the energy function. The resulting variant of Metropolis-Hastings is known as the Metropolis adjusted Langevin algorithm (MALA) or Langevin Monte Carlo (LMC).

The HMC algorithm is an instance of Metropolis-Hastings which uses momentum to propose distant points that conserve energy, with high acceptance probability.

path

target: show that the stationary distribution of a Markov chain coincides with the posterior distribution.
base: the detailed balance equation. A Markov chain that satisfies the detailed balance equation with respect to π is called reversible with respect to π, and if the Markov chain is reversible with respect to π, then π is a stationary distribution. With the posterior distribution p(x) = 1/Z q(x), substituting the posterior for π in the detailed balance equation lets us cancel Z, so we do not need to know the true posterior p to check that the stationary distribution of our Markov chain coincides with p; it suffices to know the finite measure q. However, this alone does not allow us to estimate expectations over the posterior distribution. Note that although constructing such a Markov chain allows us to obtain samples from the posterior distribution, these samples are not independent. Thus, the law of large numbers and Hoeffding’s inequality do not apply, but there is a way to generalize the (strong) law of large numbers to Markov chains: the ergodic theorem.

Next we need to consider how to construct a Markov chain with the goal of approximating samples from the posterior distribution p. One way is the Metropolis-Hastings algorithm, in which we are given a proposal distribution and use an acceptance distribution to decide whether to follow the proposal. Another way is Gibbs sampling.

MCMC techniques can be generalized to continuous random variables / vectors. Observe that the posterior distribution can always be interpreted as a Gibbs distribution as long as prior and likelihood have full support.

Langevin dynamics adapts the Gaussian proposals of the Metropolis-Hastings algorithm to search the state space in an “informed” direction. The simple idea is to bias the sampling towards states with lower energy, thereby making it more likely that a proposal is accepted. A natural idea is to shift the proposal distribution in the direction of the negative gradient of the energy function. The resulting variant of Metropolis-Hastings is known as the Metropolis adjusted Langevin algorithm (MALA) or Langevin Monte Carlo (LMC).

skip 6.3.5

BDL

Artificial Neural Networks, Activation Functions

Non-linear activation functions allow the network to approximate arbitrary (continuous) functions. This is known as the universal approximation theorem.

Bayesian Neural Networks, Heteroscedastic Noise, Variational Inference, Markov Chain Monte Carlo, SWA, SWAG, Dropout and Dropconnect, Probabilistic Ensembles

Intuitively, variational inference in Bayesian neural networks can be interpreted as averaging the predictions of multiple neural networks drawn according to the variational posterior qλ.

Calibration, expected calibration error (ECE), maximum calibration error (MCE), Histogram binning, Isotonic regression, Platt scaling, Temperature scaling

We say that a model is well-calibrated if its confidence coincides with its accuracy across many predictions. Within each bin, we compare how often the model thought the inputs belonged to the class (confidence) with how often the inputs actually belonged to the class (frequency).
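
A minimal sketch of ECE computation for the predicted class of a classifier (the bin count and inputs are illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()         # empirical frequency in the bin
            conf = confidences[mask].mean()    # average predicted confidence
            ece += mask.mean() * abs(acc - conf)
    return ece

# Predicted confidence of the chosen class and whether the prediction was correct.
ece = expected_calibration_error([0.9, 0.8, 0.95, 0.6, 0.7], [1, 1, 0, 1, 0])
print(ece)
```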

Sequential Decision-Making

Active Learning

Conditional Entropy, Joint entropy

A very intuitive property of entropy is its monotonicity: when conditioning on additional observations the entropy can never increase.

Mutual Information (information gain), the law of total expectation, data processing inequality, interaction information (Synergy and Redundancy), Submodularity of Mutual Information, Marginal gain, Submodularity, monotone

The data processing inequality says that processing cannot increase the information contained in a signal.

F is submodular iff “adding” x to the smaller set A ⊆ B yields more marginal gain than adding x to the larger set B.

I is monotone submodular.

Maximizing Mutual Information, Greedy submodular function maximization, Uncertainty Sampling, Marginal gain of maximizing mutual information, Bayesian active learning by disagreement (BALD)

skip proof of Theorem 8.15

Therefore, if f is modeled by a Gaussian and we assume homoscedastic noise, greedily maximizing mutual information corresponds to simply picking the point x with the largest variance. This algorithm is also called uncertainty sampling.

Experimental Design, Entropy Search

skip

Bayesian Optimization

Exploration-Exploitation Dilemma, Online Learning and Bandits, Multi-Armed Bandits, Regret, sublinear regret, Cesàro mean

Bayesian optimization can be interpreted as a variant of the MAB problem where there can be a potentially infinite number of actions (arms), but their rewards are correlated (because of the smoothness of the Gaussian process prior).

Acquisition Functions, Upper Confidence Bound, Bayesian confidence intervals, Regret of GP-UCB, Information gain of common kernels, Frequentist confidence intervals, probability of improvement, expected improvement (EI), Thompson Sampling, probability matching, Information-Directed Sampling

skip Information-Directed Sampling

Markov Decision Processes

Planning deals with the problem of deciding which action an agent should play in a (stochastic) environment. A key formalism for probabilistic planning in known environments are so-called Markov decision processes.

Policy, discounted payoff, state value function, state-action value function (also called Q-function)

A policy is a function that maps each state x ∈ X to a probability distribution over the actions.

Bellman Expectation Equation

Policy Evaluation, Fixed-point Iteration, contraction, Banach fixed-point theorem

Policy Optimization, Greedy Policies, Bellman Optimality Equation, Bellman’s theorem, Policy Iteration, Value Iteration,

In particular, if for every state there is a unique action that maximizes the state-action value function, the policy π⋆ is deterministic and unique.

Intuitively, the Bellman optimality equations express that the value of a state under an optimal policy must equal the expected return for the best action from that state.

Value iteration converges to an ε-optimal solution in a polynomial number of iterations. Unlike policy iteration, value iteration does not converge to an exact solution in general.
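
A minimal sketch of value iteration on a tabular MDP (the transition tensor, rewards, discount, and tolerance are all illustrative):

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, eps=1e-6):
    """P[a, x, x'] are transition probabilities, r[x, a] the expected rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(x, a) = r(x, a) + gamma * sum_x' P(x' | x, a) V(x')
        Q = r + gamma * np.einsum("axy,y->xa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)   # epsilon-optimal values and greedy policy
        V = V_new

# Toy 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # transitions under action 0
              [[0.5, 0.5], [0.0, 1.0]]])    # transitions under action 1
r = np.array([[1.0, 0.0], [0.0, 2.0]])      # r[state, action]
V, policy = value_iteration(P, r)
```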

Partial Observability,

Whereas MDPs are controlled Markov chains, POMDPs are controlled hidden Markov models.

Tabular Reinforcement Learning

Trajectories, episodic setting, continuous setting, On-policy and Off-policy Methods,

On-policy methods are used when the agent has control over its own actions; in other words, the agent can freely choose to follow any policy. Off-policy methods can be used even when the agent cannot freely choose its actions.

Model-based Approaches, Balancing Exploration and Exploitation, ε-greedy, Softmax Exploration (Boltzmann exploration), Rmax algorithm

skip Remark 11.3: Asymptotic convergence

Note that ε-greedy is GLIE with probability 1 if the sequence (εt)t∈N0 satisfies the RM-conditions (A.56), e.g., if εt = 1/t.

A significant benefit of model-based reinforcement learning is that it is inherently off-policy. That is, any trajectory, regardless of the policy used to obtain it, can be used to improve the model of the underlying Markov decision process. In the model-free setting, this is not necessarily true.

Model-free Approaches, On-policy Value Estimation, bootstrapping, temporal-difference learning, SARSA, Off-policy Value Estimation, Q-learning, optimistic Q-learning

Model-free Reinforcement Learning

Value Function Approximation, DQN, experience replay, maximization bias, Double DQN

Policy Approximation (policy gradient methods), policy value function, Policy Gradient, Score gradient estimator, Score gradients with baselines, Downstream returns, REINFORCE algorithm

The main advantage of policy gradient methods such as REINFORCE is that they can be used in continuous action spaces. However, REINFORCE is not guaranteed to find an optimal policy. Even when operating in very small domains, REINFORCE can get stuck in local optima.

On-policy Actor-Critics, Advantage Function, Policy Gradient Theorem, Actor-Critic (Q actor-critic, Online actor-critic), advantage actor-critic, bias-variance tradeoff, trust-region policy optimization (TRPO), Proximal policy optimization (PPO)

One approach in the online setting (i.e., non-episodic setting), is to simply use SARSA for learning the critic. To learn the actor, we use stochastic gradient descent with gradients obtained using single samples.

Model-based Reinforcement Learning

Planning over Finite Horizons, receding horizon control (RHC), model predictive control (MPC), Random shooting methods

cheat sheet

from book

conditional distribution for gaussian;
Gaussian process posterior;
the predictive posterior at the test point;
common kernels;
the Hessian of the logistic loss;
Surprise, entropy, Jensen’s Inequality, Cross-entropy, KL-divergence,
ELBO,
the law of large numbers and Hoeffding’s inequality,
Hoeffding bound,

from exercise

Woodbury push-through identity;
Solution to problem 3.6;

pai - Model-based Approximate Reinforcement Learning

model-based reinforcement learning

We face three main challenges in model-based reinforcement learning. First, given a fixed model, we need to perform planning to decide on which actions to play. Second, we need to learn models f and r accurately and efficiently. Third, we need to effectively trade exploration and exploitation.

Planning

Deterministic Dynamics

To begin with, let us assume that our dynamics model is deterministic and known. That is, given a state-action pair, we know the subsequent state.

pai - Model-free Approximate Reinforcement Learning

Model-free Approximate Reinforcement Learning

Tabular Reinforcement Learning as Optimization

In particular, in the tabular setting (i.e., over a discrete domain), we can parameterize the value function exactly by learning a separate parameter for each state.

Now, we cannot compute this derivative because we cannot compute the expectation. Firstly, the expectation is over the true value function, which is unknown to us. Secondly, the expectation is over the transition model, which we are trying to avoid in model-free methods. To resolve the first issue, analogously to TD-learning, instead of learning the true value function vπ which is unknown, we learn the bootstrapping estimate Vπ. To resolve the second issue, we use a Monte Carlo estimate based on a single sample. Recall that this is only possible because the transitions are conditionally independent given the state-action pair.


Therefore, TD-learning is essentially performing stochastic gradient descent using the TD-error as an unbiased gradient estimate.
Stochastic gradient descent with a bootstrapping estimate is also called stochastic semi-gradient descent.

Value Function Approximation

Our goal for large state-action spaces is to exploit the smoothness properties of the value function to condense the representation.

Heuristics

The vanilla stochastic semi-gradient descent is very slow.
There are mainly two problems.
Because we are trying to learn an approximate value function that depends on the bootstrapping estimate, the optimization target is “moving” between iterations. In practice, moving targets lead to stability issues. One technique aiming to “stabilize” the optimization targets is called neural fitted Q-iteration or deep Q-networks (DQN). DQN updates the neural network used for the approximate bootstrapping estimate only infrequently to maintain a constant optimization target across multiple episodes. One approach is to clone the neural network: we maintain one changing neural network (the “online network”) for the most recent estimate of the Q-function, parameterized by θ, and one fixed neural network (the “target network”) used as the target, parameterized by θold and updated infrequently.


Storing past transitions in a replay buffer and training on mini-batches sampled from it is known as experience replay. Another approach to stabilizing the target is Polyak averaging, where the target network is gradually “nudged” towards the online network used to estimate the Q-function.

Now, observe that the estimates Q⋆ are noisy estimates of q⋆. The fact that the update rules can be affected by inaccuracies (i.e., noise in the estimates) of the learned Q-function is known as the “maximization bias”. Double DQN (DDQN) is an algorithm that addresses this maximization bias. Instead of picking the optimal action with respect to the old network, it picks the optimal action with respect to the new network. Intuitively, this change ensures that the evaluation of the target network is consistent with the updated Q-function, which makes it more difficult for the algorithm to be affected by noise.
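
A sketch of the resulting target in the notation above (the standard Double DQN form; θ are the online-network weights, θold the target-network weights):

$$y = r + \gamma\, Q_{\theta^{\text{old}}}\!\Big(x',\, \arg\max_{a'} Q_{\theta}(x', a')\Big)$$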


Policy Approximation

Methods that find an approximate policy are also called policy search methods or policy gradient methods. Policy gradient methods use randomized policies for encouraging exploration.

Estimating Policy Values

The policy value function measures the expected discounted payoff of policy π.


Reinforce Gradient

In this context, however, we cannot apply the reparameterization trick. Fortunately, there is another way of estimating this gradient.


When using a neural network for the parameterization of the policy π, we can use automatic differentiation to obtain unbiased gradient estimates. However, it turns out that the variance of these estimates is very large. Using so-called baselines can reduce the variance dramatically.

The baseline essentially captures the expected or average value, providing a reference point. By subtracting this reference point, the updates become more focused on the deviations from the expected values, which can reduce the variance in these deviations.
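
A sketch of the score gradient with a baseline (the standard REINFORCE-with-baseline form; Gt denotes the downstream return from time t and b(xt) the baseline):

$$\nabla_\theta j(\theta) = \mathbb{E}_{\tau \sim \Pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t \big(G_t - b(x_t)\big)\, \nabla_\theta \log \pi_\theta(a_t \mid x_t)\right]$$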


Typically, policy gradient methods are slow due to the large variance in the score gradient estimates. Because of this, they need to take small steps and require many rollouts of a Markov chain. Moreover, we cannot reuse data from previous rollouts, as policy gradient methods are fundamentally on-policy.

Actor-Critic Methods

Actor-Critic methods reduce the variance of policy gradient estimates by using ideas from value function approximation. They use function approximation both to approximate value functions and to approximate policies.

Advantage Function

Advantage function: aπ(x, a) = qπ(x, a) − vπ(x).

Intuitively, the advantage function is a shifted version of the state-action value function q, centered around 0. It turns out that using this quantity instead has numerical advantages.

Policy Gradient Theorem


Intuitively, ρθ(x) measures how often we visit state x when following policy πθ. It can be thought of as a “discounted frequency”. Importantly, ρθ is not a probability distribution, as it is not normalized to integrate to 1. Instead, ρθ is what is often called a finite measure. Therefore, eq. (12.57) is not a real expectation!

On-policy Actor-Critics


OAC

Due to the use of TD-learning for learning the critic, this algorithm is fundamentally on-policy.

A2C


Note that the Q-function is an absolute quantity, whereas the advantage function is a relative quantity, where the sign is informative for the gradient direction. Intuitively, an absolute value is harder to estimate than the sign. Actor-Critic methods are therefore often implemented with respect to the advantage function rather than the Q-function.

GAE/GAAE

Taking a step back, observe that the policy gradient methods such as REINFORCE generally have high variance in their gradient estimates. However, due to using Monte Carlo estimates of Gt, the gradient estimates are unbiased. In contrast, using a bootstrapped Q-function to obtain gradient estimates yields estimates with a smaller variance, but those estimates are biased. We are therefore faced with a bias-variance tradeoff. A natural approach is therefore to blend both gradient estimates to allow for effectively trading bias and variance. This leads to algorithms such as generalized advantage estimation (GAE/GAAE).

Improving sample efficiency(TRPO/PPO)

Actor-Critic methods generally suffer from low sample efficiency. One well-known variant that slightly improves the sample efficiency is trust-region policy optimization (TRPO).

Intuitively, taking the expectation with respect to the previous policy πθk means that we can reuse data from rollouts within the same iteration. TRPO allows reusing past data as long as it can still be “trusted”. This makes TRPO “somewhat” off-policy. Fundamentally, though, TRPO is still an on-policy method. Proximal policy optimization (PPO) is a heuristic variant of TRPO that often works well in practice.

Off-policy Actor-Critics

These algorithms use the reparameterization gradient estimates, instead of score gradient estimators.

DDPG

As our method is off-policy, a simple idea in continuous action spaces is to add Gaussian noise to the action selected by πθ — also known as Gaussian noise “dithering”. This corresponds to an algorithm called deep deterministic policy gradients.
This algorithm is essentially equivalent to Q-learning with function approximation (e.g., DQN), with the only exception that we replace the maximization over actions with the learned policy πθ.

Twin delayed DDPG (TD3) is an extension of DDPG that uses two separate critic networks for predicting the maximum action and evaluating the policy. This addresses the maximization bias akin to Double-DQN. TD3 also applies delayed updates to the actor network, which increases stability.

Off-Policy Actor Critics with Randomized Policies


The algorithm that uses eq. (12.81) to obtain gradients for the critic and reparameterization gradients for the actor is called stochastic value gradients (SVG).

Off-policy Actor-Critics with Entropy Regularization

In practice, algorithms like SVG often do not explore enough. A key issue with relying on randomized policies for exploration is that they might collapse to deterministic policies.

A simple trick that encourages a little bit of extra exploration is to regularize the randomized policies “away” from putting all mass on a single action. This approach is known as entropy regularization and it leads to an analogue of Markov decision processes called entropy-regularized Markov decision process, where suitably defined regularized state-action value functions (so-called soft value functions) are used.

soft actor-critic (SAC)


references

https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665
https://github.com/tsmatz/reinforcement-learning-tutorials/blob/master/06-sac.ipynb
https://spinningup.openai.com/en/latest/algorithms/sac.html
https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d
https://lilianweng.github.io/posts/2018-04-08-policy-gradient/

pai - Tabular Reinforcement Learning

Tabular Reinforcement Learning

The Reinforcement Learning Problem

Reinforcement learning is concerned with probabilistic planning in unknown environments. In this chapter, we will begin by considering reinforcement learning with small state and action spaces. This setting is often called the tabular setting, as the value functions can be computed exhaustively for all states and stored in a table.

Clearly, the agent needs to trade exploring and learning about the environment with exploiting its knowledge to maximize rewards. In fact, Bayesian optimization can be viewed as reinforcement learning with a fixed state and a continuous action space: in each round, the agent plays an action, aiming to find the action that maximizes the reward. Another key challenge of reinforcement learning is that the observed data is dependent on the played actions.

Trajectories


Crucially, the newly observed states xt+1 and the rewards rt (across multiple transitions) are conditionally independent given the previous states xt and actions at. This independence property is crucial for being able to learn about the underlying Markov decision process. Notably, this implies that we can apply the law of large numbers (1.83) and Hoeffding’s inequality (1.87) to our estimators of both quantities.

The collection of data is commonly classified into two settings. In the episodic setting, the agent performs a sequence of “training” rounds (called episodes), and at the beginning of each round the agent is reset to some initial state. In contrast, in the continuous setting (also called the non-episodic or online setting), the agent learns online.

control

Another important distinction in how data is collected is the distinction between on-policy and off-policy control. As the names suggest, on-policy methods are used when the agent has control over its own actions; in other words, the agent can freely choose to follow any policy. In contrast, off-policy methods can be used even when the agent cannot freely choose its actions. Off-policy methods are therefore able to make use of observational data, which makes them more sample-efficient than on-policy methods. This is crucial, especially in settings where conducting experiments (i.e., collecting new data) is expensive.

On-policy learning algorithms evaluate and improve the same policy that is being used to select actions. Off-policy learning algorithms evaluate and improve a policy that is different from the policy used for action selection.

To understand the difference between on-policy learning and off-policy learning, one must first understand the difference between the behavior policy (i.e., sampling policy) and the update policy. The behavior policy is the policy an agent follows when choosing which action to take in the environment at each time step. The update policy is how the agent updates the Q-function. On-policy algorithms attempt to improve upon the current behavior policy that is used to make decisions, and therefore these algorithms learn the value of the policy carried out by the agent. Off-policy algorithms learn the value of the optimal policy and can improve upon a policy that is different from the behavior policy. Determining whether the update and behavior policy are the same or different gives us insight into whether the algorithm is on-policy or off-policy.

Model-based Approaches

Approaches to reinforcement learning are largely categorized into two classes. Model-based approaches aim to learn the underlying Markov decision process. In contrast, model-free approaches learn the value function directly.

Learning the Underlying Markov Decision Process

A natural first idea is to use maximum likelihood estimation to approximate transition and reward function.


ε-greedy Algorithm

ε-greedy: at each time step, with probability εt play an action chosen uniformly at random, and with probability 1 − εt play the greedy action with respect to the current estimates.

The key problem of ε-greedy is that it explores the state space in an uninformed manner. In other words, it explores ignoring all past experience. It thus does not eliminate clearly suboptimal actions.

Rmax Algorithm

A key principle in effectively trading exploration and exploitation is “optimism in the face of uncertainty”. Let us apply this principle to the reinforcement learning setting. The key idea is to assume that the dynamics and rewards model “work in our favor” until we have learned “good estimates” of the true dynamics and rewards.


How many transitions are “enough”? We can use Hoeffding’s inequality to get a rough idea!


challenges

Model-free Approaches

A significant benefit of model-based reinforcement learning is that it is inherently off-policy. That is, any trajectory, regardless of the policy used to obtain it, can be used to improve the model of the underlying Markov decision process. In the model-free setting, this is not necessarily true.

On-policy Value Estimation


Note that to estimate this expectation we use a single(!) sample. However, there is one significant problem with this approximation: our approximation of vπ in turn depends on the (unknown) true value of vπ. The key idea is to use a bootstrapping estimate of the value function instead. That is, in place of the true value function vπ, we use a “running estimate” Vπ. In other words, whenever observing a new transition, we use our previous best estimate of vπ to obtain a new estimate Vπ.

Crucially, using a bootstrapping estimate generally results in biased estimates of the value function. Moreover, due to relying on a single sample, the estimates tend to have very large variance.

TD-learning

The variance of the estimate is typically reduced by mixing new estimates of the value function with previous estimates using a learning rate αt. This yields the temporal-difference learning algorithm.

TD-learning update: Vπ(x) ← (1 − αt)Vπ(x) + αt(r + γVπ(x′)).

TD-learning is a fundamentally on-policy method. That is, for the estimates Vπ to converge to the true value function vπ, the transitions that are used for the estimation must follow policy π.

SARSA

SARSA update: Qπ(x, a) ← (1 − αt)Qπ(x, a) + αt(r + γQπ(x′, a′)).

Off-policy Value Estimation

This adapted update rule explicitly chooses the subsequent action a′ according to policy π, whereas SARSA absorbs this choice into the Monte Carlo approximation. The algorithm has analogous convergence guarantees to those of SARSA. Crucially, this algorithm is off-policy. As noted, the key difference from on-policy TD-learning and SARSA is that our estimate of the Q-function explicitly keeps track of the next-performed action. It does so for any action in any state.

Q-learning

Q-learning update: Q⋆(x, a) ← (1 − αt)Q⋆(x, a) + αt(r + γ maxa′ Q⋆(x′, a′)).
Crucially, the Monte Carlo approximation of eq. (11.21) does not depend on the policy. Thus, Q-learning is an off-policy method.

Optimistic Q-learning


Challenges

We have seen that both the model-based Rmax algorithm and the model-free Q-learning take time polynomial in the number of states |X| and the number of actions |A| to converge. While this is acceptable in small grid worlds, it is completely unacceptable for large state and action spaces.

references

https://core-robotics.gatech.edu/2022/02/28/bootcamp-summer-2020-week-4-on-policy-vs-off-policy-reinforcement-learning/