PCA

Why does PCA require centering?

Because without zero-meaning, PCA simply cannot be done. In essence, as a linear transformation, PCA rotates the linear space (right-multiplying the data matrix by the matrix of eigenvectors of the covariance matrix) and then replaces the original points by their projections onto a low-dimensional subspace (namely the subspace spanned by the first n_components eigenvectors), thereby achieving dimensionality reduction. Note that only a rotation is performed, with no translation, so you first have to ensure that the points in the original space are distributed around the origin; that is the purpose of zero-meaning. Besides, anyone who has implemented PCA by hand will know how explained_variance and explained_variance_ratio are computed: explained_variance is each eigenvalue of the covariance matrix divided by the number of samples, and explained_variance_ratio is each eigenvalue divided by the sum of all eigenvalues. Why is it this simple? This, too, is related to zero-meaning; if you prove PCA via the maximum-projection-length argument, this falls out naturally in the derivation, which I will not expand on here.
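A minimal numpy sketch of this recipe, under toy assumptions (the data matrix and variable names are my own; note that scikit-learn divides eigenvalues by n_samples − 1 rather than n):

```python
# Minimal PCA sketch via eigendecomposition of the (centered) scatter matrix.
# Illustrative only; the toy data and variable names are assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # toy data, samples in rows

X_centered = X - X.mean(axis=0)                 # zero-mean step (the point of centering)
scatter = X_centered.T @ X_centered             # unnormalized covariance (scatter) matrix
eigvals, eigvecs = np.linalg.eigh(scatter)      # symmetric, so use eigh
order = np.argsort(eigvals)[::-1]               # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_components = 2
Z = X_centered @ eigvecs[:, :n_components]      # rotate, then keep the first components

explained_variance = eigvals / (X.shape[0] - 1)       # scikit-learn's n_samples - 1 convention
explained_variance_ratio = eigvals / eigvals.sum()
print(explained_variance[:n_components], explained_variance_ratio[:n_components])
```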

链接:https://www.zhihu.com/question/40956812/answer/848527057

Centering is not strictly necessary for PCA; this conclusion holds provided one strictly uses the definition of the covariance matrix, $S = XX^T - n\mu\mu^T$, rather than treating $XX^T$ itself as the covariance matrix.

Centering is an important pre-processing step because it ensures that the resulting components are only looking at the variance within the dataset, and not capturing the overall mean of the dataset as an important variable (dimension). Without mean-centering, the first principal component found by PCA might correspond with the mean of the data instead of the direction of maximum variance.
https://towardsdatascience.com/tidying-up-with-pca-an-introduction-to-principal-components-analysis-f876599af383

Why is minimizing squared residuals equivalent to maximizing variance?
Consider a datapoint $x_i$ (a row of $X$). The contribution of that datapoint to the variance is $\|x_i\|^2$, its squared Euclidean length. Applying the Pythagorean theorem shows that this total variance equals the sum of variance lost (the squared residual) and variance remaining. Thus, it is equivalent to either maximize remaining variance or minimize lost variance to find the principal components.

http://alexhwilliams.info/itsneuronalblog/2016/03/27/pca/

PCA can only be interpreted as the singular value decomposition of a data matrix when the columns have first been centered by their means.
https://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia

PCA focuses on “explaining” the data matrix using the sample means plus the eigencomponents. When the column mean is far from the origin, the first right singular vector is usually quite highly correlated with the column mean - thus using PCA concentrates on the second, third and sometimes higher order singular vectors. This is a loss of information when the mean is informative for the process under study. On the other hand, when the scatterplot of the data is roughly elliptical, the PCs typically align with the major axes of the ellipse. Due to the uncorrelatedness constraint, if the mean is far from the origin, the first singular vector will be close to the mean and the others will be tilted away from the major axes of the ellipse. Thus the first singular vector will not be informative about the spread of the data, and the second and third singular vectors will not be in the most informative directions. Generally, PCA will be more informative, particularly as a method for plotting the data, than uncentered SVD.
https://online.stat.psu.edu/stat555/node/94/

Since X is zero centered we can think of them as capturing the spread of the data around the mean in a sense reminiscent of PCA.
https://intoli.com/blog/pca-and-svd/

that reconstruction error is minimized by taking as columns of W some k orthonormal vectors maximizing the total variance of the projection.
https://stats.stackexchange.com/questions/130721/what-norm-of-the-reconstruction-error-is-minimized-by-the-low-rank-approximation

PCA is a regressional model without intercept. Thus, principal components inevitably come through the origin. If you forget to center your data, the 1st principal component may pierce the cloud not along the main direction of the cloud, and will be (for statistics purposes) misleading.
https://stats.stackexchange.com/questions/22329/how-does-centering-the-data-get-rid-of-the-intercept-in-regression-and-pca

Centering brings in a big difference. PCA with centering maximizes SS deviations from the mean (i.e. variance); PCA on raw data maximizes SS deviations from the zero point.
https://stats.stackexchange.com/questions/489037/principal-components-with-and-without-centering?noredirect=1&lq=1

SVD and PCA

https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca

singular value decomposition

SVD is basically a matrix factorization technique, which decomposes any matrix into 3 generic and familiar matrices.
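A small numpy illustration of this factorization (the toy matrix is my own example):

```python
# SVD decomposes any (even non-square) matrix A into U @ diag(S) @ Vt.
import numpy as np

A = np.arange(12, dtype=float).reshape(4, 3)     # toy 4x3 matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False) # U: 4x3, S: (3,), Vt: 3x3

assert np.allclose(U @ np.diag(S) @ Vt, A)       # exact reconstruction
assert np.allclose(U.T @ U, np.eye(3))           # U has orthonormal columns
assert np.allclose(Vt @ Vt.T, np.eye(3))         # rows of Vt are orthonormal
```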

Eigenvalues and Eigenvectors

The concept of eigenvectors is applicable only for square matrices. The vector space spanned by an eigenvector is called an eigenspace.
A square matrix is called a diagonalizable matrix if it can be written in the form $A = PDP^{-1}$, where $D$ is a diagonal matrix whose diagonal entries are the eigenvalues.
A Symmetric Matrix is a matrix that is equal to its own transpose.
Special properties of a Symmetric Matrix with respect to eigenvalues and eigenvectors: it has only real eigenvalues; it is always diagonalizable; it has orthogonal eigenvectors.
A matrix is called an Orthogonal Matrix if its transpose is its inverse.
Since the eigenvectors of a Symmetric Matrix are orthogonal to each other, the matrix $P$ in the diagonalization of $A$ is an orthogonal matrix. So we say that any Symmetric Matrix is Orthogonally Diagonalizable.

https://towardsdatascience.com/singular-value-decomposition-and-its-applications-in-principal-component-analysis-5b7a5f08d0bd
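A quick numpy check of the properties listed above (the random symmetric matrix is an assumed toy example):

```python
# Verify: a symmetric matrix has real eigenvalues and an orthogonal eigenvector matrix P,
# so A = P @ diag(eigvals) @ P.T (orthogonal diagonalization).
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                       # make a symmetric matrix

eigvals, P = np.linalg.eigh(A)          # real eigenvalues; columns of P are eigenvectors
assert np.allclose(P.T @ P, np.eye(4))  # P is orthogonal: P^T = P^{-1}
assert np.allclose(P @ np.diag(eigvals) @ P.T, A)
```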

For the PCA derived from maximal preserved variance \cite{lee2007nonlinear}, we have the covariance of $\mathbf{y}$, which is
$$\mathbf{C}_{\mathbf{y}\mathbf{y}} = E\left\{\mathbf{y}\mathbf{y}^T\right\}$$
This equation is valid only when $\mathbf{y}$ is centered.
The goal of PCA is to maximize the variance of the data along each of the principal components. Centering is an important step because it ensures that the resulting components are only looking at the variance of features, and not capturing the means of the features as important. Without mean-centering, the first principal component found by PCA might correspond with the mean of the data instead of the direction of maximum variance.

One approach in the online setting (i.e., non-episodic setting) is to simply use SARSA for learning the critic. To learn the actor, we use stochastic gradient descent with gradients obtained from single samples.

Model-based Reinforcement Learning

Planning over Finite Horizons, receding horizon control (RHC), model predictive control (MPC), Random shooting methods

cheat sheet

from book

conditional distribution for gaussian;
Gaussian process posterior;
the predictive posterior at the test point;
common kernels;
the Hessian of the logistic loss;
Surprise,entropy,Jensen’s Inequality,Cross-entropy,KL-divergence,
ELBO,
the law of large numbers and Hoeffding’s inequality,
Hoeffding bound,

from exercise

Woodbury push-through identity;
Solution to problem 3.6;

pai - Model-based Approximate Reinforcement Learning

model-based reinforcement learning

We face three main challenges in model-based reinforcement learning. First, given a fixed model, we need to perform planning to decide on which actions to play. Second, we need to learn models f and r accurately and efficiently. Third, we need to effectively trade exploration and exploitation.

Planning

Deterministic Dynamics

To begin with, let us assume that our dynamics model is deterministic and known. That is, given a state-action pair, we know the subsequent state.
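With a known deterministic model, a simple planner is random shooting over a finite horizon: sample candidate action sequences, roll each one through the model, and execute only the first action of the best sequence (receding horizon). A hedged sketch; the dynamics `f`, reward `r`, and all parameters below are placeholders:

```python
# Random shooting / receding-horizon sketch for known deterministic dynamics.
# f(x, a) -> next state and r(x, a) -> reward are assumed to be given.
import numpy as np

def plan_random_shooting(x0, f, r, horizon=10, n_candidates=256, action_dim=1, rng=None):
    rng = rng or np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))  # candidate sequence
        x, total = x0, 0.0
        for a in actions:                       # roll out the (deterministic) model
            total += r(x, a)
            x = f(x, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action                    # receding horizon: execute only this action

# Toy usage: 1-d integrator, reward for staying near the origin.
f = lambda x, a: x + 0.1 * a
r = lambda x, a: -(x ** 2).sum()
a0 = plan_random_shooting(np.array([1.0]), f, r)
```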

pai - Model-free Approximate Reinforcement Learning

Model-free Approximate Reinforcement Learning

Tabular Reinforcement Learning as Optimization

In particular, in the tabular setting (i.e., over a discrete domain), we can parameterize the value function exactly by learning a separate parameter for each state.
Alt text

Alt text

Now, we cannot compute this derivative because we cannot compute the expectation. Firstly, the expectation is over the true value function, which is unknown to us. Secondly, the expectation is over the transition model, which we are trying to avoid in model-free methods. To resolve the first issue, analogously to TD-learning, instead of learning the true value function vπ which is unknown, we learn the bootstrapping estimate Vπ. To resolve the second issue, we will use a Monte Carlo estimate based on a single sample. Recall that this is only possible because the transitions are conditionally independent given the state-action pair.

Alt text

Alt text

Therefore, TD-learning is essentially performing stochastic gradient descent using the TD-error as an unbiased gradient estimate.
Stochastic gradient descent with a bootstrapping estimate is also called stochastic semi-gradient descent.

Value Function Approximation

Our goal for large state-action spaces is to exploit the smoothness properties5 of the value function to condense the representation.

Heuristics

The vanilla stochastic semi-gradient descent is very slow.
There are mainly two problems.
As we are trying to learn an approximate value function that depends on the bootstrapping estimate, this means that the optimization target is “moving” between iterations. In practice, moving targets lead to stability issues. One such technique aiming to “stabilize” the optimization targets is called neural fitted Q-iteration or deep Q-networks (DQN). DQN updates the neural network used for the approximate bootstrapping estimate infrequently to maintain a constant optimization target across multiple episodes. One approach is to clone the neural network and maintain one changing neural network (“online network”) for the most recent estimate of the Q-function which is parameterized by θ, and one fixed neural network (“target network”) used as the target which is parameterized by θold and which is updated infrequently.

Alt text

This technique is known as experience replay. Another approach is Polyak averaging where the target network is gradually “nudged” by the neural network used to estimate the Q function.

Now, observe that the estimates Q⋆ are noisy estimates of q⋆. The fact that the update rules can be affected by inaccuracies (i.e., noise in the estimates) of the learned Q-function is known as the “maximization bias”. Double DQN (DDQN) is an algorithm that addresses this maximization bias. Instead of picking the optimal action with respect to the old network, it picks the optimal action with respect to the new network. Intuitively, this change ensures that the evaluation of the target network is consistent with the updated Q-function, which makes it more difficult for the algorithm to be affected by noise.
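A sketch of the difference in how the bootstrapped targets are computed. The interface is hypothetical: `q_online(x)` and `q_target(x)` stand for the two networks and return a vector of Q-values, one per action.

```python
# Target computation: DQN vs Double DQN (DDQN); sketch only.
import numpy as np

def dqn_target(r, x_next, gamma, q_target):
    # Both action selection and evaluation use the (old) target network.
    return r + gamma * np.max(q_target(x_next))

def ddqn_target(r, x_next, gamma, q_online, q_target):
    # Select the action with the up-to-date online network ...
    a_star = int(np.argmax(q_online(x_next)))
    # ... but evaluate it with the target network, mitigating maximization bias.
    return r + gamma * q_target(x_next)[a_star]
```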

Alt text

Policy Approximation

Methods that find an approximate policy are also called policy search methods or policy gradient methods. Policy gradient methods use randomized policies for encouraging exploration.

Estimating Policy Values

Alt text
The policy value function measures the expected discounted payoff of policy π.

Alt text
Alt text

Reinforce Gradient

Alt text

Alt text
In this context, however, we cannot apply the reparameterization trick. Fortunately, there is another way of estimating this gradient.

Alt text

Alt text

When using a neural network for the parameterization of the policy π, we can use automatic differentiation to obtain unbiased gradient estimates. However, it turns out that the variance of these estimates is very large. Using so-called baselines can reduce the variance dramatically.

The baseline essentially captures the expected or average value, providing a reference point. By subtracting this reference point, the updates become more focused on the deviations from the expected values, which can reduce the variance in these deviations.
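A minimal score-gradient (REINFORCE) sketch with a baseline for a tabular softmax policy; the constant mean-return baseline and all parameters are toy assumptions, not the notes' exact formulation:

```python
# REINFORCE with a baseline for a tabular softmax policy (toy sketch).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """episode: list of (state, action, reward); theta: (n_states, n_actions) logits."""
    returns, G = [], 0.0
    for _, _, r in reversed(episode):            # discounted returns-to-go
        G = r + gamma * G
        returns.append(G)
    returns = returns[::-1]
    baseline = np.mean(returns)                  # simple constant baseline reduces variance
    for (x, a, _), G in zip(episode, returns):
        pi = softmax(theta[x])
        grad_log = -pi                           # gradient of log softmax w.r.t. logits ...
        grad_log[a] += 1.0                       # ... is one_hot(a) - pi
        theta[x] += alpha * (G - baseline) * grad_log
    return theta
```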

Alt text

Alt text

Alt text

Alt text

Typically, policy gradient methods are slow due to the large variance in the score gradient estimates. Because of this, they need to take small steps and require many rollouts of a Markov chain. Moreover, we cannot reuse data from previous rollouts, as policy gradient methods are fundamentally on-policy.

Actor-Critic Methods

Actor-Critic methods reduce the variance of policy gradient estimates by using ideas from value function approximation. They use function approximation both to approximate value functions and to approximate policies.

Advantage Function

Alt text

Intuitively, the advantage function is a shifted version of the state-action value function q, made relative to 0. It turns out that using this quantity instead has numerical advantages.

Policy Gradient Theorem

Alt text

Alt text

Intuitively, ρθ(x) measures how often we visit state x when following policy πθ. It can be thought of as a “discounted frequency”. Importantly, ρθ is not a probability distribution, as it is not normalized to integrate to 1. Instead, ρθ is what is often called a finite measure. Therefore, eq. (12.57) is not a real expectation!

On-policy Actor-Critics

Alt text

OAC

Alt text
Due to the use of TD-learning for learning the critic, this algorithm is fundamentally on-policy.

A2C

Alt text

Note that the Q-function is an absolute quantity, whereas the advantage function is a relative quantity, whose sign is informative for the gradient direction. Intuitively, an absolute value is harder to estimate than a sign. Actor-Critic methods are therefore often implemented with respect to the advantage function rather than the Q-function.

GAE/GAAE

Taking a step back, observe that the policy gradient methods such as REINFORCE generally have high variance in their gradient estimates. However, due to using Monte Carlo estimates of Gt, the gradient estimates are unbiased. In contrast, using a bootstrapped Q-function to obtain gradient estimates yields estimates with a smaller variance, but those estimates are biased. We are therefore faced with a bias-variance tradeoff. A natural approach is therefore to blend both gradient estimates to allow for effectively trading bias and variance. This leads to algorithms such as generalized advantage estimation (GAE/GAAE).
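A sketch of the generalized advantage estimator: the parameter λ interpolates between the high-variance Monte Carlo estimate (λ = 1) and the low-variance but biased one-step bootstrap (λ = 0). Variable names and defaults are assumptions.

```python
# Generalized advantage estimation (GAE): exponentially weighted sum of TD errors.
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: length-T array; values: length-(T+1) array of critic estimates V(x_t)."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae_t = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae_t = delta + gamma * lam * gae_t                      # recursive accumulation
        advantages[t] = gae_t
    return advantages
```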

Improving sample efficiency (TRPO/PPO)

Actor-Critic methods generally suffer from low sample efficiency. One well-known variant that slightly improves the sample efficiency is trust-region policy optimization (TRPO).

Alt text
Intuitively, taking the expectation with respect to the previous policy πθk , means that we can reuse data from rollouts within the same iteration. TRPO allows reusing past data as long as it can still be “trusted”. This makes TRPO “somewhat” off-policy. Fundamentally, though, TRPO is still an on-policy method. Proximal policy optimization (PPO) is a heuristic variant of TRPO that often works well in practice.
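A common PPO variant (PPO-clip) replaces the trust-region constraint with a clipped surrogate objective on the probability ratio; a minimal sketch, with shapes and the clipping parameter ε as illustrative assumptions:

```python
# PPO clipped surrogate objective (to be maximized); sketch only.
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    ratio = np.exp(log_prob_new - log_prob_old)          # pi_theta(a|x) / pi_theta_old(a|x)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))       # pessimistic (clipped) bound
```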

Off-policy Actor-Critics

These algorithms use the reparameterization gradient estimates, instead of score gradient estimators.

DDPG

As our method is off-policy, a simple idea in continuous action spaces is to add Gaussian noise to the action selected by πθ — also known as Gaussian noise “dithering”. This corresponds to an algorithm called deep deterministic policy gradients.
This algorithm is essentially equivalent to Q-learning with function approximation (e.g., DQN), with the only exception that we replace the maximization over actions with the learned policy πθ.
Alt text

Twin delayed DDPG (TD3) is an extension of DDPG that uses two separate critic networks for predicting the maximum action and evaluating the policy. This addresses the maximization bias akin to Double-DQN. TD3 also applies delayed updates to the actor network, which increases stability.
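A sketch of a TD3-style target: the bootstrapped target uses the minimum of the two target critics together with a smoothed target action. The callables and constants below are assumptions, not a reference implementation.

```python
# TD3-style target computation (sketch). q1_target, q2_target, policy_target are
# assumed callables; sigma / noise_clip values are illustrative.
import numpy as np

def td3_target(r, x_next, gamma, policy_target, q1_target, q2_target,
               sigma=0.2, noise_clip=0.5, rng=None):
    rng = rng or np.random.default_rng()
    noise = np.clip(rng.normal(0.0, sigma), -noise_clip, noise_clip)
    a_next = np.clip(policy_target(x_next) + noise, -1.0, 1.0)   # smoothed target action
    q_min = min(q1_target(x_next, a_next), q2_target(x_next, a_next))
    return r + gamma * q_min                                      # pessimistic target
```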

Off-Policy Actor Critics with Randomized Policies

Alt text

The algorithm that uses eq. (12.81) to obtain gradients for the critic and reparameterization gradients for the actor is called stochastic value gradients (SVG).

Off-policy Actor-Critics with Entropy Regularization

In practice, algorithms like SVG often do not explore enough. A key issue with relying on randomized policies for exploration is that they might collapse to deterministic policies.

A simple trick that encourages a little bit of extra exploration is to regularize the randomized policies “away” from putting all mass on a single action. This approach is known as entropy regularization and it leads to an analogue of Markov decision processes called entropy-regularized Markov decision process, where suitably defined regularized state-action value functions (so-called soft value functions) are used.
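One common formulation of the soft (entropy-regularized) state value is V(x) = E_{a∼π}[Q(x, a) − α log π(a|x)], which then replaces the ordinary value in the bootstrapped target. A tiny sketch for a discrete-action policy; α and the interface are assumptions:

```python
# Soft (entropy-regularized) state value for a discrete-action policy; sketch.
import numpy as np

def soft_value(q_values, pi_probs, alpha=0.2):
    """q_values, pi_probs: arrays over actions; alpha: temperature."""
    return float(np.sum(pi_probs * (q_values - alpha * np.log(pi_probs + 1e-12))))

def soft_td_target(r, q_values_next, pi_probs_next, gamma=0.99, alpha=0.2):
    return r + gamma * soft_value(q_values_next, pi_probs_next, alpha)
```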

Soft Actor-Critic (SAC)

Alt text

references

https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665
https://github.com/tsmatz/reinforcement-learning-tutorials/blob/master/06-sac.ipynb
https://spinningup.openai.com/en/latest/algorithms/sac.html
https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d
https://lilianweng.github.io/posts/2018-04-08-policy-gradient/

pai - Tabular Reinforcement Learning

Tabular Reinforcement Learning

The Reinforcement Learning Problem

Reinforcement learning is concerned with probabilistic planning in unknown environments. In this chapter, we will begin by considering reinforcement learning with small state and action spaces. This setting is often called the tabular setting, as the value functions can be computed exhaustively for all states and stored in a table.

Clearly, the agent needs to trade exploring and learning about the environment with exploiting its knowledge to maximize rewards. In fact, Bayesian optimization can be viewed as reinforcement learning with a fixed state and a continuous action space: in each round, the agent plays an action, aiming to find the action that maximizes the reward. Another key challenge of reinforcement learning is that the observed data is dependent on the played actions.

Trajectories

Alt text

Crucially, the newly observed states xt+1 and the rewards rt (across multiple transitions) are conditionally independent given the previous states xt and actions at. This independence property is crucial for being able to learn about the underlying Markov decision process. Notably, this implies that we can apply the law of large numbers (1.83) and Hoeffding’s inequality (1.87) to our estimators of both quantities.

The collection of data is commonly classified into two settings. In the episodic setting, the agent performs a sequence of “training” rounds (called episodes). At the beginning of each round, the agent is reset to some initial state. In contrast, in the continuous (or non-episodic, or online) setting, the agent learns online.

control

Another important distinction in how data is collected is the distinction between on-policy and off-policy control. As the names suggest, on-policy methods are used when the agent has control over its own actions; in other words, the agent can freely choose to follow any policy. In contrast, off-policy methods can be used even when the agent cannot freely choose its actions. Off-policy methods are therefore able to make use of observational data, and are hence more sample-efficient than on-policy methods. This is crucial, especially in settings where conducting experiments (i.e., collecting new data) is expensive.

On-Policy learning algorithms are the algorithms that evaluate and improve the same policy which is being used to select actions. Off-Policy learning algorithms evaluate and improve a policy that is different from Policy that is used for action selection.

To understand the difference between on-policy learning and off-policy learning, one must first understand the difference between the behavior policy (i.e., sampling policy) and the update policy. The behavior policy is the policy an agent follows when choosing which action to take in the environment at each time step. The update policy is how the agent updates the Q-function. On-policy algorithms attempt to improve upon the current behavior policy that is used to make decisions, and therefore these algorithms learn the value of the policy carried out by the agent. Off-policy algorithms learn the value of the optimal policy and can improve upon a policy that is different from the behavior policy. Determining whether the update and behavior policy are the same or different can give us insight into whether the algorithm is on-policy or off-policy.

Model-based Approaches

Approaches to reinforcement learning are largely categorized into two classes. Model-based approaches aim to learn the underlying Markov decision process. In contrast, model-free approaches learn the value function directly.

Learning the Underlying Markov Decision Process

A natural first idea is to use maximum likelihood estimation to approximate transition and reward function.

Alt text

ε-greedy Algorithm

Alt text

The key problem of ε-greedy is that it explores the state space in an uninformed manner. In other words, it explores ignoring all past experience. It thus does not eliminate clearly suboptimal actions.

Rmax Algorithm

A key principle in effectively trading exploration and exploitation is “optimism in the face of uncertainty”. Let us apply this principle to the reinforcement learning setting. The key idea is to assume that the dynamics and rewards model “work in our favor” until we have learned “good estimates” of the true dynamics and rewards.

Alt text

How many transitions are “enough”? We can use Hoeffding’s inequality to get a rough idea!

Alt text

challenges

Model-free Approaches

A significant benefit of model-based reinforcement learning is that it is inherently off-policy. That is, any trajectory, regardless of the policy used to obtain it, can be used to improve the model of the underlying Markov decision process. In the model-free setting, this is not necessarily true.

On-policy Value Estimation

Alt text

Note that to estimate this expectation we use a single(!) sample. However, there is one significant problem with this approximation: our approximation of vπ in turn depends on the (unknown) true value of vπ. The key idea is to use a bootstrapping estimate of the value function instead. That is, in place of the true value function vπ, we will use a “running estimate” Vπ. In other words, whenever observing a new transition, we use our previous best estimate of vπ to obtain a new estimate Vπ.

Crucially, using a bootstrapping estimate generally results in biased estimates of the value function. Moreover, due to relying on a single sample, the estimates tend to have very large variance.

TD-learning

The variance of the estimate is typically reduced by mixing new estimates of the value function with previous estimates using a learning rate αt. This yields the temporal-difference learning algorithm.

Alt text

TD-learning is a fundamentally on-policy method. That is, for the estimates Vπ to converge to the true value function vπ, the transitions that are used for the estimation must follow policy π.
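The resulting TD(0) update for a single observed transition (x, r, x′), written as a small sketch (tabular value table assumed):

```python
# TD(0) update for on-policy value estimation (tabular sketch).
def td_update(V, x, r, x_next, alpha=0.1, gamma=0.99):
    td_error = r + gamma * V[x_next] - V[x]   # bootstrapped target minus current estimate
    V[x] = V[x] + alpha * td_error            # mix old and new estimates with rate alpha
    return V
```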

SARSA

Alt text

Off-policy Value Estimation

Alt text
This adapted update rule explicitly chooses the subsequent action a′ according to policy π, whereas SARSA absorbs this choice into the Monte Carlo approximation. The algorithm has analogous convergence guarantees to those of SARSA. Crucially, this algorithm is off-policy. As noted, the key difference to the on-policy TD-learning and SARSA is that our estimate of the Q-function explicitly keeps track of the next-performed action. It does so for any action in any state.

Q-learning

Alt text
Crucially, the Monte Carlo approximation of eq. (11.21) does not depend on the policy. Thus, Q-learning is an off-policy method.
Alt text
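A tabular sketch of the Q-learning update for one transition (a 2-d value table is assumed):

```python
# Tabular Q-learning update (off-policy): the target maximizes over next actions.
import numpy as np

def q_learning_update(Q, x, a, r, x_next, alpha=0.1, gamma=0.99):
    """Q: 2-d array indexed by [state, action]."""
    target = r + gamma * np.max(Q[x_next])        # greedy bootstrap, independent of the policy
    Q[x, a] += alpha * (target - Q[x, a])
    return Q
```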

Optimistic Q-learning

Alt text

Challenges

We have seen that both the model-based Rmax algorithm and the model-free Q-learning take time polynomial in the number of states |X| and the number of actions |A| to converge. While this is acceptable in small grid worlds, it is completely unacceptable for large state and action spaces.

references

https://core-robotics.gatech.edu/2022/02/28/bootcamp-summer-2020-week-4-on-policy-vs-off-policy-reinforcement-learning/

pai - Markov Decision Processes

Markov Decision Processes

Planning deals with the problem of deciding which action an agent should play in a (stochastic) environment (an environment is stochastic, as opposed to deterministic, when the outcome of actions is random). A key formalism for probabilistic planning in known environments is the Markov decision process.
Alt text

Our fundamental objective is to learn how the agent should behave to optimize its reward. In other words, given its current state, the agent should decide (optimally) on the action to play. Such a decision map — whether optimal or not — is called a policy.

Alt text

Alt text

For the purpose of our discussion of Markov decision processes and reinforcement learning, we will focus on a very common reward called discounted payoff.

Alt text

Alt text

Because we assumed stationary dynamics, rewards, and policies, the discounted payoff starting from a given state x will be independent of the start time t.

Bellman Expectation Equation

Let us now see how we can compute the value function:

Alt text

This equation is known as the Bellman expectation equation, and it shows a recursive dependence of the value function on itself. The intuition is clear: the value of the current state corresponds to the reward from the next action plus the discounted sum of all future rewards obtained from the subsequent states.

Policy Evaluation

Bellman’s expectation equation tells us how we can find the value function vπ of a fixed policy π using a system of linear equations.

Alt text

Fixed-point Iteration

Alt text
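A sketch of both options for a fixed policy, assuming the policy-induced transition matrix `P_pi` (|X|×|X|) and reward vector `r_pi` are given:

```python
# Policy evaluation: exact linear solve vs. fixed-point iteration (sketch).
import numpy as np

def evaluate_policy_exact(P_pi, r_pi, gamma=0.9):
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # solve (I - gamma P) v = r

def evaluate_policy_fixed_point(P_pi, r_pi, gamma=0.9, n_iters=1000):
    v = np.zeros(len(r_pi))
    for _ in range(n_iters):
        v = r_pi + gamma * P_pi @ v            # apply the Bellman expectation operator
    return v
```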

Policy Optimization

Greedy Policies

Alt text

Bellman Optimality Equation

Alt text

Alt text
These equations are also called the Bellman optimality equations. Intuitively, the Bellman optimality equations express that the value of a state under an optimal policy must equal the expected return for the best action from that state. Bellman’s theorem is also known as Bellman’s optimality principle, which is a more general concept.

Policy Iteration

Alt text

It can be shown that policy iteration converges to an exact solution in a polynomial number of iterations. Each iteration of policy iteration requires computing the value function, which we have seen to be of cubic complexity in the number of states.

Value Iteration

Alt text
Value iteration converges to an ε-optimal solution in a polynomial number of iterations. Unlike policy iteration, value iteration does not converge to an exact solution in general. Sparsity refers to the interconnectivity of the state space: when only few states are reachable from any state, we call an MDP sparse. An iteration of value iteration can be performed in (virtually) constant time in sparse Markov decision processes.
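A tabular value-iteration sketch, assuming a reward table `r[x, a]` and transition tensor `P[x, a, x']` (names and shapes are my own):

```python
# Value iteration (tabular sketch): repeatedly apply the Bellman optimality operator.
import numpy as np

def value_iteration(P, r, gamma=0.9, eps=1e-6):
    """P: (n_states, n_actions, n_states) transition probs; r: (n_states, n_actions)."""
    v = np.zeros(P.shape[0])
    while True:
        q = r + gamma * P @ v                # q[x, a] = r(x, a) + gamma * sum_x' P[x, a, x'] v[x']
        v_new = q.max(axis=1)                # greedy over actions
        if np.max(np.abs(v_new - v)) < eps:
            return v_new, q.argmax(axis=1)   # near-optimal values and the greedy policy
        v = v_new
```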

Partial Observability

In this section, we consider how Markov decision processes can be extended to a partially observable setting where the agent can only access noisy observations Yt of its state Xt.

Alt text

Observe that a Kalman filter can be viewed as a hidden Markov model with conditional linear Gaussian motion and sensor models and a Gaussian prior on the initial state.

POMDPs can be reduced to a Markov decision process with an enlarged state space.

pai - Bayesian Optimization

Bayesian Optimization

Alt text

Exploration-Exploitation Dilemma

In Bayesian optimization, we want to learn a model of f ⋆ and use this model to optimize f ⋆ simultaneously. These goals are somewhat contrary. Learning a model of f ⋆ requires us to explore the input space while using the model to optimize f ⋆ requires us to focus on the most promising well-explored areas. This trade-off is commonly known as the exploration-exploitation dilemma.

It is common to use a so-called acquisition function to greedily pick the next point to sample based on the current model.

Online Learning and Bandits

Alt text

Multi-Armed Bandits (MAB)

The “multi-armed bandits” (MAB) problem is a classical, canonical formalization of the exploration-exploitation dilemma. In the MAB problem, we are provided with k possible actions (arms) and want to maximize our reward online within the time horizon T. We do not know the reward distributions of the actions in advance, however, so we need to trade learning the reward distribution with following the most promising action.

Bayesian optimization can be interpreted as a variant of the MAB problem where there can be a potentially infinite number of actions (arms), but their rewards are correlated (because of the smoothness of the Gaussian process prior). Smoothness means that nearby points in the input space are likely to have similar function values. Because of the smoothness property inherent in the Gaussian process prior, Bayesian optimization can make informed decisions about where to explore next in the input space. The model can leverage information from previously evaluated points to predict the behavior of the objective function at unexplored points more effectively.

Regret

The key performance metric in online learning is the regret.

Alt text

Achieving sublinear regret requires balancing exploration with exploitation. Typically, online learning (and Bayesian optimization) consider stationary environments, hence the comparison to the static optimum.

Acquisition Functions

Throughout our description of acquisition functions, we will focus on a setting where we model $f^⋆$ using a Gaussian process which we denote by f. The methods generalize to other means of learning $f^⋆$ such as Bayesian neural networks.

Alt text

One possible acquisition function is uncertainty sampling. However, this acquisition function does not at all take into account the objective of maximizing $f^⋆$ and focuses solely on exploration.

Suppose that our model f of $f^⋆$ is well-calibrated, in the sense that the true function lies within its confidence bounds. Consider the best lower bound, that is, the maximum of the lower confidence bound. Now, if the true function is really contained in the confidence bounds, it must hold that the optimum is somewhere above this best lower bound.

Therefore, we only really care what the function looks like in the regions where the upper confidence bound is larger than the best lower bound. The key idea behind the methods that we will explore is to focus exploration on these plausible maximizers.

Upper Confidence Bound

Alt text
This acquisition function naturally trades exploitation by preferring a large posterior mean with exploration by preferring a large posterior variance.
This optimization problem is non-convex in general. However, we can use approximate global optimization techniques like Lipschitz optimization (in low dimensions) and gradient ascent with random initialization (in high dimensions). Another widely used option is to sample some random points from the domain, score them according to this criterion, and simply take the best one.
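A sketch of the “sample random candidates and take the best” option for the UCB criterion, assuming a fitted Gaussian process model with a `predict(X, return_std=True)` method (as in scikit-learn's GaussianProcessRegressor); bounds, β, and the candidate count are illustrative:

```python
# UCB acquisition scored over random candidate points (sketch).
import numpy as np

def ucb_next_point(gp, bounds, beta=2.0, n_candidates=1000, rng=None):
    """gp: fitted model with predict(X, return_std=True); bounds: (d, 2) array."""
    rng = rng or np.random.default_rng()
    d = bounds.shape[0]
    candidates = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_candidates, d))
    mu, sigma = gp.predict(candidates, return_std=True)
    scores = mu + np.sqrt(beta) * sigma          # exploitation (mean) + exploration (variance)
    return candidates[np.argmax(scores)]
```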

Alt text
Observe that if the information gain is sublinear in T then we achieve sublinear regret and, in particular, converge to the true optimum. The term “sublinear” refers to a growth rate that is slower than linear.

Alt text
Intuitively, to work even if the unknown function $f^⋆$ is not contained in the confidence bounds, we use βt to re-scale the confidence bounds to enclose $f^⋆$.

Probability of Improvement

Alt text

Probability of improvement tends to be biased in favor of exploitation, as it prefers points with large posterior mean and small posterior variance.

Expected Improvement

Alt text

Intuitively, expected improvement seeks a large expected improvement (exploitation) while also preferring states with a large variance (exploration).

Thompson Sampling

Alt text

Probability matching is exploratory as it prefers points with larger variance (as they automatically have a larger chance of being optimal), but at the same time exploitative as it effectively discards points with low posterior mean and low posterior variance. Unfortunately, it is generally difficult to compute π analytically given a posterior. Instead, it is common to use a sampling-based approximation of π.

Alt text

In many cases, the randomness in the realizations of ̃ ft+1 is already sufficient to effectively trade exploration and exploitation.
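A sketch of this sampling-based approximation: draw one function realization from the posterior and pick its maximizer as the next query point (again assuming a scikit-learn-style GP, here with `sample_y`):

```python
# Thompson sampling over random candidates (sketch).
import numpy as np

def thompson_next_point(gp, bounds, n_candidates=1000, rng=None):
    rng = rng or np.random.default_rng()
    d = bounds.shape[0]
    candidates = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_candidates, d))
    f_sample = gp.sample_y(candidates, n_samples=1, random_state=0).ravel()  # one realization
    return candidates[np.argmax(f_sample)]       # its maximizer is the next point to evaluate
```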

Model Selection

Selecting a model of f⋆ is much harder than in the i.i.d. data setting of supervised learning. There are mainly the two following dangers:
- the data sets collected in active learning and Bayesian optimization are small; and
- the data points are selected dependently on prior observations.

This leads to a specific danger of overfitting. In particular, due to feedback loops between the model and the acquisition function, one may end up sampling the same point repeatedly.

Another approach that often works fairly well is to occasionally (according to some schedule) select points uniformly at random instead of using the acquisition function. This tends to prevent getting stuck in suboptimal parts of the state space.

Difference between active learning and Bayesian optimization

Problem Setting:
Active Learning: Active learning typically deals with supervised learning tasks where there is a large pool of unlabeled instances, and the algorithm decides which instances to query for labels to improve the model.
Bayesian Optimization: Bayesian optimization deals with optimization problems where the objective function is unknown, expensive to evaluate, and possibly noisy. It aims to find the global optimum with as few evaluations as possible.

Nature of Queries:
Active Learning: In active learning, the queries are often in the form of “Which instance should be labeled next?” The goal is to select instances that will most benefit the model’s learning process.
Bayesian Optimization: In Bayesian optimization, the queries are in the form of “What point should be evaluated next in the input space to maximize/minimize the objective function?” The goal is to efficiently explore the input space and find the optimal configuration.

Algorithmic Approaches:
Active Learning: Active learning involves various strategies such as uncertainty sampling, query-by-committee, and diversity sampling to select informative instances for labeling.
Bayesian Optimization: Bayesian optimization employs probabilistic surrogate models (often Gaussian processes) to model the unknown objective function. Acquisition functions guide the search to balance exploration and exploitation efficiently.

pai - active learning

Active learning

Active learning is a machine learning paradigm in which a model is trained on a dataset that is not entirely labeled. Instead of relying solely on a fully labeled dataset for training, active learning involves an iterative process where the model actively selects the most informative instances from an unlabeled pool, queries an oracle (typically a human annotator), and adds the newly labeled instances to the training set. The model is then retrained on the expanded labeled dataset.

The key idea behind active learning is to strategically choose the most valuable instances for labeling, with the goal of improving the model’s performance while minimizing the number of labeled instances needed. This process is especially beneficial when obtaining labeled data is expensive or time-consuming.

Active learning strategies vary in how they select instances for labeling. Common strategies include uncertainty sampling (select instances where the model is uncertain), query-by-committee (select instances where different model hypotheses disagree), and diversity sampling (select instances to ensure diverse coverage of the input space).

Conditional Entropy

Alt text

Intuitively, the conditional entropy of X given Y describes our average surprise about realizations of X given a particular realization of Y, averaged over all such possible realizations of Y. In other words, conditional entropy corresponds to the expected remaining uncertainty in X after we observe Y.

Alt text
That is, the joint entropy of X and Y is given by the uncertainty about X and the additional uncertainty about Y given X.

Mutual Information

Alt text

In words, we subtract the uncertainty left about X after observing Y from our initial uncertainty about X. This measures the reduction in our uncertainty in X (as measured by entropy) upon observing Y.

Alt text

Thus, the mutual information between X and Y can be understood as the approximation error (or information loss) when assuming that X and Y are independent.

Alt text

Thus, the conditional mutual information corresponds to the reduction of uncertainty in X when observing Y, given we already observed Z.

Following our introduction of mutual information, it is natural to answer the question “where should I collect data?” by saying “wherever mutual information is maximized”.

Submodularity of Mutual Information

Alt text

Alt text

That is, “adding” x to the smaller set A yields more marginal gain than adding x to the larger set B. In other words, the function F has “diminishing returns”. In this way, submodularity can be interpreted as a notion of “concavity” for discrete functions.

Alt text

Maximizing Mutual Information

Uncertainty Sampling

Alt text

Therefore, if f is modeled by a Gaussian and we assume homoscedastic noise, greedily maximizing mutual information corresponds to simply picking the point x with the largest variance. This strategy is also called uncertainty sampling.

Heteroscedastic Noise

Uncertainty sampling is clearly problematic if the noise is heteroscedastic. If there is a particular set of inputs with a large aleatoric uncertainty dominating the epistemic uncertainty, uncertainty sampling will continuously choose those points even though the epistemic uncertainty will not be reduced substantially.
Alt text

Thus, we choose locations that trade large epistemic uncertainty with large aleatoric uncertainty. Ideally, we find a location where the epistemic uncertainty is large, and the aleatoric uncertainty is low, which promises a significant reduction of uncertainty around this location.

Classification

Here, uncertainty sampling corresponds to selecting samples that maximize the entropy of the predicted label yx.

Probabilistic Artificial Intelligence - Bayesian Deep Learning

SWAG(Stochastic Weight Averaging Gaussian)

This paper proposes a different approach to Bayesian deep learning: they use the information contained in the SGD trajectory to efficiently approximate the posterior distribution over the weights of the neural network [1].

SWA

This paper shows that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training [2].

cyclical learning rate schedule

Calibration of Modern Neural Networks

Confidence calibration – the problem of predicting probability estimates representative of the true correctness likelihood – is important for classification models in many applications. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration.

references

[1] Maddox W J, Izmailov P, Garipov T, et al. A simple baseline for bayesian uncertainty in deep learning[J]. Advances in neural information processing systems, 2019, 32.
[2] Izmailov P, Podoprikhin D, Garipov T, et al. Averaging weights leads to wider optima and better generalization[J]. arXiv preprint arXiv:1803.05407, 2018.
[3] Guo C, Pleiss G, Sun Y, et al. On calibration of modern neural networks[C]//International conference on machine learning. PMLR, 2017: 1321-1330.

Probabilistic Artificial Intelligence - Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods

The key idea of Markov chain Monte Carlo methods is to construct a Markov chain, which is efficient to simulate and has the stationary distribution p.

Markov Chains

Alt text

Intuitively, the Markov property states that future behavior is independent of past states given the present state.

Markov chains can be classified into different types based on their behavior. These classifications include time-homogeneous and time-inhomogeneous Markov chains, irreducible Markov chains, and periodic and aperiodic Markov chains.

We restrict our attention to time-homogeneous Markov chains. That is, the transition probabilities do not change over time, which can be characterized by a transition function.

Irreducible Markov chains are those in which every state can be reached from any other state, while periodic Markov chains exhibit a repeating pattern in their states. On the other hand, aperiodic Markov chains do not exhibit any regular pattern in their states. If there is no fixed period at which the probabilities return to the starting values, the chain is classified as aperiodic. Aperiodic Markov chains often display a more complex and unpredictable behavior compared to periodic ones.

Stationarity

Alt text

Alt text

Convergence

Alt text

Alt text

Alt text

A Markov Chain is called ergodic, if there exists a finite t such that every state can be reached from every state in exactly t steps.

Fundamental theorem of ergodic Markov chains

Alt text

Detailed Balance Equation

Alt text

Ergodic Theorem

Alt text

This result is the fundamental reason for why Markov chain Monte Carlo methods are possible. In practice, one observes that Markov chain Monte Carlo methods have a so-called “burn-in” time during which the distribution of the Markov chain does not yet approximate the posterior distribution well. Typically, the first t0 samples are therefore discarded. It is not clear in general how T and t0 should be chosen such that the estimator is unbiased, rather they have to be tuned.

Elementary Sampling Methods

Metropolis-Hastings Algorithm

Alt text
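A minimal Metropolis-Hastings sketch with a symmetric Gaussian random-walk proposal, which only needs the unnormalized log-density (all parameters and the toy target are assumptions):

```python
# Metropolis-Hastings with a symmetric Gaussian random-walk proposal (sketch).
import numpy as np

def metropolis_hastings(log_p, x0, n_samples=5000, step=0.5, rng=None):
    """log_p: unnormalized log-density; x0: initial state (1-d array)."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        proposal = x + step * rng.normal(size=x.shape)
        # Symmetric proposal, so the acceptance probability reduces to a density ratio.
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples.append(x.copy())
    return np.array(samples)

# Toy usage: sample from a standard Gaussian.
samples = metropolis_hastings(lambda x: -0.5 * np.sum(x ** 2), np.zeros(1))
```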

Gibbs Sampling

A popular example of a Metropolis-Hastings algorithm is Gibbs sampling.

Alt text

Intuitively, by re-sampling single coordinates according to the posterior distribution given the other coordinates, Gibbs sampling finds states that are successively “more” likely.

Langevin Dynamics

Gibbs Distributions

references

https://www.collimator.ai/reference-guides/what-is-a-aperiodic-markov-chain

Probabilistic Artificial Intelligence - Variational Inference

Variational Inference

The fundamental idea behind variational inference is to approximate the true posterior distribution using a “simpler” posterior that is as close as possible to the true posterior.

where λ represents the parameters of the variational posterior q, also called variational parameters.

Laplace Approximation

A natural idea for approximating the posterior distribution is to use a Gaussian approximation (that is, a second-order Taylor approximation) around the mode of the posterior. Such an approximation is known as a Laplace approximation.

As an example, we will look at Laplace approximation in the context of Bayesian logistic regression. Bayesian logistic regression corresponds to Bayesian linear regression with a Bernoulli likelihood.

Intuitively, the Laplace approximation matches the shape of the true posterior around its mode but may not represent it accurately elsewhere — often leading to extremely overconfident predictions.

Inference with a Variational Posterior

Information Theory

Surprise

The surprise about an event with probability u is defined as $S[u] = -\log u$.
Alt text

Entropy

The entropy of a distribution p is the average surprise about samples from p. If the entropy of p is large, we are more uncertain about x ∼ p than if the entropy of p were low.

Cross-Entropy

Alt text
Cross-entropy can also be expressed in terms of the KL-divergence. Quite intuitively, the average surprise in samples from p with respect to the distribution q is given by the inherent uncertainty in p and the additional surprise that is due to us assuming the wrong data distribution q. The “closer” q is to the true data distribution p, the smaller is the additional average surprise.

Variational Families

We can view variational inference more generally as a family of approaches aiming to approximate the true posterior distribution by one that is closest (according to some criterion) among a “simpler” class of distributions.
Alt text

Kullback-Leibler Divergence

Alt text

In words, KL(p∥q) measures the additional expected surprise when observing samples from p that is due to assuming the (wrong) distribution q and which is not inherent in the distribution p already. Intuitively, the KL-divergence measures the information loss when approximating p with q.
Alt text

Forward and reverse KL-divergence

KL(p∥q) is also called the forward (or inclusive) KL-divergence. In contrast, KL(q∥p) is called the reverse (or exclusive) KL-divergence.

It can be seen that the reverse KL-divergence tends to greedily select the mode and underestimating the variance which, in this case, leads to an overconfident prediction. The forward KL-divergence, in contrast, is more conservative and yields what one could consider the “desired” approximation.

The reverse KL-divergence is typically used in practice. This is for computational reasons. In principle, if the variational family Q is “rich enough”, minimizing the forward and reverse KL-divergences will yield the same result.

There is a nice interpretation of minimizing the forward Kullback-Leibler divergence of the approximation $q_\lambda$ with respect to the true distribution p as being equivalent to maximizing the marginal model likelihood on a sample of infinite size. This interpretation is not useful in the setting where p is a posterior distribution over model parameters θ, as a maximum likelihood estimate requires samples from p, which we cannot obtain in this case.

KL-divergence of Gaussians

Alt text

Alt text

Alt text

Evidence Lower Bound

The evidence lower bound (ELBO) is a quantity whose maximization is equivalent to minimizing the KL-divergence. As its name suggests, the evidence lower bound is a (uniform) lower bound to the log-evidence $\log p(y_{1:n} \mid x_{1:n})$.
Alt text

Alt text

Note that this inequality lower bounds the logarithm of an integral by an expectation of a logarithm over some variational distribution q. Hence, the ELBO is a family of lower bounds — one for each variational distribution. Such inequalities are called variational inequalities.

Alt text

Gradient of Evidence Lower Bound

Alt text
Obtaining the gradient of the evidence lower bound is more difficult. This is because the expectation integrates over the measure $q_\lambda$, which depends on the variational parameters $\lambda$. Thus, we cannot move the gradient operator inside the expectation.

There are two main techniques which are used to rewrite the gradient in such a way that Monte Carlo sampling becomes possible. The first is to use score gradients. The second is the so-called reparameterization trick.

reparameterization trick

Alt text
Alt text

Alt text

The procedure of approximating the true posterior using a variational posterior by maximizing the evidence lower bound using stochastic optimization is also called black box stochastic variational inference. The only requirement is that we can obtain unbiased gradient estimates from the evidence lower bound (and the likelihood).
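A minimal sketch of the reparameterization trick for a diagonal Gaussian variational posterior q(θ) = N(μ, exp(ρ)²): sample ε ∼ N(0, I), set θ = μ + exp(ρ)·ε, and push gradients through θ. The gradient of the log-joint is assumed to be available, and the +1 term comes from the entropy of the Gaussian (∂H/∂ρ = 1 per dimension); the toy target is my own.

```python
# Reparameterization-trick gradient of the ELBO for a diagonal Gaussian q (sketch).
# grad_log_joint(theta) = d/d theta of log p(y, theta) is assumed to be available.
import numpy as np

def elbo_gradients(mu, rho, grad_log_joint, n_samples=16, rng=None):
    """q(theta) = N(mu, diag(exp(rho))^2); returns MC estimates of (d_mu, d_rho)."""
    rng = rng or np.random.default_rng()
    sigma = np.exp(rho)
    d_mu = np.zeros_like(mu)
    d_rho = np.zeros_like(rho)
    for _ in range(n_samples):
        eps = rng.normal(size=mu.shape)
        theta = mu + sigma * eps                  # reparameterized sample
        g = grad_log_joint(theta)                 # pathwise gradient through theta
        d_mu += g                                 # d theta / d mu = 1
        d_rho += g * sigma * eps + 1.0            # d theta / d rho = sigma * eps; +1 from entropy
    return d_mu / n_samples, d_rho / n_samples

# Toy usage: log p(y, theta) = -0.5 * ||theta - 1||^2 (up to a constant).
g_mu, g_rho = elbo_gradients(np.zeros(2), np.zeros(2), lambda th: -(th - 1.0))
```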
