Probabilistic Artificial Intelligence - Bayesian Linear Regression

Bayesian Linear Regression

Linear Regression

Given a set of (x, y) pairs, linear regression aims to find a linear model that fits the data optimally.
Given the linear model $y=w^Tx$, we want to find optimal weights $w$.
There are many ways of estimating $w$ from data, the most common being the least squares estimator:

$$\hat{w}_{\mathrm{ls}}=\operatorname*{argmin}_w\sum_{i=1}^n\left(y_i-w^\top x_i\right)^2=(X^\top X)^{-1}X^\top y$$

A slightly different estimator is used for ridge regression:

$$\hat{w}_{\mathrm{ridge}}=\operatorname*{argmin}_w\sum_{i=1}^n\left(y_i-w^\top x_i\right)^2+\lambda\|w\|_2^2=(X^\top X+\lambda I)^{-1}X^\top y$$

As the formula shows, the squared $\ell_2$ regularization term $\lambda\|w\|_2^2$ penalizes large $w$ and thus reduces the complexity of the resulting model, so ridge regression is more robust than standard linear regression in the presence of multicollinearity. Multicollinearity occurs when multiple independent inputs are highly correlated. In this case, their individual effects on the predicted variable cannot be estimated well. Classical linear regression is highly sensitive to small input changes; the regularization of ridge regression reduces this sensitivity by introducing a bias on the weights towards 0.
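As a concrete illustration, here is a minimal numpy sketch of both closed-form estimators (assuming a design matrix `X` of shape n × d and targets `y`); the near-collinear column shows how ridge stabilizes the solution:

```python
import numpy as np

def least_squares(X, y):
    # w_ls = (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    # w_ridge = (X^T X + lambda I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 1] + 1e-3 * rng.normal(size=100)   # two nearly collinear inputs
y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=100)

print(least_squares(X, y))   # unstable under multicollinearity
print(ridge(X, y, lam=1.0))  # shrunk towards 0 and far more stable
```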

Uncertainty

In practice, our data $D$ is merely a sample of the process we are modeling, and we are looking for models that generalize to unseen data.
Therefore, it is useful to express the uncertainty about our model due to the lack of data. This uncertainty is commonly referred to as the epistemic uncertainty.
Usually, there is another source of uncertainty called aleatoric uncertainty, which originates directly from the process that we are modeling. This uncertainty is the noise in the labels that cannot be explained by the inputs.

Weight-space View

The most immediate and natural Bayesian interpretation of linear regression is to simply impose a prior on the weights $w$.

Assume that the prior $w\sim\mathcal{N}(0,\sigma_p^2I)$ and the likelihood $y_i\mid x_i,w\sim\mathcal{N}(w^\top x_i,\sigma_n^2)$ are both Gaussian. The posterior distribution over the weights is then

$$p(w\mid X,y)\propto p(w)\,p(y\mid X,w)\propto\exp\Big(-\tfrac{1}{2\sigma_p^2}\|w\|_2^2-\tfrac{1}{2\sigma_n^2}\sum_{i=1}^n\big(y_i-w^\top x_i\big)^2\Big)$$

As the above is a quadratic form in $w$, it follows that the posterior distribution is a Gaussian.
This also shows that Gaussians with known variance and linear likelihood are self-conjugate: used as a prior, they yield a posterior in the same family of distributions. It can be shown more generally that Gaussians with known variance are self-conjugate to any Gaussian likelihood. For general distributions, however, the posterior will not be closed-form; this is a very special property of Gaussians.
We can compute the MAP estimate for the weights,

$$\hat{w}_{\mathrm{MAP}}=\operatorname*{argmax}_w p(w\mid X,y)=\operatorname*{argmin}_w\frac{1}{\sigma_n^2}\sum_{i=1}^n\big(y_i-w^\top x_i\big)^2+\frac{1}{\sigma_p^2}\|w\|_2^2$$

This is simply the MLE loss with an additional $\ell_2$ regularization term, and it coincides with the optimization objective of ridge regression with weight decay $\lambda=\frac{\sigma_n^2}{\sigma_p^2}$. Also, recall that the MAP estimate corresponds to the mode of the posterior distribution, i.e., the value where the posterior density reaches its maximum (the most probable parameter value given the observed data); for a Gaussian, the mode is simply its mean.
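Since the posterior is Gaussian, its mean and covariance have a closed form (a standard result). A small sketch under the notation above, where `sigma_p` and `sigma_n` are the prior and noise standard deviations:

```python
import numpy as np

def weight_posterior(X, y, sigma_p, sigma_n):
    d = X.shape[1]
    # posterior precision: X^T X / sigma_n^2 + I / sigma_p^2
    S_inv = X.T @ X / sigma_n**2 + np.eye(d) / sigma_p**2
    S = np.linalg.inv(S_inv)         # posterior covariance
    mu = S @ X.T @ y / sigma_n**2    # posterior mean = MAP estimate
    return mu, S
```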

A commonly used alternative to ridge regression is the least absolute shrinkage and selection operator (lasso), which uses $\ell_1$ regularization. It turns out that lasso can also be viewed as Bayesian learning, using a Laplace prior.

Note that using point estimates like the MAP estimate does not quantify uncertainty in the weights. The MAP estimate simply collapses all mass of the posterior around its mode. This is especially harmful when we are unsure about the best model.

Recursive Bayesian Updates

As data arrives online (i.e., in “real-time”), we can obtain the new posterior and use it to replace our prior.
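A hedged sketch of one such update, maintaining the Gaussian posterior in precision form (the algebra follows the weight posterior above; the function name is illustrative):

```python
import numpy as np

def online_update(mu, Lam, x, y, sigma_n):
    """One recursive Bayesian update for a new pair (x, y).
    mu: current posterior mean, Lam: current posterior precision."""
    Lam_new = Lam + np.outer(x, x) / sigma_n**2   # precision grows with data
    mu_new = np.linalg.solve(Lam_new, Lam @ mu + x * y / sigma_n**2)
    return mu_new, Lam_new

# Starting from the prior N(0, sigma_p^2 I) and feeding points one at a
# time yields exactly the batch posterior.
```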

Non-linear Regression

We can use linear regression not only to learn linear functions. The trick is to apply a non-linear transformation $\phi:\mathbb{R}^d\to\mathbb{R}^e$.
In polynomial regression, to learn polynomials of degree $m$ in $d$ input dimensions, we apply the nonlinear transformation $\phi(x)$ that collects all monomials of $x_1,\dots,x_d$ up to degree $m$.

The dimension of this feature space grows exponentially in the degree of the polynomial and the number of input dimensions. Even for relatively small $m$ and $d$, this becomes completely unmanageable.
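To make the growth concrete, the number of monomials of degree at most $m$ in $d$ variables is $\binom{d+m}{m}$; a two-line check:

```python
from math import comb

# number of monomials of degree <= m in d variables: C(d+m, m)
for d, m in [(3, 2), (10, 5), (100, 5)]:
    print(f"d={d}, m={m}: {comb(d + m, m)} features")
# d=100, m=5 already yields ~96.5 million features
```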

Function-space View

Previously, we interpreted Bayesian linear regression as a distribution over the weights $w$ of a linear function $\hat{f}=\boldsymbol{\Phi}w$. We can equivalently consider a distribution directly over the estimated function values. Instead of a prior over the weights $w\sim\mathcal{N}(0,\sigma_p^2I)$, we now impose a prior directly on the values of our model at the observations. Using the fact that Gaussians are closed under linear maps, we obtain the equivalent prior

$$\hat{f}=\boldsymbol{\Phi}w\sim\mathcal{N}(0,\sigma_p^2\boldsymbol{\Phi}\boldsymbol{\Phi}^\top)=\mathcal{N}(0,K),$$

where $K$ is the kernel matrix. The kernel function is:
$k(x,x^{\prime})\doteq\sigma_p^2\cdot\phi(x)^\top\phi(x^{\prime})$
The kernel matrix is a covariance matrix and the kernel function measures the covariance of the function values $k(x,x^{\prime})=\mathrm{Cov}\big[f(x),f(x^{\prime})\big].$
Moreover, note that we have reformulated the learning algorithm such that the feature space is now implicit in the choice of kernel, and the kernel is defined by inner products of (nonlinearly transformed) inputs. In other words, the choice of kernel implicitly determines the class of functions that f is sampled from, and encodes our prior beliefs. This is known as the kernel trick.
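A toy sketch of this construction (the quadratic feature map `phi` is an assumption for illustration): the kernel is an inner product of feature vectors, and evaluating it pairwise gives the prior covariance matrix $K$:

```python
import numpy as np

def phi(x):
    # assumed toy quadratic feature map, for illustration only
    return np.array([1.0, x, x**2])

def k(x, xp, sigma_p=1.0):
    # kernel = sigma_p^2 * <phi(x), phi(x')>
    return sigma_p**2 * phi(x) @ phi(xp)

xs = np.linspace(-1, 1, 5)
K = np.array([[k(a, b) for b in xs] for a in xs])   # prior covariance of f(xs)
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))       # PSD, as a covariance must be
```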

References

[1] A. Krause, “Probabilistic Artificial Intelligence”.

Probabilistic Artificial Intelligence - Gaussian Process

Gaussian Process

Gaussian distribution

The univariate Gaussian distribution is simple, so here we focus on the multivariate Gaussian distribution, in which each random variable is distributed normally and their joint distribution is also Gaussian. The multivariate Gaussian distribution is defined by a mean vector $\mu$ and a covariance matrix $\Sigma$; the covariance matrix is always symmetric and positive semi-definite. Why is the Gaussian distribution so important? Because under the assumptions of the central limit theorem, we can use it to model many events in the real world. Moreover, Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization: the distributions resulting from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable. Conditioning is the cornerstone of Gaussian processes, since it allows Bayesian inference.

Gaussian process

What is a GP

A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.
A Gaussian process is characterized by a mean function $\mu$ and a covariance function (or kernel function) k. Intuitively, a Gaussian process can be interpreted as a normal distribution over functions and is therefore often called an infinite-dimensional Gaussian.

Here’s an analogy: Consider a multivariate normal distribution over a set of points in 2D space. Each draw from this distribution corresponds to a vector of values, one for each point. Now, extend this idea to an infinite number of points, and you get a function. The Gaussian process is like having a normal distribution over all possible functions that could describe your data.

Mean and covariance functions

The prior mean function $m(⋅)$ describes the average function under the GP distribution before seeing any data. Therefore, it offers a straightforward way to incorporate prior knowledge about the function we wish to model. In the absence of this type of prior knowledge, a common choice is to set the prior mean function to zero, i.e., $m(⋅)≡0$.

The covariance function $k(x,x')$ computes the covariance $\mathrm{cov}[f(x),f(x')]$ between the corresponding function values by evaluating $k$ at the corresponding inputs $x,x'$ (the kernel trick). Practically, the covariance function encodes structural assumptions about the class of functions we wish to model. These assumptions are generally at a high level and may include periodicity or differentiability.

How GP works

For a given set of training points, there are potentially infinitely many functions that fit the data. Gaussian processes offer an elegant solution to this problem by assigning a probability to each of these functions. The goal of Gaussian processes is to learn this underlying distribution from training data. We denote the test data by $X$ and the training data by $Y$. As mentioned before, the key idea of Gaussian processes is to model the underlying distribution of $X$ together with $Y$ as a multivariate normal distribution. The essential idea of Bayesian inference is to update the current hypothesis as new information becomes available. In the case of Gaussian processes, this information is the training data. Thus, we are interested in the conditional probability $P(X \mid Y)$.

In Gaussian processes we treat each test point as a random variable, and a multivariate Gaussian distribution has the same number of dimensions as the number of random variables. Since we want to predict the function values at $|X|=N$ test points, the corresponding multivariate Gaussian distribution is $N$-dimensional. Making a prediction using a Gaussian process ultimately boils down to drawing samples from this distribution. We then interpret the $i$-th component of the resulting vector as the function value corresponding to the $i$-th test point.
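A compact sketch of this conditioning step (zero prior mean and an RBF kernel are assumptions for illustration; the conditioning formulas are the standard Gaussian ones):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, sigma_f=1.0):
    # assumed RBF kernel, evaluated pairwise between rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, sigma_n=0.1):
    K = rbf(X_train, X_train) + sigma_n**2 * np.eye(len(X_train))
    K_s = rbf(X_train, X_test)
    mu = K_s.T @ np.linalg.solve(K, y_train)                      # posterior mean
    cov = rbf(X_test, X_test) - K_s.T @ np.linalg.solve(K, K_s)   # posterior covariance
    return mu, cov

X_tr = np.array([[-1.0], [0.0], [1.0]])
y_tr = np.array([0.5, -0.2, 0.3])
X_te = np.linspace(-2, 2, 50)[:, None]
mu, cov = gp_posterior(X_tr, y_tr, X_te)
# one draw from the posterior = one plausible function through the data
f = np.random.default_rng(0).multivariate_normal(mu, cov + 1e-9 * np.eye(len(mu)))
```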

Marginal likelihood and GP training

A marginal likelihood is a likelihood function that has been integrated over the parameter space. In Bayesian statistics, it represents the probability of generating the observed sample from a prior and is therefore often referred to as model evidence or simply evidence.

The likelihood function represents the probability of observing the given data $X$ for a specific set of parameter values $\theta$ in a statistical model. It expresses how well the parameters explain the observed data. The likelihood function is a key component of frequentist statistics, where it is used to obtain maximum likelihood estimates (MLE) of the parameters.

The marginal likelihood represents the probability of observing the data X without specifying a particular set of parameters. It is obtained by integrating (or summing) the likelihood function over all possible values of the parameters, weighted by the prior distribution of the parameters. The marginal likelihood is a key concept in Bayesian statistics. It serves as a normalizing constant, ensuring that the posterior distribution integrates (or sums) to 1. It is also used in Bayesian model comparison, where different models are compared based on their marginal likelihoods.

To train the GP, we maximize the marginal likelihood with respect to the GP hyperparameters, i.e., the parameters of the mean and covariance functions, which we summarize by $\theta$.

Maximizing the marginal likelihood behaves much better than finding the maximum likelihood estimate $\operatorname*{argmax}_{f(\mathbf{X}),\sigma_n}p(\mathbf{y}\mid f(\mathbf{X}),\sigma_n^2\mathbf{I})$ or the maximum a-posteriori point estimate $\operatorname*{argmax}_{f(\mathbf{X}),\sigma_n}p(\mathbf{y}\mid f(\mathbf{X}),\sigma_n^2\mathbf{I})\,p(f(\mathbf{X})\mid\theta)$.
These two approaches lead to overfitting, since it is possible to get arbitrarily high likelihoods by placing the function values $f(X)$ on top of the observations $y$ and letting the noise $\sigma_n$ tend to zero. In contrast, the marginal likelihood does not fit function values directly, but integrates them out. By averaging out (integrating out) the direct model parameters, i.e., the function values, the marginal likelihood automatically trades off data fit and model complexity. Choose a model that is too inflexible, and the marginal likelihood $p(y\mid X,\theta)$ will be low because few functions in the prior fit the data. A model that is too flexible spreads its density over too many datasets, and so $p(y\mid X,\theta)$ will also be low.
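For concreteness, a hedged sketch of the quantity being maximized, $\log p(y \mid X, \theta)$, in the usual Cholesky-based form; note the explicit data-fit and complexity terms:

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma_n):
    """log p(y | X, theta) for K = k_theta(X, X) and noise std sigma_n."""
    n = len(y)
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                  # data fit
            - np.log(np.diag(L)).sum()        # complexity penalty: 0.5 * log|K_y|
            - 0.5 * n * np.log(2 * np.pi))    # normalization constant
```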

What is a kernel

If an algorithm is defined solely in terms of inner products in input space, then it can be lifted into feature space by replacing occurrences of those inner products by $k(x, x')$; this is sometimes called the kernel trick. This technique is particularly valuable in situations where it is more convenient to compute the kernel than the feature vectors themselves.

The kernel $k$, which is often also called the covariance function, is evaluated pairwise on all the points. The kernel receives two points $t,t' \in \mathbb{R}^n$ as input and returns a similarity measure between those points in the form of a scalar.

We evaluate this function for each pairwise combination of the test points to retrieve the covariance matrix.

Kernels can be separated into stationary and non-stationary kernels. Stationary kernels, such as the RBF kernel or the periodic kernel, are functions invariant to translations, and the covariance of two points is only dependent on their relative position. Non-stationary kernels, such as the linear kernel, do not have this constraint and depend on an absolute location.

The kernel is used to define the entries of the covariance matrix. Consequently, the covariance matrix determines which type of functions from the space of all possible functions are more probable.

A kernel function can be stationary or isotropic. A kernel function is stationary if $k(x,x')=k(x-x')$, and isotropic if $k(x,x')=k(\|x-x'\|_2)$. Isotropy implies that the covariance only depends on the distance $\|x-x'\|_2$ between the corresponding inputs, not on the location of the individual data points; stationarity only requires dependence on the difference $x-x'$. In both cases, if the inputs are close to each other, the corresponding function values are strongly correlated.
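A small numerical check of these definitions (kernel forms assumed as commonly defined):

```python
import numpy as np

# RBF is stationary/isotropic; the linear kernel is non-stationary.
rbf = lambda x, xp: np.exp(-0.5 * np.sum((x - xp) ** 2))
lin = lambda x, xp: x @ xp

x, xp, shift = np.array([1.0, 2.0]), np.array([2.0, 0.0]), np.array([5.0, 5.0])
print(rbf(x, xp) == rbf(x + shift, xp + shift))  # True: translation-invariant
print(lin(x, xp) == lin(x + shift, xp + shift))  # False: depends on location
```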

Interpretation of the hyperparameters

Stationary covariance functions typically contain the term $\frac{\tau}{l}=\frac{\|\mathbf{x}-\mathbf{x}'\|}{l}$, where $l$ is a lengthscale parameter. Longer lengthscales cause long-range correlations, whereas for short lengthscales, function values are strongly correlated only if their respective inputs are very close to each other. Short lengthscales thus allow functions to vary strongly and display more flexibility in the range of the data.

The signal variance parameter $\sigma_f^2$ allows us to say something about the amplitude of the function we model.

Training tips

The marginal likelihood is non-convex and potentially has multiple local optima. Therefore, we may end up in (bad) local optima when we use a gradient-based optimization method. In order to initialize the hyperparameters to reasonable values when we optimize the marginal likelihood, we need to align them with what we know about the data, either empirically or using prior knowledge. Assume we have training inputs $X$ and training targets $y$. The signal and noise variances can be initialized using statistics of the training targets, whereas the lengthscale parameters can be initialized using statistics of the training inputs. A reasonable initialization that works well in practice is to set the signal variance to the empirical variance of the observed function values, and the noise variance to a smaller value.

Local optima are the largest problem preventing good lengthscales from being selected through gradient-based optimisation. Generally, we can observe two different types of local optima:

Long lengthscale, large noise. Often the lengthscale is so long that the prior only allows nearly linear functions in the posterior. As a consequence, a large amount of noise is required to account for the residuals, leading to a small signal-to-noise ratio. This looks like underfitting, as non-linearities in the data are modelled as noise instead of being learned as part of the function.
Short lengthscale, low noise. Short lengthscales allow the posterior mean to fit to small variations in the data. Often such solutions are accompanied by small noise, and therefore a high signal-to-noise ratio. Such solutions look like they overfit, since the means fit the data by making drastic and fast changes, while generalizing poorly. However, the short lengthscale also prevents the predictive error bars from being small, so all predictions will be made with high uncertainty. In the probabilistic sense, this also looks like underfitting.

Which optimum we end up in depends on the initialization of our lengthscale, as we are likely to land in the local optimum nearest to our initial choice. In both cases, the optimizer is more likely to get stuck in a local optimum if the corresponding situation is a somewhat plausible explanation of the data. In practice, it is harder to escape the long-lengthscale situation, since the optimizer often struggles to get past the huge plateau that is typical for very long lengthscales.

How to choose a kernel

The choice of kernel (a.k.a. covariance function) determines almost all the generalization properties of a GP model.
In fact, you might decide that choosing the kernel is one of the main difficulties in doing inference; just as you don't know what the true parameters are, you also don't know what the true kernel is. You should probably try out a few different kernels at least, and compare their marginal likelihoods on your training data.

Others

The GP does not require any input normalization, but it can make sense to do so for numerical reasons.

References

https://distill.pub/2019/visual-exploration-gaussian-processes/
https://www.cs.toronto.edu/~duvenaud/cookbook/
https://infallible-thompson-49de36.netlify.app/
A. Krause, “Probabilistic Artificial Intelligence”.

Speech

Signal

spectrogram: A spectrogram of a time signal is a special two-dimensional representation that displays time in its horizontal axis and frequency in its vertical axis.

Short-time Fourier analysis

Why use it?
Some regions of speech signals shorter than 100 milliseconds often appear to be periodic, so that we can use the exact definition of the Fourier transform on such segments.

Spectral leakage

This phenomenon is called spectral leakage because the amplitude of one harmonic leaks over the rest and masks its value.

Feature extraction

Representation of speech signals in the frequency domain is especially useful because the frequency structure of a phoneme is generally unique.

Sinusoids are important because speech signals can be decomposed as sums of sinusoids.

For voiced sounds there is typically more energy at low frequencies than at high frequencies, which is also called roll-off. To make spectrograms easier to read, the signal is sometimes first preemphasized (typically with a first-order difference FIR filter) to boost the high frequencies and counter the roll-off of natural speech.

Digital Systems

Linear Time-Invariant Systems and Linear Time-Varying Systems.

The Fourier Transform

Z-Transform

Digital filter

Filterbank

A filterbank is a collection of filters that span the whole frequency spectrum.

Short-time analysis

Network Representation Learning

Background

Notes recorded while studying at KTH. This is the first blog post about network representation learning, a project for the Machine Learning, Advanced Course.

LINE

This post reproduces the paper "LINE: Large-scale Information Network Embedding".

Alias Table Method

It's a method for efficiently drawing samples from a discrete distribution; see the sketch after the references.
References:
https://www.keithschwarz.com/darts-dice-coins/
https://blog.csdn.net/haolexiao/article/details/65157026
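A compact Python sketch of the alias method, following the Vose construction described in the first link (O(n) setup, O(1) per sample):

```python
import random

def build_alias(probs):
    """Vose's O(n) construction of the alias table."""
    n = len(probs)
    scaled = [p * n for p in probs]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    prob, alias = [0.0] * n, [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l     # column s: keep s w.p. scaled[s], else l
        scaled[l] -= 1.0 - scaled[s]         # donate mass from the large entry
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                  # leftovers are exactly 1 (up to rounding)
        prob[i] = 1.0
    return prob, alias

def sample(prob, alias):
    """O(1) draw: pick a column uniformly, then a biased coin flip."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias([0.5, 0.3, 0.2])
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample(prob, alias)] += 1
print(counts)   # roughly proportional to [0.5, 0.3, 0.2]
```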

Negative Sampling

word2vec

Original paper: Efficient Estimation of Word Representations in Vector Space.
Reference: word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method.

Skip-Gram Model

Original paper: Distributed Representations of Words and Phrases and their Compositionality.
The idea behind the word2vec models is that words that appear in the same context (near each other) should have similar word vectors. Therefore, we should include some notion of similarity in our objective when training the model. This is done using the dot product, since when vectors are similar, their dot product is large.
Reference:
https://www.baeldung.com/cs/nlps-word2vec-negative-sampling
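A minimal sketch of the negative-sampling objective for one (word, context) pair, following the Goldberg & Levy note above (vector names are illustrative):

```python
import numpy as np

def sgns_loss(w, c_pos, c_negs):
    """Negative-sampling loss for one (word, context) pair plus k negatives."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    loss = -np.log(sigmoid(w @ c_pos))          # pull the true pair together
    for c_neg in c_negs:
        loss -= np.log(sigmoid(-(w @ c_neg)))   # push sampled negatives apart
    return loss
```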

GraphSAGE

Machine Learning (4) - Uplift Models

Uplift models

An uplift model is one of the most mature algorithms in industry combining causal inference with machine learning. A traditional supervised model predicts a target $y$ from an input $x$, whereas an uplift model focuses on the effect that a change in $x$ has on $y$. Taking advertising as an example: traditional supervised learning predicts whether a user will click given the user's features, while an uplift model focuses on the effect that showing this customer an ad (or not) has on whether the customer buys the advertised product.

Causal inference

Causal inference is the process of inferring causal relationships between variables from observed data. In statistics and data science, causal inference involves trying to understand which event or behavior caused another event or behavior. Unlike correlation or association, causal inference attempts to determine whether a change in one variable directly causes a change in another.

Deep Learning (1) - Gradient Descent

Gradient descent

The basic idea of gradient descent is to move along the direction of the negative gradient of the function until a local minimum is found.

Using a Taylor expansion, we get

$$f(x+\epsilon) \approx f(x) + \epsilon f'(x).$$

If we take $\epsilon = -\eta f'(x)$, we have

$$f(x-\eta f'(x)) \approx f(x) - \eta f'^2(x).$$

Since $\eta f'^2(x)>0$, when $\eta$ is small enough we have $f(x-\eta f'(x)) \lesssim f(x)$,
so the update $x \leftarrow x - \eta f'(x)$ decreases the value of $f(x)$.
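A one-dimensional sketch of this update rule on the toy objective $f(x)=x^2$ (so $f'(x)=2x$):

```python
# gradient descent on f(x) = x^2, where f'(x) = 2x
f_grad = lambda x: 2 * x
x, eta = 10.0, 0.1
for _ in range(50):
    x -= eta * f_grad(x)   # x <- x - eta * f'(x)
print(x)                   # close to the minimizer x* = 0
```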

Newton's method

Newton's method is an optimization method based on a second-order Taylor expansion; it uses the Hessian matrix to accelerate convergence.

By Taylor expansion, we have

$$f(x+\epsilon) \approx f(x) + \epsilon f'(x) + \frac{1}{2}\epsilon^2 f''(x).$$

At a minimum of $f$ we have $f'(x)=0$; setting the derivative of the expansion with respect to $\epsilon$ to zero gives

$$\epsilon = -\frac{f'(x)}{f''(x)}.$$

Stochastic gradient descent

With gradient descent on a training set of $n$ samples, every step requires computing the loss gradient for each of the $n$ samples, and this computational burden grows with $n$. Stochastic gradient descent therefore computes the gradient on a single randomly drawn sample at each step.
Since

$$\mathbb{E}_i[\nabla f_i(x)] = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x) = \nabla f(x),$$

the stochastic gradient can be considered, on average, a good estimate of the full gradient.

Minibatch stochastic gradient descent

In minibatch stochastic gradient descent, as in stochastic gradient descent, all elements are drawn uniformly at random from the training set, so the expectation of the gradient stays unchanged while its variance is significantly reduced.
In general, minibatch stochastic gradient descent is faster than both stochastic gradient descent and full gradient descent, with a smaller risk of poor convergence.

Momentum

The momentum method is an optimization algorithm that improves the performance of gradient descent by taking a weighted average of past gradients. Its basic idea is that each iteration considers not only the current gradient direction but also the direction of past gradients, which smooths the update direction, helps accelerate convergence, and reduces oscillation.

AdaGrad

The core of AdaGrad is the dynamic adjustment of a per-parameter learning rate. Each parameter's learning rate is adjusted according to the sum of the squares of that parameter's historical gradients: if a parameter's historical gradients have varied greatly, its learning rate is correspondingly reduced; conversely, if a parameter's historical gradients have been small, its learning rate is correspondingly increased.

RMSProp

RMSProp (Root Mean Square Propagation) is an adaptive learning-rate optimization algorithm designed to overcome the overly fast decay of the learning rate in methods such as AdaGrad. RMSProp computes an exponential moving average (EMA) of the squared gradients to dynamically adjust each parameter's learning rate. To some extent, this solves AdaGrad's problem of the learning rate shrinking too early due to the accumulated sum of squared historical gradients.

Adadelta

Adam

Adam combines the advantages of AdaGrad and RMSProp while addressing their respective shortcomings. Adam dynamically adjusts each parameter's learning rate using a first-moment estimate of the gradient (i.e., momentum) and a second-moment estimate (i.e., the exponential moving average of squared gradients, as in RMSProp).
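For concreteness, here are hedged one-step sketches of the update rules discussed above, for a parameter vector `w` and gradient `g` (hyperparameter names are the conventional ones):

```python
import numpy as np

def momentum_step(w, g, v, eta=0.01, beta=0.9):
    v = beta * v + g                      # weighted average of past gradients
    return w - eta * v, v

def adagrad_step(w, g, s, eta=0.01, eps=1e-8):
    s = s + g**2                          # accumulated squared gradients
    return w - eta * g / (np.sqrt(s) + eps), s

def rmsprop_step(w, g, s, eta=0.01, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * g**2        # EMA instead of a full sum
    return w - eta * g / (np.sqrt(s) + eps), s

def adam_step(w, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g             # first moment (momentum)
    v = b2 * v + (1 - b2) * g**2          # second moment (RMSProp-style)
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)   # bias correction, t >= 1
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v
```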

References

https://zh.d2l.ai/chapter_optimization/gd.html

Reinforcement Learning (1) - Markov Decision Processes

Markov Decision Process (MDP)

A Markov decision process (MDP) is an important concept in reinforcement learning: a mathematical model that describes how an agent makes decisions in an uncertain environment.

An MDP has several key elements: a state space, an action space, a state transition function, a reward function, and a discount factor.

Markov property: the current state completely characterizes the process.

The goal of an MDP is to find an optimal policy that maximizes the agent's cumulative future reward (return) from any initial state. For every finite Markov decision process there exists an optimal policy that is no worse than every other possible policy.

Value iteration

Value function

This is called the "value function" for the policy $\pi$:

$$V^\pi(s_0) = \mathbb{E}_{a_t \sim \pi(s_t)}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\Big].$$

The key idea behind all algorithms in reinforcement learning is that the value of a state can be written as the average reward obtained in the first stage plus the value function averaged over all possible next states.

Action-value function

This is defined to be the average return of a trajectory that begins at $s_0$, but with the action of the first stage fixed to be $a_0$.

Optimal policy

Value iteration

The main idea behind the Value Iteration algorithm is to use the principle of dynamic programming to find the optimal average return obtained from a given state. Note that implementing the Value Iteration algorithm requires that we know the Markov decision process (MDP), e.g., the transition and reward functions, completely.
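A tabular sketch of Value Iteration under these assumptions (the transition tensor `P[s, a, s']` and reward matrix `R[s, a]` encode the known MDP):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V               # Q[s, a] = R[s, a] + gamma * E[V(s')]
        V_new = Q.max(axis=1)               # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and greedy policy
        V = V_new
```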

Bellman equation

The Bellman equation is an important concept in reinforcement learning and dynamic programming. It describes the relationship satisfied by the state-value function or the action-value function in a Markov decision process (MDP).

Q-learning

Q-learning is an algorithm to learn the value function without necessarily knowing the MDP.
Q-learning is a reinforcement learning algorithm in the model-free family, because it can learn without an exact model of the environment. The goal of Q-learning is to learn an action-value function, with an update rule based on the Bellman equation.

To learn this action-value function, we define a loss function that measures the difference between a given Q-function and the ideal Q-function predicted by the Bellman equation.

We can minimize this loss function via gradient descent.
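A minimal sketch of the resulting tabular update (taking one gradient step on the squared temporal-difference error recovers this rule; `alpha` is the learning rate):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update from transition (s, a, r, s_next)."""
    td_target = r + gamma * max(Q[s_next])    # Bellman-predicted value
    Q[s][a] += alpha * (td_target - Q[s][a])  # move towards the target
    return Q
```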

References

https://d2l.ai/chapter_reinforcement-learning/value-iter.html

https://d2l.ai/chapter_reinforcement-learning/qlearning.html

Probability Theory (1) - Prior, Posterior, and Likelihood

Prior, posterior, and likelihood

Today I came across a nice intuitive explanation of priors, posteriors, and likelihoods: the prior probability is statistical experience based on historical data, the posterior probability is the probability of a cause given that the outcome is known, and the likelihood is the probability of an outcome given the cause.

Following this explanation, suppose we have a dataset that follows some distribution; we can think of it as a black-box model with many parameters. The likelihood is then the probability of obtaining a given sample when the parameters are known, and the posterior probability is the probability of the parameters given a sample.

To understand this better, consider the famous Bayes formula:

$$P(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{p(x)},$$

where $p(\theta)$ is the prior probability, $P(\theta \mid x)$ is the posterior probability, and $p(x \mid \theta)$ is the likelihood function.

Let us distinguish two concepts here: for $p(x \mid \theta)$, if $\theta$ is known and fixed and $x$ is the variable, the function is called a probability function; if $x$ is known and fixed and $\theta$ is the variable, it is called a likelihood function.

Maximum likelihood estimation (MLE)

Maximum likelihood means making the likelihood as large as possible: learning on a dataset amounts to finding the model parameters that maximize the probability of the currently observed samples. The purpose of maximum likelihood estimation is thus to infer, from known sample outcomes, the parameter values most likely to have produced them. The applicable setting is "model known, parameters unknown". An important premise is that the samples in the dataset are independent and identically distributed random variables, because only then does the likelihood of the sample set equal the product of the likelihoods of the individual samples.

Suppose the training sample set is $D=\{x_1, x_2, \cdots, x_N\}$ and we wish to estimate the parameter vector $\theta$. Then

$$l(\theta)=p(D \mid \theta)=p(x_1, x_2, \cdots, x_N \mid \theta)=\prod_{i=1}^{N} p(x_i \mid \theta),$$

and the parameter value that maximizes the likelihood is found as

$$\hat{\theta} = \operatorname*{argmax}_\theta\, l(\theta) = \operatorname*{argmax}_\theta \sum_{i=1}^{N} \ln p(x_i \mid \theta).$$

Maximum a posteriori estimation (MAP)

Maximum a posteriori estimation differs from maximum likelihood estimation in that it introduces the prior probability. Combining the Bayes formula with maximum likelihood estimation, maximum a posteriori estimation becomes

$$\hat{\theta}_{\mathrm{MAP}} = \operatorname*{argmax}_\theta\, p(x \mid \theta)\, p(\theta).$$

L2 regularization corresponds to adding a Gaussian prior, and L1 regularization corresponds to adding a Laplace prior.

Bayesian estimation

In MLE and MAP the model parameter $\theta$ is assumed to be unknown but fixed, so both are point estimates. In Bayesian estimation, the model parameter is instead assumed to be an unknown random variable rather than a fixed value; the result is not a single value but a whole distribution, and the expectation of this distribution is used as the final parameter value.

Summary

Finally, let us summarize the main content of this post with an excerpt from a well-known set of lecture notes:

References

贝叶斯估计、最大似然估计、最大后验概率估计 (Bayesian estimation, maximum likelihood estimation, and maximum a posteriori estimation)

极大似然估计、最大后验估计 (Maximum likelihood estimation and maximum a posteriori estimation)