Probabilistic Artificial Intelligence - Bayesian Linear Regression

Bayesian Linear Regression

Linear Regression

Given a set of (x, y) pairs, linear regression aims to find a linear model that fits the data optimally.
Given the linear model $y=w^Tx$, we want to find optimal weights $w$.
There are many ways of estimating w from data, the most common being the least squares estimator:

A slightly different estimator is used for ridge regression:

As the formula shows, the squared $l_2$ regularization term $\lambda|w|_2^2$ penalizes large $w$ and thus reduces the complexity of the resulting model,so Ridge regression is more robust than standard linear regression in the presence of multicollinearity. Multicollinearity occurs when multiple independent inputs are highly correlated. In this case, their individual effects on the predicted variable cannot be estimated well. Classical linear regression is highly volatile to small input changes. The regularization of ridge regression reduces this volatility by introducing a bias on the weights towards 0.

uncertainty

In practice, our data D is merely a sample of the process we are modeling. In these cases, we are looking for models that generalize to unseen data.
Therefore, it is useful to express the uncertainty about our model due to the lack of data. This uncertainty is commonly referred to as the epistemic uncertainty.
Usually, there is another source of uncertainty called aleatoric uncertainty, which originates directly from the process that we are modeling. This uncertainty is the noise in the labels that cannot be explained by the inputs.

Weight-space View

The most immediate and natural Bayesian interpretation of linear regression is to simply impose a prior on the weights $w$.

Assume that prior $w\sim\mathcal{N}(0,\sigma_p^2I)$ and likelihood $y_i\mid x_i,w\sim\mathcal{N}(w^\top x_i,\sigma_n^2).$ are both Gaussian, we will get the posterior distribution over the weights as:

As the above is a quadratic form in $w$, it follows that the posterior distribution is a Gaussian.
This also shows that Gaussians with known variance and linear likelihood are self-conjugate. A distribution is said to be self-conjugate (or a conjugate prior is self-conjugate) if, when used as a prior distribution, it results in a posterior distribution that belongs to the same family of distributions. It can be shown more generally that Gaussians with known variance are self-conjugate to any Gaussian likelihood. For general distributions the posterior will not be closed-form. This is a very special property of Gaussians.
We can compute the MAP estimate for the weights,

we can find that this is simply the MLE loss with an additional $l_2$ regularization term and this coincides with the optimization objective of ridge regression with weight decay $\lambda=\frac{\sigma_n^2}{\sigma_p^2}$. Also, recall that the MAP estimate corresponds to the mode of the posterior distribution, which in the case of a Gaussian is simply its mean.(The mode of a probability distribution is the value where the distribution reaches its maximum point. In the context of the posterior distribution, the mode corresponds to the most probable value of the parameter given the observed data.).

A commonly used alternative to ridge regression is the least absolute shrinkage and selection operator (or lasso), which uses $l_1$ regularization.It turns out that lasso can also be viewed as Bayesian learning, using a Laplace prior.

Note that using point estimates like the MAP estimate does not quantify uncertainty in the weights. The MAP estimate simply collapses all mass of the posterior around its mode.This is especially harmful when we are unsure about the best model.

Recursive Bayesian Updates

As data arrives online (i.e., in “real-time”), we can obtain the new posterior and use it to replace our prior.

Non-linear Regression

We can use linear regression not only to learn linear functions. The trick is to apply a non-linear transformation $\phi:\mathbb{R}^d\to\mathbb{R}^e$.
In Polynomial regression, to learn polynomials of degree m in d input dimensions, we need to apply the nonlinear transformation

the dimension of the feature space grows exponentially in the degree of polynomials and input dimensions. Even for relatively small m and d, this becomes completely unmanageable.

Function-space View

Previously, we have been interpreting it as a distribution over the weights $w$ of a linear function $\hat{f}=\boldsymbol{\Phi}w.$, we can equivalently consider a distribution directly over the estimated function values. Instead of considering a prior over the weights${w\sim\mathcal{N}}(0,\sigma_p^2I)$, we now impose a prior directly on the values of our model at the observations.Using that Gaussians are closed under linear maps, we obtain the equivalent prior:

K is the kernel matrix.The kernel function is:
$k(x,x^{\prime})\doteq\sigma_p^2\cdot\phi(x)^\top\phi(x^{\prime})$
The kernel matrix is a covariance matrix and the kernel function measures the covariance of the function values $k(x,x^{\prime})=\mathrm{Cov}\big[f(x),f(x^{\prime})\big].$
Moreover, note that we have reformulated the learning algorithm such that the feature space is now implicit in the choice of kernel, and the kernel is defined by inner products of (nonlinearly transformed) inputs. In other words, the choice of kernel implicitly determines the class of functions that f is sampled from, and encodes our prior beliefs. This is known as the kernel trick.

reference

[1] A. Krause, “Probabilistic Artificial Intelligence”.

Probabilistic Artificial Intelligence - Gaussian Process

Gaussian Process

Gaussian distribution

Univariate Gaussian Distribution is simple, here we focus on multivariate graussian distribution, where each random variable is distributed normally and their joint distribution is also Gaussian. The multivariate Gaussian distribution is defined by a mean vector $\mu$ and a covariance matrix $\Sigma$. The covariance matrix is always symmetric and positive semi-definite. why is Gaussian distribution so important? because under the assumptions of the central limit theorem, we can use it to model many events in the real world. Moreover, Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. Being closed under conditioning and marginalization means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable. Conditioning is the cornerstone of Gaussian processes since it allows Bayesian inference.

Grassian process

what is GP

A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.
A Gaussian process is characterized by a mean function $\mu$ and a covariance function (or kernel function) k. Intuitively, a Gaussian process can be interpreted as a normal distribution over functions and is therefore often called an infinite-dimensional Gaussian.

Here’s an analogy: Consider a multivariate normal distribution over a set of points in 2D space. Each draw from this distribution corresponds to a vector of values, one for each point. Now, extend this idea to an infinite number of points, and you get a function. The Gaussian process is like having a normal distribution over all possible functions that could describe your data.

Mean and covariance functions

The prior mean function $m(⋅)$ describes the average function under the GP distribution before seeing any data. Therefore, it offers a straightforward way to incorporate prior knowledge about the function we wish to model. In the absence of this type of prior knowledge, a common choice is to set the prior mean function to zero, i.e., $m(⋅)≡0$.

The covariance function $k(x,x’)$ computes the covariance $cov[f(x),f(x′)]$ between the corresponding function values by evaluating the covariance function
k at the corresponding inputs x,x′(kernel trick ).Practically, the covariance function encodes structural assumptions about the class of functions we wish to model. These assumptions are generally at a high level and may include periodicity or differentiability. Practically, the covariance function encodes structural assumptions about the class of functions we wish to model. These assumptions are generally at a high level and may include periodicity or differentiability.

How GP works

For a given set of training points, there are potentially infinitely many functions that fit the data. Gaussian processes offer an elegant solution to this problem by assigning a probability to each of these functions. The goal of Gaussian processes is to learn this underlying distribution from training data. Respective to the test data X, we will denote the training data as Y. As we have mentioned before, the key idea of Gaussian processes is to model the underlying distribution of
X together with Y as a multivariate normal distribution. The essential idea of Bayesian inference is to update the current hypothesis as new information becomes available. In the case of Gaussian processes, this information is the training data. Thus, we are interested in the conditional probability
$P(X∣Y)$.

In Gaussian processes we treat each test point as a random variable. A multivariate Gaussian distribution has the same number of dimensions as the number of random variables. Since we want to predict the function values at $∣X∣=N$ test points, the corresponding multivariate Gaussian distribution is also
N -dimensional. Making a prediction using a Gaussian process ultimately boils down to drawing samples from this distribution. We then interpret the i-th component of the resulting vector as the function value corresponding to the i-th test point.

Marginal likelihood and GP training

A marginal likelihood is a likelihood function that has been integrated over the parameter space. In Bayesian statistics, it represents the probability of generating the observed sample from a prior and is therefore often referred to as model evidence or simply evidence.

The likelihood function represents the probability of observing the given data X given a specific set of parameter values $\theta$ in a statistical model. It expresses how well the parameters explain the observed data. The likelihood function is a key component in frequentist statistics. It is used to estimate the maximum likelihood estimates (MLE) of the parameters.

The marginal likelihood represents the probability of observing the data X without specifying a particular set of parameters. It is obtained by integrating (or summing) the likelihood function over all possible values of the parameters, weighted by the prior distribution of the parameters. The marginal likelihood is a key concept in Bayesian statistics. It serves as a normalizing constant, ensuring that the posterior distribution integrates (or sums) to 1. It is also used in Bayesian model comparison, where different models are compared based on their marginal likelihoods.

To train the GP, we maximize the marginal likelihood with respect to the GP hyperparameters,i.e., the parameters of the mean and covariance functions, which we summarize by $\theta$.

Maximizing the marginal likelihood behaves much better than finding maximum likelihood $\operatorname*{argmax}{f(\mathbf{X}),\sigma_n}p(\mathbf{y}|f(\mathbf{X}),\sigma_n^2\mathbf{I})$ or maximum a-posteriori point estimates $\mathop{\mathrm{argmax}}{f(\mathbf{X}),\sigma_n}p(\mathbf{y}|f(\mathbf{X}),\sigma_n^2\mathbf{I})p(f(\mathbf{X})|\theta)$.
These two approaches would lead to overfitting, since it is possible to get arbitrarily high likelihoods by placing the function values $f(X)$ on top of the observations y and letting the the noise $\sigma_n$ tend to zero. In contrast, the marginal likelihood does not fit function values directly, but integrates them out.By averaging (integrating out) the direct model parameters, i.e., the function values, the marginal likelihood automatically trades off data fit and model complexity.Choose a model that is too inflexible, and the marginal likelihood
$p(y∣X,θ)$ will be low because few functions in the prior fit the data. A model that is too flexible spreads its density over too many datasets, and so $p(y∣X,θ)$ will also be low.

what is kernel

If an algorithm is defined solely in terms of inner products in input space then it can be lifted into feature space by replacing occurrences of those inner products by k(x, x′); this is sometimes called the kernel trick. This technique is kernel trick particularly valuable in situations where it is more convenient to compute the kernel than the feature vectors themselves.

The kernel k, which is often also called covariance function, pairwise on all the points. The kernel receives two points
$t,t’ \in \mathbb{R}^n$ as an input and returns a similarity measure between those points in the form of a scalar.

We evaluate this function for each pairwise combination of the test points to retrieve the covariance matrix.

Kernels can be separated into stationary and non-stationary kernels. Stationary kernels, such as the RBF kernel or the periodic kernel, are functions invariant to translations, and the covariance of two points is only dependent on their relative position. Non-stationary kernels, such as the linear kernel, do not have this constraint and depend on an absolute location.

The kernel is used to define the entries of the covariance matrix. Consequently, the covariance matrix determines which type of functions from the space of all possible functions are more probable.

A kernel function coulde be stationary or isotropic. A kernel function is stationary if $k(x,x’)=k(x-x’)$. A kernel function is isotropic is $k(x,x’)=k(||x-x’||_2)$. Stationarity implies that the covariance function only depends on distances
$∥x−x’∥$ of the corresponding inputs, and not on the location of the individual data points. This means that if the inputs are close to each other, the corresponding function values are strongly correlated.

Interpretation of the hyperparameters

Stationary covariance functions typically contain the term $\frac\tau l=\frac{|\mathbf{x}-\mathbf{x}^{\prime}|}l$. where
$l$ is a lengthscale parameter. Longer lengthscales cause long-range correlations, whereas for short lengthscales, function values are strongly correlated only if their respective inputs are very close to each other. This allows functions to vary strongly and display more flexibility in the range of the data.

The signal variance parameter $\sigma_f^2$ allows us to say something about the amplitude of the function we model.

training tips

The marginal likelihood is non-convex with potentially multiple local optima. Therefore, we may end up in (bad) local optima when we choose a gradient-based optimization method. In order to initialize these parameters to reasonable values when we optimize the marginal likelihood, we need to align them with what we know about the data, either empirically or using prior knowledge. Assume, we have training inputs
X and training targets y. We will see that the signal and noise variances can be initialized using statistics of the training targets, whereas the lengthscale parameters can be initialized using statistics of the training inputs. A reasonable intialization that works well in practice is to set the signal variance to the empirical variance of the observed function values, and the noise variance to a smaller value.

Local optima are the largest problem that prevent good lengthscales from being selected through gradient-based optimisation. Generally, we can observe two different types of local optima:

Long lengthscale, large noise. Often the lengthscale is so long that the prior only allows nearly linear functions in the posterior. As a consequence, a large amount of noise is required to account for the residuals, leading to a small signal-to-noise ratio. This looks like underfitting, as non-linearities in the data are modelled as noise instead of being learned as part of the function.
Short lengthscale, low noise. Short lengthscales allow the posterior mean to fit to small variations in the data. Often such solutions are accompanied by small noise, and therefore a high signal-to-noise ratio. Such solutions look like they overfit, since the means fit the data by making drastic and fast changes, while generalizing poorly. However, the short lengthscale also prevents the predictive error bars from being small, so all predictions will be made with high uncertainty. In the probabilistic sense, this also looks like underfitting.

Which optimum we end up in, depends on the initialization of our lengthscale as we are likely to end up in a local optimum nearest to our initial choice. In both cases, the optimizer is more likely to get stuck in a local optimum if the situations are a somewhat plausible explanations of the data. In practice, it is harder to get out of a long lengthscale situation since the optimizer often struggles to get beyond the (typically) huge plateau that is typical for very long lengthscales.

How to choose a kernel

The choice of kernel (a.k.a. covariance function) determines almost all the generalization properties of a GP model.
​In fact, you might decide that choosing the kernel is one of the main difficulties in doing inference - and just as you don’t know what the true parameters are, you also don’t know what the true kernel is. Probably, you should try out a few different kernels at least, and compare their marginal likelihood on your training data.

others

The GP does not require any input normalization, but it can make sense to do so for numerical reasons.

reference

https://distill.pub/2019/visual-exploration-gaussian-processes/
https://www.cs.toronto.edu/~duvenaud/cookbook/ \
https://infallible-thompson-49de36.netlify.app/ \
A. Krause, “Probabilistic Artificial Intelligence”.

机器学习(5)--不平衡的分类问题

什么是不平衡的分类问题

在机器学习中,不平衡的分类问题指的是类别之间的样本分布不均匀,其中某一类的样本数量远远超过另一类。这种不平衡可能会对模型训练和性能评估产生影响,因为模型可能更倾向于预测样本数更多的类别,而对样本数较少的类别进行较差的预测。

如何解决

为了解决不平衡分类问题,可以考虑以下方法:增加少数类别的样本数或减少多数类别的样本数,以平衡类别分布。这包括上采样(增加少数类别样本)和下采样(减少多数类别样本)。调整分类阈值,使模型更倾向于识别少数类别。这可以通过调整模型输出的概率阈值来实现。

机器学习(4)-uplift模型

uplift模型

uplift模型中文为增益模型,是工业界因果推断与机器学习结合最成熟的算法之一。传统的监督学习模型,往往是对输入x去预测一个y,而增益模型注重于x的变化对y的影响,以广告为例,传统的监督学习往往是给定这个特征去预测用户是否会点击,而增益模型注重的是给这个客户投放广告与否对客户是否购买广告商品所产生的影响。

因果推断

因果推断是从观察到的数据中推断出变量之间的因果关系的过程。在统计学和数据科学中,因果推断涉及到尝试理解一个事件或行为是什么导致了另一个事件或行为。这与相关性或关联不同,因果推断试图确定一个变量的变化是否直接导致另一个变量的变化。

机器学习(4)-决策树

决策树

决策树是一种用于分类和回归问题的监督学习算法。它通过树状图的结构来表示和推断决策规则。每个内部节点表示一个特征或属性,每个分支代表一个决策规则,而每个叶节点表示一个类别标签或一个数值。

决策树的学习过程形成了一个递归的分治算法,其中每个节点都对应于一个特征,并且通过节点上的决策规则将数据集分割成更纯的子集。在决策树的构建过程中,选择最佳特征和分割数据的目标是提高每个节点的纯度,使得决策树在训练数据上达到最佳的拟合效果。

机器学习(3)-支持向量机

支持向量机

支持向量机(Support Vector Machine,SVM)是一种用于分类和回归分析的监督学习算法。它的主要目标是找到一个超平面,将数据集划分成两个类别,同时使得两个类别中距离最近的数据点到该超平面的距离最大化。这两个最近的数据点被称为支持向量。

SVM 可以使用核函数来处理非线性可分的数据。核函数可以将输入特征映射到高维空间,从而在高维空间中找到一个线性超平面来解决原始空间中的非线性问题。

支持向量机公式

硬间隔支持向量机(Hard Margin SVM)

硬间隔支持向量机适用于线性可分的数据集。

\subsubsection*{优化问题}

目标是最大化间隔,同时保证所有样本都正确分类:

约束条件:

其中:

  • $ \mathbf{w} $ 是权重向量。
  • $ b $ 是偏置项。
  • $ \mathbf{x}_i $ 是第 $ i $ 个样本的特征向量。
  • $ y_i $ 是第 $ i $ 个样本的标签,取值为 $ \pm 1 $。

软间隔支持向量机(Soft Margin SVM)

软间隔支持向量机适用于线性不可分的数据集,允许一定程度的误分类。

\subsubsection*{优化问题}

目标是最大化间隔,同时引入松弛变量 $ \xi_i $ 来允许一些样本在间隔内部或错误分类:

约束条件:

其中:

  • $ C $ 是正则化参数,控制误分类和间隔大小之间的平衡。
  • $ \xi_i $ 是松弛变量,表示第 $ i $ 个样本允许的误分类程度。

对偶问题

通过拉格朗日乘子法,可以将原始问题转化为对偶问题。

\subsubsection*{拉格朗日函数}

其中:

  • $ \alpha_i $ 是拉格朗日乘子。
  • $ \mu_i $ 是非负的拉格朗日乘子。

\subsubsection*{对偶问题}

通过求解拉格朗日函数关于 $ \mathbf{w} $ 和 $ b $ 的偏导数并令其为零,可以得到对偶问题:

约束条件:

决策函数

通过求解对偶问题,可以得到支持向量机的决策函数:

其中,只有当 $ \alpha_i > 0 $ 时,对应的样本 $ \mathbf{x}_i $ 才是支持向量。

核技巧(Kernel Trick)

对于非线性可分的数据集,可以通过核函数将数据映射到高维空间,使数据在高维空间中线性可分。

常见的核函数

  • 线性核

  • 多项式核

  • 高斯核(RBF核)

  • Sigmoid核

机器学习(2)-朴素贝叶斯

朴素贝叶斯

朴素贝叶斯(Naive Bayes)是一种基于贝叶斯定理的分类算法,被广泛用于文本分类和其他分类问题。它被称为”朴素”是因为它假设每个特征与其他特征之间都是相互独立的,这是一个较为简化的假设,但在实践中,朴素贝叶斯通常表现得相当好。

在朴素贝叶斯中,我们考虑一个分类问题,其中 A 是类别,而 B 是特征。贝叶斯定理用于计算给定特征的情况下某个类别的概率。我们可以使用训练数据中的频率估计概率,并计算每个类别的概率。然后,给定一个新的特征向量,我们可以使用贝叶斯定理计算每个类别的后验概率,并选择具有最高概率的类别作为预测结果。

朴素贝叶斯公式

贝叶斯定理

贝叶斯定理描述了在已知某些证据的情况下,某个假设的概率:

其中:

  • ( P(A|B) ) 是在事件 B 发生的情况下,事件 A 发生的后验概率。
  • ( P(B|A) ) 是在事件 A 发生的情况下,事件 B 发生的条件概率。
  • ( P(A) ) 是事件 A 发生的先验概率。
  • ( P(B) ) 是事件 B 发生的边缘概率。

\subsection*{朴素贝叶斯分类器}

在朴素贝叶斯分类器中,假设特征之间相互独立。对于一个给定的样本 ( x = (x_1, x_2, \ldots, x_n) ),我们需要计算每个类别的后验概率 ( P(C_k | x) ),并选择具有最高后验概率的类别作为预测结果。

后验概率

根据贝叶斯定理,后验概率可以表示为:

其中:

  • ( P(C_k | x) ) 是在特征向量 ( x ) 给定的情况下,类别 ( C_k ) 的后验概率。
  • ( P(x | C_k) ) 是在类别 ( C_k ) 给定的情况下,特征向量 ( x ) 的条件概率。
  • ( P(C_k) ) 是类别 ( C_k ) 的先验概率。
  • ( P(x) ) 是特征向量 ( x ) 的边缘概率。

条件概率

由于假设特征之间相互独立,条件概率 ( P(x | C_k) ) 可以分解为各个特征的条件概率的乘积:

类别先验概率

类别先验概率 ( P(C_k) ) 通常通过训练数据中的类别频率来估计:

边缘概率

边缘概率 ( P(x) ) 可以通过全概率公式计算:

但在实际应用中,通常不需要显式计算 ( P(x) ),因为它是所有类别的后验概率的归一化常数,不会影响类别的选择。

最大后验概率

选择具有最大后验概率的类别作为预测结果:

机器学习(1)-逻辑回归

逻辑回归

逻辑回归(Logistic Regression)是一种用于解决二分类问题的监督学习算法,尽管名称中包含”回归”一词,但实际上它用于分类任务。
逻辑回归使用一个假设函数(sigmoid函数),将输入特征的线性组合映射到一个在0和1之间的概率值。逻辑回归将概率值转换为二分类的决策,通常使用一个阈值(例如0.5)。逻辑回归使用交叉熵损失函数来衡量预测概率与实际标签之间的差异。损失函数的目标是最小化误差。

交叉熵损失函数(Cross-Entropy Loss)

逻辑回归的梯度计算

对于单个样本 $(x^{(i)}, y^{(i)})$,其损失为:

其中 $\hat{y}^{(i)} = \sigma(z^{(i)})$ 是模型的输出,$z^{(i)} = w^T x^{(i)} + b$,而 $\sigma(z) = \frac{1}{1 + e^{-z}}$ 是 sigmoid 函数。

对于整个训练集,总损失 $J(w, b)$ 为所有样本损失的平均值:

对 $w_j$ 的梯度

对 $b$ 的梯度

参数更新

在每次迭代中,参数 $w$ 和 $b$ 会根据计算出的梯度进行更新:

其中 $\alpha$ 是学习率,控制着每次参数更新的步伐大小。