Gaussian Process
Gaussian distribution
Univariate Gaussian Distribution is simple, here we focus on multivariate graussian distribution, where each random variable is distributed normally and their joint distribution is also Gaussian. The multivariate Gaussian distribution is defined by a mean vector $\mu$ and a covariance matrix $\Sigma$. The covariance matrix is always symmetric and positive semi-definite. why is Gaussian distribution so important? because under the assumptions of the central limit theorem, we can use it to model many events in the real world. Moreover, Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. Being closed under conditioning and marginalization means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable. Conditioning is the cornerstone of Gaussian processes since it allows Bayesian inference.
Grassian process
what is GP
A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.
A Gaussian process is characterized by a mean function $\mu$ and a covariance function (or kernel function) k. Intuitively, a Gaussian process can be interpreted as a normal distribution over functions and is therefore often called an infinite-dimensional Gaussian.
Here’s an analogy: Consider a multivariate normal distribution over a set of points in 2D space. Each draw from this distribution corresponds to a vector of values, one for each point. Now, extend this idea to an infinite number of points, and you get a function. The Gaussian process is like having a normal distribution over all possible functions that could describe your data.
Mean and covariance functions
The prior mean function $m(⋅)$ describes the average function under the GP distribution before seeing any data. Therefore, it offers a straightforward way to incorporate prior knowledge about the function we wish to model. In the absence of this type of prior knowledge, a common choice is to set the prior mean function to zero, i.e., $m(⋅)≡0$.
The covariance function $k(x,x’)$ computes the covariance $cov[f(x),f(x′)]$ between the corresponding function values by evaluating the covariance function
k at the corresponding inputs x,x′(kernel trick ).Practically, the covariance function encodes structural assumptions about the class of functions we wish to model. These assumptions are generally at a high level and may include periodicity or differentiability. Practically, the covariance function encodes structural assumptions about the class of functions we wish to model. These assumptions are generally at a high level and may include periodicity or differentiability.
How GP works
For a given set of training points, there are potentially infinitely many functions that fit the data. Gaussian processes offer an elegant solution to this problem by assigning a probability to each of these functions. The goal of Gaussian processes is to learn this underlying distribution from training data. Respective to the test data X, we will denote the training data as Y. As we have mentioned before, the key idea of Gaussian processes is to model the underlying distribution of
X together with Y as a multivariate normal distribution. The essential idea of Bayesian inference is to update the current hypothesis as new information becomes available. In the case of Gaussian processes, this information is the training data. Thus, we are interested in the conditional probability
$P(X∣Y)$.
In Gaussian processes we treat each test point as a random variable. A multivariate Gaussian distribution has the same number of dimensions as the number of random variables. Since we want to predict the function values at $∣X∣=N$ test points, the corresponding multivariate Gaussian distribution is also
N -dimensional. Making a prediction using a Gaussian process ultimately boils down to drawing samples from this distribution. We then interpret the i-th component of the resulting vector as the function value corresponding to the i-th test point.
Marginal likelihood and GP training
A marginal likelihood is a likelihood function that has been integrated over the parameter space. In Bayesian statistics, it represents the probability of generating the observed sample from a prior and is therefore often referred to as model evidence or simply evidence.
The likelihood function represents the probability of observing the given data X given a specific set of parameter values $\theta$ in a statistical model. It expresses how well the parameters explain the observed data. The likelihood function is a key component in frequentist statistics. It is used to estimate the maximum likelihood estimates (MLE) of the parameters.
The marginal likelihood represents the probability of observing the data X without specifying a particular set of parameters. It is obtained by integrating (or summing) the likelihood function over all possible values of the parameters, weighted by the prior distribution of the parameters. The marginal likelihood is a key concept in Bayesian statistics. It serves as a normalizing constant, ensuring that the posterior distribution integrates (or sums) to 1. It is also used in Bayesian model comparison, where different models are compared based on their marginal likelihoods.
To train the GP, we maximize the marginal likelihood with respect to the GP hyperparameters,i.e., the parameters of the mean and covariance functions, which we summarize by $\theta$.
Maximizing the marginal likelihood behaves much better than finding maximum likelihood $\operatorname*{argmax}{f(\mathbf{X}),\sigma_n}p(\mathbf{y}|f(\mathbf{X}),\sigma_n^2\mathbf{I})$ or maximum a-posteriori point estimates $\mathop{\mathrm{argmax}}{f(\mathbf{X}),\sigma_n}p(\mathbf{y}|f(\mathbf{X}),\sigma_n^2\mathbf{I})p(f(\mathbf{X})|\theta)$.
These two approaches would lead to overfitting, since it is possible to get arbitrarily high likelihoods by placing the function values $f(X)$ on top of the observations y and letting the the noise $\sigma_n$ tend to zero. In contrast, the marginal likelihood does not fit function values directly, but integrates them out.By averaging (integrating out) the direct model parameters, i.e., the function values, the marginal likelihood automatically trades off data fit and model complexity.Choose a model that is too inflexible, and the marginal likelihood
$p(y∣X,θ)$ will be low because few functions in the prior fit the data. A model that is too flexible spreads its density over too many datasets, and so $p(y∣X,θ)$ will also be low.
what is kernel
If an algorithm is defined solely in terms of inner products in input space then it can be lifted into feature space by replacing occurrences of those inner products by k(x, x′); this is sometimes called the kernel trick. This technique is kernel trick particularly valuable in situations where it is more convenient to compute the kernel than the feature vectors themselves.
The kernel k, which is often also called covariance function, pairwise on all the points. The kernel receives two points
$t,t’ \in \mathbb{R}^n$ as an input and returns a similarity measure between those points in the form of a scalar.
We evaluate this function for each pairwise combination of the test points to retrieve the covariance matrix.
Kernels can be separated into stationary and non-stationary kernels. Stationary kernels, such as the RBF kernel or the periodic kernel, are functions invariant to translations, and the covariance of two points is only dependent on their relative position. Non-stationary kernels, such as the linear kernel, do not have this constraint and depend on an absolute location.
The kernel is used to define the entries of the covariance matrix. Consequently, the covariance matrix determines which type of functions from the space of all possible functions are more probable.
A kernel function coulde be stationary or isotropic. A kernel function is stationary if $k(x,x’)=k(x-x’)$. A kernel function is isotropic is $k(x,x’)=k(||x-x’||_2)$. Stationarity implies that the covariance function only depends on distances
$∥x−x’∥$ of the corresponding inputs, and not on the location of the individual data points. This means that if the inputs are close to each other, the corresponding function values are strongly correlated.
Interpretation of the hyperparameters
Stationary covariance functions typically contain the term $\frac\tau l=\frac{|\mathbf{x}-\mathbf{x}^{\prime}|}l$. where
$l$ is a lengthscale parameter. Longer lengthscales cause long-range correlations, whereas for short lengthscales, function values are strongly correlated only if their respective inputs are very close to each other. This allows functions to vary strongly and display more flexibility in the range of the data.
The signal variance parameter $\sigma_f^2$ allows us to say something about the amplitude of the function we model.
training tips
The marginal likelihood is non-convex with potentially multiple local optima. Therefore, we may end up in (bad) local optima when we choose a gradient-based optimization method. In order to initialize these parameters to reasonable values when we optimize the marginal likelihood, we need to align them with what we know about the data, either empirically or using prior knowledge. Assume, we have training inputs
X and training targets y. We will see that the signal and noise variances can be initialized using statistics of the training targets, whereas the lengthscale parameters can be initialized using statistics of the training inputs. A reasonable intialization that works well in practice is to set the signal variance to the empirical variance of the observed function values, and the noise variance to a smaller value.
Local optima are the largest problem that prevent good lengthscales from being selected through gradient-based optimisation. Generally, we can observe two different types of local optima:
Long lengthscale, large noise. Often the lengthscale is so long that the prior only allows nearly linear functions in the posterior. As a consequence, a large amount of noise is required to account for the residuals, leading to a small signal-to-noise ratio. This looks like underfitting, as non-linearities in the data are modelled as noise instead of being learned as part of the function.
Short lengthscale, low noise. Short lengthscales allow the posterior mean to fit to small variations in the data. Often such solutions are accompanied by small noise, and therefore a high signal-to-noise ratio. Such solutions look like they overfit, since the means fit the data by making drastic and fast changes, while generalizing poorly. However, the short lengthscale also prevents the predictive error bars from being small, so all predictions will be made with high uncertainty. In the probabilistic sense, this also looks like underfitting.
Which optimum we end up in, depends on the initialization of our lengthscale as we are likely to end up in a local optimum nearest to our initial choice. In both cases, the optimizer is more likely to get stuck in a local optimum if the situations are a somewhat plausible explanations of the data. In practice, it is harder to get out of a long lengthscale situation since the optimizer often struggles to get beyond the (typically) huge plateau that is typical for very long lengthscales.
How to choose a kernel
The choice of kernel (a.k.a. covariance function) determines almost all the generalization properties of a GP model.
In fact, you might decide that choosing the kernel is one of the main difficulties in doing inference - and just as you don’t know what the true parameters are, you also don’t know what the true kernel is. Probably, you should try out a few different kernels at least, and compare their marginal likelihood on your training data.
others
The GP does not require any input normalization, but it can make sense to do so for numerical reasons.
reference
https://distill.pub/2019/visual-exploration-gaussian-processes/
https://www.cs.toronto.edu/~duvenaud/cookbook/ \
https://infallible-thompson-49de36.netlify.app/ \
A. Krause, “Probabilistic Artificial Intelligence”.