pai - Markov Decision Processes

Markov Decision Processes

Planning deals with the problem of deciding which action an agent should play in a (stochastic) environment(An environment is stochastic as opposed to deterministic, when the outcome of actions is random.). A key formalism for probabilistic plan ning in known environments are so-called Markov decision processes.
Alt text

Our fundamental objective is to learn how the agent should behave to optimize its reward. In other words, given its current state, the agent should decide (optimally) on the action to play. Such a decision map — whether optimal or not — is called a policy.

Alt text

Alt text

For the purpose of our discussion of Markov decision processes and reinforcement learning, we will focus on a very common reward called discounted payoff.

Alt text

Alt text

Because we assumed stationary dynamics, rewards, and policies, the discounted payoff starting from a given state x will be independent of the start time t.

Bellman Expectation Equation

Let us now see how we can compute the value function:

Alt text

This equation is known as the Bellman expectation equation, and it shows a recursive dependence of the value function on itself. The intuition is clear: the value of the current state corresponds to the reward from the next action plus the discounted sum of all future rewards obtained from the subsequent states.

Policy Evaluation

Bellman’s expectation equation tells us how we can find the value function vπ of a fixed policy π using a system of linear equations.

Alt text

Fixed-point Iteration

Alt text

Policy Optimization

Greedy Policies

Alt text

Bellman Optimality Equation

Alt text

Alt text
These equations are also called the Bellman optimality equations. Intuitively, the Bellman optimality equations express that the value of a state under an optimal policy must equal the expected return for the best action from that state. Bellman’s theorem is also known as Bellman’s optimality principle, which is a more general concept.

Policy Iteration

Alt text

It can be shown that policy iteration converges to an exact solution in a polynomial number of iterations.Each iteration of policy iteration requires computing the value function, which we have seen to be of cubic complexity in the number of states.

Value Iteration

Alt text
Value iteration converges to an ε-optimal solution in a polynomial number of iterations. Unlike policy iteration, value iteration does not converge to an exact solution in general.an iteration of 7 Sparsity refers to the interconnectivity of the state space. When only few states are reachable from any state, we call an MDP sparse. value iteration can be performed in (virtually) constant time in sparse Markov decision processes.

Partial Observability

In this section, we consider how Markov decision processes can be extended to a partially observable setting where the agent can only access noisy observations Yt of its state Xt.

Alt text

Observe that a Kalman filter can be viewed as a hidden Markov model with conditional linear Gaussian motion and sensor models and a Gaussian prior on the initial state.

POMDPs can be reduced to a Markov decision process with an enlarged state space.

pai - Bayesian Optimization

Bayesian Optimization

Alt text

Exploration-Exploitation Dilemma

In Bayesian optimization, we want to learn a model of f ⋆ and use this model to optimize f ⋆ simultaneously. These goals are somewhat contrary. Learning a model of f ⋆ requires us to explore the input space while using the model to optimize f ⋆ requires us to focus on the most promising well-explored areas. This trade-off is commonly known as the exploration-exploitation dilemma.

It is common to use a so-called acquisition function to greedily pick the next point to sample based on the current model.

Online Learning and Bandits

Alt text

Multi-Armed Bandits(MAB)

The “multi-armed bandits” (MAB) problem is a classical, canonical formalization of the exploration-exploitation dilemma. In the MAB problem, we are provided with k possible actions (arms) and want to maximize our reward online within the time horizon T. We do not know the reward distributions of the actions in advance, however, so we need to trade learning the reward distribution with following the most promising action.

Bayesian optimization can be interpreted as a variant of the MAB problem where there can be a potentially infinite number of actions (arms), but their rewards are correlated (because of the smoothness of the Gaussian process prior). Smoothness means that nearby points in the input space are likely to have similar function values. Because of the smoothness property inherent in the Gaussian process prior, Bayesian optimization can make informed decisions about where to explore next in the input space. The model can leverage information from previously evaluated points to predict the behavior of the objective function at unexplored points more effectively.

Regret

The key performance metric in online learning is the regret.

Alt text

Achieving sublinear regret requires balancing exploration with exploitation. Typically, online learning (and Bayesian optimization) consider stationary environments, hence the comparison to the static optimum.

Acquisition Functions

Throughout our description of acquisition functions, we will focus on a setting where we model $f^⋆$ using a Gaussian process which we denote by f. The methods generalize to other means of learning $f^⋆$ such as Bayesian neural networks.

Alt text

One possible acquisition function is uncertainty sampling. However, this acquisition function does not at all take into account the objective of maximizing $f^⋆$ and focuses solely on exploration.

Suppose that our model f of $f^⋆$ is well-calibrated, in the sense that the true function lies within its confidence bounds. Consider the best lower bound, that is, the maximum of the lower confidence bound. Now, if the true function is really contained in the confidence bounds, it must hold that the optimum is somewhere above this best lower bound.

Therefore, we only really care how the function looks like in the regions where the upper confidence bound is larger than the best lower bound. The key idea behind the methods that we will explore is to focus exploration on these plausible maximizers.

Upper Confidence Bound

Alt text
This acquisition function naturally trades exploitation by preferring a large posterior mean with exploration by preferring a large posterior variance.
This optimization problem is non-convex in general. However, we can use approximate global optimization techniques like Lipschitz optimization (in low dimensions) and gradient ascent with random initialization (in high dimensions). Another widely used option is to sample some random points from the domain, score them according to this criterion, and simply take the best one.

Alt text
Observe that if the information gain is sublinear in T then we achieve sublinear regret and, in particular, converge to the true optimum. The term “sublinear” refers to a growth rate that is slower than linear.

Alt text
Intuitively, to work even if the unknown function $f^⋆$ is not contained in the confidence bounds, we use βt to re-scale the confidence bounds to enclose $f^⋆$.

Probability of Improvement

Alt text

Probability of improvement tends to be biased in favor of exploitation, as it prefers points with large posterior mean and small posterior variance.

Expected Improvement

Alt text

Intuitively, expected improvement seeks a large expected improvement (exploitation) while also preferring states with a large variance (exploration).

Thompson Sampling

Alt text

Probability matching is exploratory as it prefers points with larger variance (as they automatically have a larger chance of being optimal), but at the same time exploitative as it effectively discards points with low posterior mean and low posterior variance. Unfortunately, it is generally difficult to compute π analytically given a posterior. Instead, it is common to use a sampling-based approximation of π.

Alt text

In many cases, the randomness in the realizations of ̃ ft+1 is already sufficient to effectively trade exploration and exploitation.

Model Selection

Selecting a model of f ⋆ is much harder than in the i.i.d. data setting of supervised learning. There are mainly the two following dangers, • the data sets collected in active learning and Bayesian optimization are small; and • the data points are selected dependently on prior observations. This leads to a specific danger of overfitting. In particular, due to feedback loops between the model and the acquisition function, one may end up sampling the same point repeatedly.

Another approach that often works fairly well is to occasionally (according to some schedule) select points uniformly at random instead of using the acquisition function. This tends to prevent getting stuck in suboptimal parts of the state space.

difference betwwen active learning and Bayesian Optimization

Problem Setting:
Active Learning: Active learning typically deals with supervised learning tasks where there is a large pool of unlabeled instances, and the algorithm decides which instances to query for labels to improve the model.
Bayesian Optimization: Bayesian optimization deals with optimization problems where the objective function is unknown, expensive to evaluate, and possibly noisy. It aims to find the global optimum with as few evaluations as possible.

Nature of Queries:
Active Learning: In active learning, the queries are often in the form of “Which instance should be labeled next?” The goal is to select instances that will most benefit the model’s learning process.
Bayesian Optimization: In Bayesian optimization, the queries are in the form of “What point should be evaluated next in the input space to maximize/minimize the objective function?” The goal is to efficiently explore the input space and find the optimal configuration.

Algorithmic Approaches:
Active Learning: Active learning involves various strategies such as uncertainty sampling, query-by-committee, and diversity sampling to select informative instances for labeling.
Bayesian Optimization: Bayesian optimization employs probabilistic surrogate models (often Gaussian processes) to model the unknown objective function. Acquisition functions guide the search to balance exploration and exploitation efficiently.

pai - active learning

Active learning

Active learning is a machine learning paradigm in which a model is trained on a dataset that is not entirely labeled. Instead of relying solely on a fully labeled dataset for training, active learning involves an iterative process where the model actively selects the most informative instances from an unlabeled pool, queries an oracle (typically a human annotator), and adds the newly labeled instances to the training set. The model is then retrained on the expanded labeled dataset.

The key idea behind active learning is to strategically choose the most valuable instances for labeling, with the goal of improving the model’s performance while minimizing the number of labeled instances needed. This process is especially beneficial when obtaining labeled data is expensive or time-consuming.

Active learning strategies vary in how they select instances for labeling. Common strategies include uncertainty sampling (select instances where the model is uncertain), query-by-committee (select instances where different model hypotheses disagree), and diversity sampling (select instances to ensure diverse coverage of the input space).

Conditional Entropy

Alt text

Intuitively, the conditional entropy of X given Y describes our average surprise about realizations of X given a particular realization of Y, averaged over all such possible realizations of Y. In other words, conditional entropy corresponds to the expected remaining uncertainty in X after we observe Y.

Alt text
That is, the joint entropy of X and Y is given by the uncertainty about X and the additional uncertainty about Y given X.

Mutual Information

Alt text

In words, we subtract the uncertainty left about X after observing Y from our initial uncertainty about X. This measures the reduction in our uncertainty in X (as measured by entropy) upon observing Y.

Alt text

Thus, the mutual information between X and Y can be understood as the approximation error (or information loss) when assuming that X and Y are independent.

Alt text

Thus, the conditional mutual information corresponds to the reduction of uncertainty in X when observing Y, given we already observed Z.

Following our introduction of mutual information, it is natural to answer the question “where should I collect data?” by saying “wherever mutual information is maximized”.

Submodularity of Mutual Information

Alt text

Alt text

That is, “adding” x to the smaller set A yields more marginal gain than adding x to the larger set B. In other words, the function F has “diminishing returns”. In this way, submodularity can be interpreted as a notion of “concavity” for discrete functions.

Alt text

Maximizing Mutual Information

Uncertainty Sampling

Alt text

Therefore, if f is modeled by a Gaussian and we assume homoscedastic noise, greedily maximizing mutual information corresponds to simply picking the point x with the largest variance. This strategy is also called uncertainty sampling.

Heteroscedastic Noise

Uncertainty sampling is clearly problematic if the noise is heteroscedastic. If there are a particular set of inputs with a large aleatoric uncertainty dominating the epistemic uncertainty, uncertainty sampling will continuously choose those points even though the epistemic uncertainty will not be reduced substantially.
Alt text

Thus, we choose locations that trade large epistemic uncertainty with large aleatoric uncertainty. Ideally, we find a location where the epistemic uncertainty is large, and the aleatoric uncertainty is low, which promises a significant reduction of uncertainty around this location.

Classification

Here, uncertainty sampling corresponds to selecting samples that maximize the entropy of the predicted label yx.

Probabilistic Artificial Intelligence - Bayesian Deep Learning

SWAG(Stochastic Weight Averaging Gaussian)

This paper proposes a different approach to Bayesian deep learning: they use the information contained in the SGD trajectory to efficiently approximate the posterior distribution over the weights of the neural network [1].

SWA

This paper shows that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training [2].

cyclical learning rate schedule

Calibration of Modern Neural Networks

Confidence calibration – the problem of predicting probability estimates representative of the true correctness likelihood – is important for classification models in many applications. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration.

references

[1] Maddox W J, Izmailov P, Garipov T, et al. A simple baseline for bayesian uncertainty in deep learning[J]. Advances in neural information processing systems, 2019, 32.
[2] Izmailov P, Podoprikhin D, Garipov T, et al. Averaging weights leads to wider optima and better generalization[J]. arXiv preprint arXiv:1803.05407, 2018.
[3] Guo C, Pleiss G, Sun Y, et al. On calibration of modern neural networks[C]//International conference on machine learning. PMLR, 2017: 1321-1330.

Probabilistic Artificial Intelligence - Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods

The key idea of Markov chain Monte Carlo methods is to construct a Markov chain, which is efficient to simulate and has the stationary distribution p.

Markov Chains

Alt text

Intuitively, the Markov property states that future behavior is independent of past states given the present state.

Markov chains can be classified into different types based on their behavior. These classifications include time-homogeneous and time-inhomogeneous Markov chains, irreducible Markov chains, and periodic and aperiodic Markov chains.

We restrict our attention to time-homogeneous Markov chains. That is, the transition probabilities do not change over time, which can be characterized by a transition function.

Irreducible Markov chains are those in which every state can be reached from any other state, while periodic Markov chains exhibit a repeating pattern in their states. On the other hand, aperiodic Markov chains do not exhibit any regular pattern in their states. If there is no fixed period at which the probabilities return to the starting values, the chain is classified as aperiodic. Aperiodic Markov chains often display a more complex and unpredictable behavior compared to periodic ones.

Stationarity

Alt text

Alt text

Convergence

Alt text

Alt text

Alt text

A Markov Chain is called ergodic, if there exists a finite t such that every state can be reached from every state in exactly t steps.

Fundamental theorem of ergodic Markov chains

Alt text

Detailed Balance Equation

Alt text

Ergodic Theorem

Alt text

This result is the fundamental reason for why Markov chain Monte Carlo methods are possible. In practice, one observes that Markov chain Monte Carlo methods have a so-called “burn-in” time during which the distribution of the Markov chain does not yet approximate the posterior distribution well. Typically, the first t0 samples are therefore discarded. It is not clear in general how T and t0 should be chosen such that the estimator is unbiased, rather they have to be tuned.

Elementary Sampling Methods

Metropolis-Hastings Algorithm

Alt text

Gibbs Sampling

A popular example of a Metropolis-Hastings algorithm is Gibbs sampling.

Alt text

Intuitively, by re-sampling single coordinates according to the posterior distribution given the other coordinates, Gibbs sampling finds states that are successively “more” likely.

Langevin Dynamics

Gibbs Distributions

references

https://www.collimator.ai/reference-guides/what-is-a-aperiodic-markov-chain

Probabilistic Artificial Intelligence - Variational Inference

Variational Inference

The fundamental idea behind variational inference is to approximate the true posterior distribution using a “simpler” posterior that is as close as possible to the true posterior.

where λ represents the parameters of the variational posterior q, also called variational parameters.

Laplace Approximation

A natural idea for approximating the posterior distribution is to use a Gaussian approximation (that is, a second-order Taylor approximation) around the mode of the posterior. Such an approximation is known as a Laplace approximation.

As an example, we will look at Laplace approximation in the context of Bayesian logistic regression. Bayesian logistic regression corresponds to Bayesian linear regression with a Bernoulli likelihood.

Intuitively, the Laplace approximation matches the shape of the true posterior around its mode but may not represent it accurately elsewhere — often leading to extremely overconfident predictions.

Inference with a Variational

Information Theory

Surprise

The surprise about an event with probability u is defined as $S[u] = -log u$.
Alt text

Entropy

The entropy of a distribution p is the average surprise about samples from p.if the entropy of p is large, we are more uncertain about x ∼ p than if the entropy of p were low.

Cross-Entropy

Alt text
Cross-entropy can also be expressed in terms of the KL-divergence. Quite intuitively, the average surprise in samples from p with respect to the distribution q is given by the inherent uncertainty in p and the additional surprise that is due to us assuming the wrong data distribution q. The “closer” q is to the true data distribution p, the smaller is the additional average surprise.

Variational Families

We can view variational inference more generally as a family of approaches aiming to approximate the true posterior distribution by one that is closest (according to some criterion) among a “simpler” class of distributions.
Alt text

Kullback-Leibler Divergence

Alt text

In words, KL(p∥q) measures the additional expected surprise when observing samples from p that is due to assuming the (wrong) distribution q and which not inherent in the distribution p already. Intuitively, the KL-divergence measures the information loss when approximating p with q.
Alt text

Forward and reverse KL-divergence

KL(p∥q) is also called the forward (or inclusive) KL-divergence. In contrast, KL(q∥p) is called the reverse (or exclusive) KL-divergence.

It can be seen that the reverse KL-divergence tends to greedily select the mode and underestimating the variance which, in this case, leads to an overconfident prediction. The forward KL-divergence, in contrast, is more conservative and yields what one could consider the “desired” approximation.

The reverse KL-divergence is typically used in practice. This is for computational reasons. In principle, if the variational family Q is “rich enough”, minimizing the forward and reverse KL-divergences will yield the same result.

There is a nice interpretation of minimizing the forward KullbackLeibler divergence of the approximation $q_\lambda$ with respect to the true distribution p as being equivalent to maximizing the marginal model likelihood on a sample of infinite size. This interpretation is not useful in the setting where p is a posterior distribution over model parameters θ as a maximum likelihood estimate requires samples from p which we cannot obtain in this case.

KL-divergence of Gaussians

Alt text

Alt text

Alt text

Evidence Lower Bound

The Evidence Lower Bound is a quantity maximization of which is equivalent to minimizing the KL-divergence. As its name suggests, the evidence lower bound is a (uniform) lower bound to the log-evidence $log p(y{1:n}|x{1:n})$.
Alt text

Alt text

Note that this inequality lower bounds the logarithm of an integral by an expectation of a logarithm over some variational distribution q. Hence, the ELBO is a family of lower bounds — one for each variational distribution. Such inequalities are called variational inequalities.

Alt text

Gradient of Evidence Lower Bound

Alt text
Obtaining the gradient of the evidence lower bound is more difficult. This is because the expectation integrates over the measure $q_\lambda$, which depends on the variational parameters $\lambda$. Thus, we cannot move the gradient operator inside the expectation.

There are two main techniques which are used to rewrite the gradient in such a way that Monte Carlo sampling becomes possible. The first is to use score gradients. The second is the so-called reparameterization trick.

reparameterization trick

Alt text
Alt text

Alt text

The procedure of approximating the true posterior using a variational posterior by maximizing the evidence lower bound using stochastic optimization is also called black box stochastic variational inference. The only requirement is that we can obtain unbiased gradient estimates from the evidence lower bound (and the likelihood).

Probabilistic Artificial Intelligence - Bayesian Linear Regression

Bayesian Linear Regression

Linear Regression

Given a set of (x, y) pairs, linear regression aims to find a linear model that fits the data optimally.
Given the linear model $y=w^Tx$, we want to find optimal weights $w$.
There are many ways of estimating w from data, the most common being the least squares estimator:

A slightly different estimator is used for ridge regression:

As the formula shows, the squared $l_2$ regularization term $\lambda|w|_2^2$ penalizes large $w$ and thus reduces the complexity of the resulting model,so Ridge regression is more robust than standard linear regression in the presence of multicollinearity. Multicollinearity occurs when multiple independent inputs are highly correlated. In this case, their individual effects on the predicted variable cannot be estimated well. Classical linear regression is highly volatile to small input changes. The regularization of ridge regression reduces this volatility by introducing a bias on the weights towards 0.

uncertainty

In practice, our data D is merely a sample of the process we are modeling. In these cases, we are looking for models that generalize to unseen data.
Therefore, it is useful to express the uncertainty about our model due to the lack of data. This uncertainty is commonly referred to as the epistemic uncertainty.
Usually, there is another source of uncertainty called aleatoric uncertainty, which originates directly from the process that we are modeling. This uncertainty is the noise in the labels that cannot be explained by the inputs.

Weight-space View

The most immediate and natural Bayesian interpretation of linear regression is to simply impose a prior on the weights $w$.

Assume that prior $w\sim\mathcal{N}(0,\sigma_p^2I)$ and likelihood $y_i\mid x_i,w\sim\mathcal{N}(w^\top x_i,\sigma_n^2).$ are both Gaussian, we will get the posterior distribution over the weights as:

As the above is a quadratic form in $w$, it follows that the posterior distribution is a Gaussian.
This also shows that Gaussians with known variance and linear likelihood are self-conjugate. A distribution is said to be self-conjugate (or a conjugate prior is self-conjugate) if, when used as a prior distribution, it results in a posterior distribution that belongs to the same family of distributions. It can be shown more generally that Gaussians with known variance are self-conjugate to any Gaussian likelihood. For general distributions the posterior will not be closed-form. This is a very special property of Gaussians.
We can compute the MAP estimate for the weights,

we can find that this is simply the MLE loss with an additional $l_2$ regularization term and this coincides with the optimization objective of ridge regression with weight decay $\lambda=\frac{\sigma_n^2}{\sigma_p^2}$. Also, recall that the MAP estimate corresponds to the mode of the posterior distribution, which in the case of a Gaussian is simply its mean.(The mode of a probability distribution is the value where the distribution reaches its maximum point. In the context of the posterior distribution, the mode corresponds to the most probable value of the parameter given the observed data.).

A commonly used alternative to ridge regression is the least absolute shrinkage and selection operator (or lasso), which uses $l_1$ regularization.It turns out that lasso can also be viewed as Bayesian learning, using a Laplace prior.

Note that using point estimates like the MAP estimate does not quantify uncertainty in the weights. The MAP estimate simply collapses all mass of the posterior around its mode.This is especially harmful when we are unsure about the best model.

Recursive Bayesian Updates

As data arrives online (i.e., in “real-time”), we can obtain the new posterior and use it to replace our prior.

Non-linear Regression

We can use linear regression not only to learn linear functions. The trick is to apply a non-linear transformation $\phi:\mathbb{R}^d\to\mathbb{R}^e$.
In Polynomial regression, to learn polynomials of degree m in d input dimensions, we need to apply the nonlinear transformation

the dimension of the feature space grows exponentially in the degree of polynomials and input dimensions. Even for relatively small m and d, this becomes completely unmanageable.

Function-space View

Previously, we have been interpreting it as a distribution over the weights $w$ of a linear function $\hat{f}=\boldsymbol{\Phi}w.$, we can equivalently consider a distribution directly over the estimated function values. Instead of considering a prior over the weights${w\sim\mathcal{N}}(0,\sigma_p^2I)$, we now impose a prior directly on the values of our model at the observations.Using that Gaussians are closed under linear maps, we obtain the equivalent prior:

K is the kernel matrix.The kernel function is:
$k(x,x^{\prime})\doteq\sigma_p^2\cdot\phi(x)^\top\phi(x^{\prime})$
The kernel matrix is a covariance matrix and the kernel function measures the covariance of the function values $k(x,x^{\prime})=\mathrm{Cov}\big[f(x),f(x^{\prime})\big].$
Moreover, note that we have reformulated the learning algorithm such that the feature space is now implicit in the choice of kernel, and the kernel is defined by inner products of (nonlinearly transformed) inputs. In other words, the choice of kernel implicitly determines the class of functions that f is sampled from, and encodes our prior beliefs. This is known as the kernel trick.

reference

[1] A. Krause, “Probabilistic Artificial Intelligence”.

Probabilistic Artificial Intelligence - Gaussian Process

Gaussian Process

Gaussian distribution

Univariate Gaussian Distribution is simple, here we focus on multivariate graussian distribution, where each random variable is distributed normally and their joint distribution is also Gaussian. The multivariate Gaussian distribution is defined by a mean vector $\mu$ and a covariance matrix $\Sigma$. The covariance matrix is always symmetric and positive semi-definite. why is Gaussian distribution so important? because under the assumptions of the central limit theorem, we can use it to model many events in the real world. Moreover, Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization. Being closed under conditioning and marginalization means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable. Conditioning is the cornerstone of Gaussian processes since it allows Bayesian inference.

Grassian process

what is GP

A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.
A Gaussian process is characterized by a mean function $\mu$ and a covariance function (or kernel function) k. Intuitively, a Gaussian process can be interpreted as a normal distribution over functions and is therefore often called an infinite-dimensional Gaussian.

Here’s an analogy: Consider a multivariate normal distribution over a set of points in 2D space. Each draw from this distribution corresponds to a vector of values, one for each point. Now, extend this idea to an infinite number of points, and you get a function. The Gaussian process is like having a normal distribution over all possible functions that could describe your data.

Mean and covariance functions

The prior mean function $m(⋅)$ describes the average function under the GP distribution before seeing any data. Therefore, it offers a straightforward way to incorporate prior knowledge about the function we wish to model. In the absence of this type of prior knowledge, a common choice is to set the prior mean function to zero, i.e., $m(⋅)≡0$.

The covariance function $k(x,x’)$ computes the covariance $cov[f(x),f(x′)]$ between the corresponding function values by evaluating the covariance function
k at the corresponding inputs x,x′(kernel trick ).Practically, the covariance function encodes structural assumptions about the class of functions we wish to model. These assumptions are generally at a high level and may include periodicity or differentiability. Practically, the covariance function encodes structural assumptions about the class of functions we wish to model. These assumptions are generally at a high level and may include periodicity or differentiability.

How GP works

For a given set of training points, there are potentially infinitely many functions that fit the data. Gaussian processes offer an elegant solution to this problem by assigning a probability to each of these functions. The goal of Gaussian processes is to learn this underlying distribution from training data. Respective to the test data X, we will denote the training data as Y. As we have mentioned before, the key idea of Gaussian processes is to model the underlying distribution of
X together with Y as a multivariate normal distribution. The essential idea of Bayesian inference is to update the current hypothesis as new information becomes available. In the case of Gaussian processes, this information is the training data. Thus, we are interested in the conditional probability
$P(X∣Y)$.

In Gaussian processes we treat each test point as a random variable. A multivariate Gaussian distribution has the same number of dimensions as the number of random variables. Since we want to predict the function values at $∣X∣=N$ test points, the corresponding multivariate Gaussian distribution is also
N -dimensional. Making a prediction using a Gaussian process ultimately boils down to drawing samples from this distribution. We then interpret the i-th component of the resulting vector as the function value corresponding to the i-th test point.

Marginal likelihood and GP training

A marginal likelihood is a likelihood function that has been integrated over the parameter space. In Bayesian statistics, it represents the probability of generating the observed sample from a prior and is therefore often referred to as model evidence or simply evidence.

The likelihood function represents the probability of observing the given data X given a specific set of parameter values $\theta$ in a statistical model. It expresses how well the parameters explain the observed data. The likelihood function is a key component in frequentist statistics. It is used to estimate the maximum likelihood estimates (MLE) of the parameters.

The marginal likelihood represents the probability of observing the data X without specifying a particular set of parameters. It is obtained by integrating (or summing) the likelihood function over all possible values of the parameters, weighted by the prior distribution of the parameters. The marginal likelihood is a key concept in Bayesian statistics. It serves as a normalizing constant, ensuring that the posterior distribution integrates (or sums) to 1. It is also used in Bayesian model comparison, where different models are compared based on their marginal likelihoods.

To train the GP, we maximize the marginal likelihood with respect to the GP hyperparameters,i.e., the parameters of the mean and covariance functions, which we summarize by $\theta$.

Maximizing the marginal likelihood behaves much better than finding maximum likelihood $\operatorname*{argmax}{f(\mathbf{X}),\sigma_n}p(\mathbf{y}|f(\mathbf{X}),\sigma_n^2\mathbf{I})$ or maximum a-posteriori point estimates $\mathop{\mathrm{argmax}}{f(\mathbf{X}),\sigma_n}p(\mathbf{y}|f(\mathbf{X}),\sigma_n^2\mathbf{I})p(f(\mathbf{X})|\theta)$.
These two approaches would lead to overfitting, since it is possible to get arbitrarily high likelihoods by placing the function values $f(X)$ on top of the observations y and letting the the noise $\sigma_n$ tend to zero. In contrast, the marginal likelihood does not fit function values directly, but integrates them out.By averaging (integrating out) the direct model parameters, i.e., the function values, the marginal likelihood automatically trades off data fit and model complexity.Choose a model that is too inflexible, and the marginal likelihood
$p(y∣X,θ)$ will be low because few functions in the prior fit the data. A model that is too flexible spreads its density over too many datasets, and so $p(y∣X,θ)$ will also be low.

what is kernel

If an algorithm is defined solely in terms of inner products in input space then it can be lifted into feature space by replacing occurrences of those inner products by k(x, x′); this is sometimes called the kernel trick. This technique is kernel trick particularly valuable in situations where it is more convenient to compute the kernel than the feature vectors themselves.

The kernel k, which is often also called covariance function, pairwise on all the points. The kernel receives two points
$t,t’ \in \mathbb{R}^n$ as an input and returns a similarity measure between those points in the form of a scalar.

We evaluate this function for each pairwise combination of the test points to retrieve the covariance matrix.

Kernels can be separated into stationary and non-stationary kernels. Stationary kernels, such as the RBF kernel or the periodic kernel, are functions invariant to translations, and the covariance of two points is only dependent on their relative position. Non-stationary kernels, such as the linear kernel, do not have this constraint and depend on an absolute location.

The kernel is used to define the entries of the covariance matrix. Consequently, the covariance matrix determines which type of functions from the space of all possible functions are more probable.

A kernel function coulde be stationary or isotropic. A kernel function is stationary if $k(x,x’)=k(x-x’)$. A kernel function is isotropic is $k(x,x’)=k(||x-x’||_2)$. Stationarity implies that the covariance function only depends on distances
$∥x−x’∥$ of the corresponding inputs, and not on the location of the individual data points. This means that if the inputs are close to each other, the corresponding function values are strongly correlated.

Interpretation of the hyperparameters

Stationary covariance functions typically contain the term $\frac\tau l=\frac{|\mathbf{x}-\mathbf{x}^{\prime}|}l$. where
$l$ is a lengthscale parameter. Longer lengthscales cause long-range correlations, whereas for short lengthscales, function values are strongly correlated only if their respective inputs are very close to each other. This allows functions to vary strongly and display more flexibility in the range of the data.

The signal variance parameter $\sigma_f^2$ allows us to say something about the amplitude of the function we model.

training tips

The marginal likelihood is non-convex with potentially multiple local optima. Therefore, we may end up in (bad) local optima when we choose a gradient-based optimization method. In order to initialize these parameters to reasonable values when we optimize the marginal likelihood, we need to align them with what we know about the data, either empirically or using prior knowledge. Assume, we have training inputs
X and training targets y. We will see that the signal and noise variances can be initialized using statistics of the training targets, whereas the lengthscale parameters can be initialized using statistics of the training inputs. A reasonable intialization that works well in practice is to set the signal variance to the empirical variance of the observed function values, and the noise variance to a smaller value.

Local optima are the largest problem that prevent good lengthscales from being selected through gradient-based optimisation. Generally, we can observe two different types of local optima:

Long lengthscale, large noise. Often the lengthscale is so long that the prior only allows nearly linear functions in the posterior. As a consequence, a large amount of noise is required to account for the residuals, leading to a small signal-to-noise ratio. This looks like underfitting, as non-linearities in the data are modelled as noise instead of being learned as part of the function.
Short lengthscale, low noise. Short lengthscales allow the posterior mean to fit to small variations in the data. Often such solutions are accompanied by small noise, and therefore a high signal-to-noise ratio. Such solutions look like they overfit, since the means fit the data by making drastic and fast changes, while generalizing poorly. However, the short lengthscale also prevents the predictive error bars from being small, so all predictions will be made with high uncertainty. In the probabilistic sense, this also looks like underfitting.

Which optimum we end up in, depends on the initialization of our lengthscale as we are likely to end up in a local optimum nearest to our initial choice. In both cases, the optimizer is more likely to get stuck in a local optimum if the situations are a somewhat plausible explanations of the data. In practice, it is harder to get out of a long lengthscale situation since the optimizer often struggles to get beyond the (typically) huge plateau that is typical for very long lengthscales.

How to choose a kernel

The choice of kernel (a.k.a. covariance function) determines almost all the generalization properties of a GP model.
​In fact, you might decide that choosing the kernel is one of the main difficulties in doing inference - and just as you don’t know what the true parameters are, you also don’t know what the true kernel is. Probably, you should try out a few different kernels at least, and compare their marginal likelihood on your training data.

others

The GP does not require any input normalization, but it can make sense to do so for numerical reasons.

reference

https://distill.pub/2019/visual-exploration-gaussian-processes/
https://www.cs.toronto.edu/~duvenaud/cookbook/ \
https://infallible-thompson-49de36.netlify.app/ \
A. Krause, “Probabilistic Artificial Intelligence”.

speech

signal

spectrogram: A spectrogram of a time signal is a special two-dimensional representation that displays time in its horizontal axis and frequency in its vertical axis.

short-time Fourier analysis

Why use it?
Some regions of speech signals shorter than 100 milliseconds often appear to be periodic, so that we can use the exact definition of Fourier
transform.

spectral leakage

This phenomenon is called spectral leakage because the amplitude of one harmonic leaks over the rest and masks its value.

feature extraction

Representation of speech signals in the frequency domain is especially useful because the frequency structure of a phoneme is generally unique.

Sinusoids are important because speech signals can be decomposed as sums of sinusoids.

For voiced sounds there is typically more energy at low frequencies
than at high frequencies, also called roll-off. To make the spectrograms easier to read, sometimes the signal is first preemphasized (typically with a first-order difference FIR filter) to boost the high frequencies
to counter the roll-off of natural speech.

Digital Systems

Linear Time-Invariant Systems and Linear Time-Varying Systems.

The Fourier Transform

Z-Transform

digital filter

filterbank

A filterbank is a collection of filters that span the whole frequency spectrum.

short-time analysis

Network Representation Learning

background

Recording studying during KTH. First blog about network representation learning, a project belongs to machine learning, advanced course.

LINE

Reproduce paper “LINE: Large-scale Information Network Embedding”.

Alias Table Method

It’s a method of effiently drawing samples from discrete distribution.
reference:
https://www.keithschwarz.com/darts-dice-coins/
https://blog.csdn.net/haolexiao/article/details/65157026

Negative Sampling

word2vec

Original paper:
Efficient estimation of word representations in vector space.
reference:
word2vec Explained: Deriving Mikolov et al.’s
Negative-Sampling Word-Embedding Method

Skip-Gram Model

Original papaer:Distributed Representations of Words and Phrases
and their Compositionality.
The idea behind the word2vec models is that the words that appear in the same context (near each other) should have similar word vectors. Therefore, we should consider some notion of similarity in our objective when training the model. This is done using the dot product since when vectors are similar, their dot product is larger.
reference:
https://www.baeldung.com/cs/nlps-word2vec-negative-sampling

graphSage