# Gaussian process

## Preface

This book gives an introduction to Gaussian processes and shows their various, but not complete, applications. The introduction is aimed for users who want to apply the technique to solve practical engineering problems. With application examples, it shows how Gaussian processes can be used for machine learning to infer from known to unknown situations. The book serves as a reference for analytical Gaussian processes, computational operations with Gaussian processes, and solutions for special constraints that may arise in practical applications.

## Introduction

In probability theory, a Gaussian process is a stochastic process, i.e. a collection of random variables indexed by coordinates, with the property that every finite subset is multivariate normally distributed (or Gaussian-distributed). A Gaussian process represents a function whose function values can only be modeled with uncertainties and probabilities due to a lack of information. It is constructed from appropriate functions of mean values, variances and covariances that describe the function values as a continuum of correlated random variables in the form of an infinite-dimensional normal distribution. Thus, a Gaussian process is a probability distribution of functions. A sample drawn from it yields a random function with some specified and preferred properties.

### Applications

Gaussian processes are used for mathematical modeling of the behavior of non-deterministic systems on the basis of stochastic quantities or observations. Gaussian processes are suitable for signal analysis and synthesis, form a powerful tool in the interpolation, extrapolation, or smoothing of arbitrarily dimensional discrete measurement points (Gaussian process regression or kriging), and find application in classification problems. Gaussian processes, which are related to kernel methods,[1] can be used as a supervised machine learning technique for abstract modeling based on training examples. This Bayesian approach to machine learning has the advantage that it often does not require iterative training as neural networks do. Instead, Gaussian processes can be derived very efficiently with linear algebra from statistical quantities of the examples and are mathematically clearly interpretable and well controllable. Moreover, for interpolations and predictions, an associated confidence interval is computed for each individual output value, which accurately estimates its own prediction error, while correctly accounting for error propagation when the variance of the input values is known.

## Mathematical Description

### Definition

A Gaussian process is a special type of stochastic process ${\displaystyle (X_{t}){t\in T}}$  on any index set ${\displaystyle T}$ , if its finite-dimensional distributions are multivariate normal distributions (also Gaussian distributions) for all ${\displaystyle t_{1},t_{2},\dotsc ,t_{n}\in T}$ . That is, the multivariate distribution of ${\displaystyle (X{t_{1}},X_{t_{2}},\dotsc ,X_{t_{n}})}$  is given by an n-dimensional normal distribution.

Term: Although the term Gaussian process suggests temporal or sequential processes, this restriction does not exist. More suggestive would be the term Gaussian continuum in the sense of any continuum of correlated Gaussian distributions.

### Notation

In analogy to the one- and multidimensional Gaussian distribution, a Gaussian process is completely and uniquely determined by its first two moments. In the multidimensional Gaussian distribution, these are the expected value vector or mean vector ${\displaystyle {\vec {\mu }}}$  and the covariance matrix ${\displaystyle \sigma }$ . In the Gaussian process, these are replaced by an expected value function or mean function ${\displaystyle m(t):=\mathbb {E} (X_{t})}$  and a covariance function ${\displaystyle k(t,t'):=\operatorname {Cov} (X_{t},X_{t'})}$  ${\displaystyle :=\mathbb {E} \left[(X_{t}-\mathbb {E} (X_{t}))\cdot (X_{t'}-\mathbb {E} (X_{t'}))\right]}$ . These functions can be understood in the simplest one-dimensional case as a vector with continuous rows and as a matrix with continuous rows and columns. The following table compares one-dimensional and multidimensional Gaussian distributions with Gaussian processes. The tilde symbol ${\displaystyle \sim }$  can be read as "is distributed as".

Distribution type Notation Variables Probability density function
Univerate normal distribution ${\displaystyle X\sim {\mathcal {N}}(\mu ,\sigma ^{2})}$  ${\displaystyle X,\mu ,\sigma \in \mathbb {R} }$  ${\displaystyle p(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}\exp {\bigl \lbrace }-{\tfrac {1}{2}}(x-\mu )^{2}/{\sigma ^{2}}{\bigr \rbrace }}$
Multivariate normal distribution ${\displaystyle {\vec {X}}\sim {\mathcal {N}}_{n}({\vec {\mu }},\Sigma )}$  ${\displaystyle {\vec {X}},{\vec {\mu }}\in \mathbb {R} ^{n};\Sigma \in \mathbb {R} ^{n\times n}}$  ${\displaystyle p({\vec {x}})={\frac {1}{(2\pi )^{\frac {n}{2}}|\Sigma |^{\frac {1}{2}}}}\exp {\bigl \lbrace }-{\tfrac {1}{2}}({\vec {x}}-{\vec {\mu }})^{T}\Sigma ^{-1}({\vec {x}}-{\vec {\mu }}){\bigr \rbrace }}$
Gaussian process ${\displaystyle (X_{t})\sim {\mathcal {GP}}(m,k)}$  ${\displaystyle m\colon T\to \mathbb {R} }$
${\displaystyle k\colon T\times T\to \mathbb {R} }$
(no analytical representation)

The probability density function of a Gaussian process cannot be represented analytically because there is no corresponding notation for operations with continuous matrices. This gives the impression that one cannot perform computations with Gaussian processes in the same way as with finite-dimensional normal distributions. However, the essential property of the Gaussian process is not the infinity of the dimensions, but rather the assignment of the dimensions to the coordinates of a function. In practical applications, one always has to deal with a finite number of interpolation points and can therefore perform all calculations as in the finite-dimensional case. The limit of infinitely many dimensions is needed only in an intermediate step, namely if values are to be read out at new interpolated grid points. In this intermediate step, the Gaussian process, i.e. the mean function and covariance function, is represented or approximated by suitable analytical expressions. Here, the assignment to the grid points is done via the parameterized coordinates ${\displaystyle t}$  in the analytical expression. In the finite-dimensional case with discrete grid points the associated coordinates ${\displaystyle t_{i}}$  are assigned to the dimensions by their indices.

### Example of a Gaussian process

As a simple example, consider a Gaussian process

${\displaystyle (X_{t})\sim {\mathcal {GP}}(m,k)}$

with a scalar variable ${\displaystyle t}$  (time), given by the mean function

${\displaystyle m(t)=5\,{\text{Volt}}}$

and covariance function

${\displaystyle k(t,t')={\begin{cases}(1\,{\text{Volt}})^{2}&t=t'\\0&t\neq t'\end{cases}}}$

This Gaussian process describes an endless temporal electrical signal with Gaussian white noise with a standard deviation of one volt centered around a mean voltage of 5 volts.

### Definitions of special properties

A Gaussian process is called centered if its expected value or mean is constantly 0, that is, if ${\displaystyle m(t):=\mathbb {E} (X_{t})=0}$  for all ${\displaystyle t\in T}$ .

A Gaussian process is called stationary if its covariance function ${\displaystyle k(t,t'):=\operatorname {Cov} (X_{t},X_{t'})}$  is translation invariant, that is, it can be described by a relative function ${\displaystyle k(t,t')=k(t-t')}$ .

A Gaussian process with isotropic properties is called radial if its covariance function can be described by a radial symmetric and stationary function ${\displaystyle k(t,t')=k(|t-t'|)}$  with a one-dimensional parameter using the Euclidean norm ${\displaystyle |\cdot |}$ .

### List of Common Gaussian Processes and Covariance Functions

• Constant: ${\displaystyle m(t)=0}$  and ${\displaystyle k(t,t')=\sigma ^{2}}$
Corresponds to a constant value from a Gaussian distribution with standard deviation ${\displaystyle \sigma }$ .
• Offset: ${\displaystyle m(t)=c}$  and ${\displaystyle k(t,t')=0}$
Corresponds to a constant value given by ${\displaystyle c}$ .
• Gaussian White noise: ${\displaystyle k(t,t')=\sigma ^{2}\delta _{t,t'}}$
(${\displaystyle \sigma }$ : standard deviation, ${\displaystyle \delta }$ : Kronecker delta)
• Rational quadratic: ${\displaystyle k(r)=(1+r^{2})^{-\alpha },\quad \alpha \geq 0}$
• Gamma-exponential: ${\displaystyle k(r)=\exp \left(-\left({\frac {r}{\ell }}\right)^{\gamma }\right)}$
• Ornstein-Uhlenbeck / Gauss Markov: ${\displaystyle k(r)=\exp \left(-{\frac {r}{\ell }}\right)}$
Describes continuous, non-differentiable functions, as well as white noise after passing through an RC low-pass filter.
• Squared exponential: ${\displaystyle k(r)=\exp {\Big (}-{\frac {r^{2}}{2\ell ^{2}}}{\Big )}}$
Describes infinitely smooth differentiable functions.
• Matérn:
${\displaystyle k_{\nu =p+1/2}(r)=\exp \left(-{\frac {{\sqrt {2\nu }}r}{\ell }}\right){\frac {\Gamma (p+1)}{\Gamma (2p+1)}}\sum _{i=0}^{p}{\frac {(p+i)!}{i!(p-i)!}}\left({\frac {{\sqrt {8\nu }}r}{\ell }}\right)^{p-i}}$
A highly versatile Gaussian process used to describe most typical measurement curves. The functions of the Gaussian process are ${\displaystyle n}$  times continuously differentiable if ${\displaystyle \nu >n}$ . Covariance functions with ${\displaystyle \nu =1/2}$ , ${\displaystyle 3/2}$ , ${\displaystyle 5/2}$ , etc. correspond to white noise that has passed through 1, 2, or 3 RC low-pass filters or has been convolved with the function ${\displaystyle \exp \left(-|x|\right)}$ . Common special cases include:
${\displaystyle k_{\nu =3/2}(r)=\left(1+{\frac {{\sqrt {3}}r}{\ell }}\right)\exp \left(-{\frac {{\sqrt {3}}r}{\ell }}\right)}$
${\displaystyle k_{\nu =5/2}(r)=\left(1+{\frac {{\sqrt {5}}r}{\ell }}+{\frac {5r^{2}}{3\ell ^{2}}}\right)\exp \left(-{\frac {{\sqrt {5}}r}{\ell }}\right)}$
${\displaystyle k_{\nu =1/2}(r)}$  corresponds to the Ornstein-Uhlenbeck covariance function, and ${\displaystyle k_{\nu \rightarrow \infty }(r)}$  corresponds to the squared exponential function.
• Periodic: ${\displaystyle k(r)=\exp \left(-{\frac {2\sin ^{2}\left(\pi {\frac {r}{T}}\right)}{\ell ^{2}}}\right)}$
Functions from this Gaussian process are both periodic with period ${\displaystyle T}$  and smooth (squared exponential). If the square around the sine is replaced by the absolute value, non-smooth periodic functions result.
• Polynomial: ${\displaystyle k(t,t')=\left(t^{\top }t'+\sigma _{0}^{2}\right)^{p}}$
Grows rapidly outward and is usually a poor choice for regression problems, but can be useful in high-dimensional classification problems. It is positive semidefinite and does not necessarily generate invertible covariance matrices.[2]
• Brownian bridge: ${\displaystyle m(t)=0}$  and ${\displaystyle k(t,t')=\min(t,t')-tt'}$
• Wiener process: ${\displaystyle m(t)=0}$  and ${\displaystyle k(t,t')=\min(t,t')}$
Corresponds to the Brownian motion or integral over Gaussian white noise.
• Itō process: If ${\displaystyle T=\mathbb {R} _{+}}$  and ${\displaystyle f}$ , ${\displaystyle g}$  are two integrable real-valued functions and ${\displaystyle (W_{t})}$  is a Wiener process, then the Ito process
${\displaystyle X_{t}=\int _{0}^{t}f(s)\,\mathrm {d} s+\int _{0}^{t}g(s)\,\mathrm {d} W_{s}}$
is a Gaussian process with ${\displaystyle m(t)=\int _{0}^{t}f(s)\,\mathrm {d} s}$  and ${\displaystyle k(t,t')=\int _{0}^{\min(t,{t'})}g^{2}(s)\,\mathrm {d} s}$ .

Remarks:

• ${\displaystyle r:=\|t-t'\|}$  is the distance for stationary and radial covariance functions ${\displaystyle k(t,t')=k(r)}$ .
• ${\displaystyle \ell }$  is the characteristic length scale of the covariance function where the correlation has decayed to about ${\displaystyle e^{-1}}$ .
• Most stationary covariance functions ${\displaystyle k(r)}$  are normalized to ${\displaystyle k(0)=1}$  and are therefore equivalent to correlation functions. For use as covariance functions, they are multiplied by a variance ${\displaystyle \sigma ^{2}}$ , which assigns the variables a scaling and/or physical unit.
• Covariance functions cannot be arbitrary functions ${\displaystyle k(r)}$  or ${\displaystyle k(t,t')}$ , as it must be ensured that they are positive definite.[3] Positive semidefinite functions are also valid covariance functions, but it should be noted that these do not necessarily result in invertible covariance matrices and are therefore usually combined with a positive definite function.

## Mathematical operations with Gaussian processes

With Gaussian processes (or in general normal distributions) some stochastic operations can be performed by which different signals or functions can be combined or extracted from each other. In the following operations, Gaussian processes are represented in vector and matrix notation for finitely many support points ${\displaystyle y\sim {\mathcal {N}}\left(\mu ,\Sigma \right)}$ , which is analogously applicable to mean functions ${\displaystyle m(t)}$  and covariance functions ${\displaystyle k(t,t')}$ .

### Linear transformation

If the sum of two independent uncorrelated signals is built, then their mean functions and their covariance functions add up:

${\displaystyle y_{1}+y_{2}\sim {\mathcal {N}}\left(\mu _{1},\Sigma _{1}\right)+{\mathcal {N}}\left(\mu _{2},\Sigma _{2}\right)={\mathcal {N}}\left(\mu _{1}+\mu _{2},\Sigma _{1}+\Sigma _{2}\right)}$ .

The associated probability density functions thereby undergo a convolution.

For two completely correlated signals, the sum can be expressed by a scalar multiplication. If both signals are identical, the result is ${\displaystyle y+y=2y\sim {\mathcal {N}}\left(2\mu ,4\Sigma \right)}$ .

#### Difference: uncorrelated signals

If the difference of two independent uncorrelated signals is built, then their mean functions are subtracting while their covariance functions are adding:

${\displaystyle y_{1}-y_{2}\sim {\mathcal {N}}\left(\mu _{1},\Sigma _{1}\right)-{\mathcal {N}}\left(\mu _{2},\Sigma _{2}\right)={\mathcal {N}}\left(\mu _{1}-\mu _{2},\Sigma _{1}+\Sigma _{2}\right)}$ .

#### Subtraction of a Correlated Component

If the signal y2 of a Gaussian process describes a correlated additive component of the signal y1 of another Gaussian process, then subtracting this component results in the subtraction of the mean function and covariance function:

${\displaystyle y_{1}-y_{2}\sim {\mathcal {N}}\left(\mu _{1},\Sigma _{1}\right)\setminus {\mathcal {N}}\left(\mu _{2},\Sigma _{2}\right)={\mathcal {N}}\left(\mu _{1}-\mu _{2},\Sigma _{1}-\Sigma _{2}\right)}$

The backslash operator ${\displaystyle \setminus }$  was symbolically used here in the sense of "without the contained component".

#### Multiplication

The following multiplication with an arbitrary matrix ${\displaystyle F}$  also includes the special cases of the product with a function (diagonal matrix ${\displaystyle F}$ ) or with a scalar (${\displaystyle F=c\cdot \mathbb {I} }$ ):

${\displaystyle Fy\sim F\cdot {\mathcal {N}}\left(\mu ,\Sigma \right)={\mathcal {N}}\left(F\mu ,F\Sigma F^{\top }\right)}$

It should be noted here that the product of the functions of two Gaussian processes with each other would not result in another Gaussian process, since the resulting probability distribution would have lost the property of being Gaussian.

#### General linear transformation

All previously shown operations are special cases of the general linear transformation:

${\displaystyle A\cdot {\mathcal {N}}\left(\mu _{1},\Sigma _{1}\right)+B\cdot {\mathcal {N}}\left(\mu _{2},\Sigma _{2}\right)={\mathcal {N}}\left(A\mu _{1}+B\mu _{2},A\Sigma _{1}A^{\top }+B\Sigma _{2}B^{\top }+A\Sigma _{12}B^{\top }+B\Sigma _{12}^{\top }A^{\top }\right)}$

This relation[4] describes the sum ${\displaystyle A\cdot y_{1}+B\cdot y_{2}}$  with constant matrices ${\displaystyle A}$  and ${\displaystyle B}$  and the support point vectors ${\displaystyle y_{1}}$  and ${\displaystyle y_{2}}$  of the signals of two Gaussian processes with ${\displaystyle y_{1}\sim {\mathcal {N}}\left(\mu _{1},\Sigma _{1}\right)}$  and ${\displaystyle y_{2}\sim {\mathcal {N}}\left(\mu _{2},\Sigma _{2}\right)}$ . For partially correlated signals ${\displaystyle y_{1}}$  and ${\displaystyle y_{2}}$ , the cross-covariance matrix ${\displaystyle \Sigma _{12}}$  must also be given. The sum ${\displaystyle A\cdot y_{1}+B\cdot y_{2}}$  is correlated with ${\displaystyle y_{1}}$  by the cross-covariance matrix ${\displaystyle A\Sigma _{1}+B\Sigma _{12}}$  and with ${\displaystyle y_{2}}$  by ${\displaystyle A\Sigma _{12}+B\Sigma _{2}}$ .[5] A cross-covariance matrix ${\displaystyle \Sigma _{XY}}$  between two signals ${\displaystyle X}$  and ${\displaystyle Y}$  can be converted into a cross-correlation matrix ${\displaystyle C_{XY}}$  using their covariance matrices ${\displaystyle \Sigma _{X}}$  and ${\displaystyle \Sigma _{Y}}$  through the relation ${\displaystyle \left[C_{XY}\right]{ij}=\left[\Sigma {XY}\right]{ij}/{\sqrt {\left[\Sigma _{X}\right]{ii}\left[\Sigma _{Y}\right]_{jj}}}}$ .

### Fusion

If the same unknown function is described by two different Gaussian processes with uncorrelated errors to each other, then a union or fusion of the two pieces of information can be formed to achieve a reduction of the error:

${\displaystyle \Sigma _{\text{Fusion}}=\left(\Sigma _{1}^{-1}+\Sigma _{2}^{-1}\right)^{-1}}$
${\displaystyle \mu _{\text{Fusion}}=\Sigma _{\text{Fusion}}\Sigma _{1}^{-1}\mu _{1}+\Sigma _{\text{Fusion}}\Sigma _{2}^{-1}\mu _{2}}$

The result corresponds to the overlap or the normalized product of the two probability density functions and describes the most likely Gaussian process taking into account both parts of information (see also Inverse-variance weighting). The expressions can also be rearranged,[6] such that only one matrix inversion needs to be performed:

${\displaystyle \mu _{\text{Fusion}}=\mu _{1}-\Sigma _{1}\left(\Sigma _{1}+\Sigma _{2}\right)^{-1}\left(\mu _{1}-\mu _{2}\right)=\Sigma _{2}\left(\Sigma _{1}+\Sigma _{2}\right)^{-1}\mu _{1}+\Sigma _{1}\left(\Sigma _{1}+\Sigma _{2}\right)^{-1}\mu _{2}}$
${\displaystyle \Sigma _{\text{Fusion}}=\Sigma _{1}-\Sigma _{1}\left(\Sigma _{1}+\Sigma _{2}\right)^{-1}\Sigma _{1}=\Sigma _{1}\left(\Sigma _{1}+\Sigma _{2}\right)^{-1}\Sigma _{2}}$

The validity of the formula requires signal pairs with entirely uncorrelated errors. However, if there is partial correlation with cross-covariance ${\displaystyle \Sigma _{12}}$ , then the extended formula of Bar-Shalom and Campo applies,[7] where the correlated part is temporarily subtracted and then added back after fusion:

${\displaystyle \mu _{\text{Fusion}}=\mu _{1}-(\Sigma _{1}-\Sigma _{12})(\Sigma _{1}+\Sigma _{2}-\Sigma _{12}-\Sigma _{21})^{-1}(\mu _{1}-\mu _{2})}$
${\displaystyle \Sigma _{\text{Fusion}}=\Sigma _{1}-(\Sigma _{1}-\Sigma _{12})(\Sigma _{1}+\Sigma _{2}-\Sigma _{12}-\Sigma _{21})^{-1}(\Sigma _{1}-\Sigma _{21})}$

### Decomposition

A given signal ${\displaystyle y_{\text{sum}}}$  can be approximately decomposed into its additive components if the prior Gaussian processes of the components and the entire signal are given. According to the addition rule, the Gaussian process of the entire signal is composed of the prior Gaussian processes of the components:

${\displaystyle \mu _{\text{sum}}=\mu _{1}+\ldots +\mu _{n}}$
${\displaystyle \Sigma _{\text{sum}}=\Sigma _{1}+\ldots +\Sigma _{n}}$

The individual components ${\displaystyle y_{i}}$  can then be estimated using the posterior Gaussian processes

${\displaystyle \mu _{{\text{post,}}i}=\mu _{i}+\Sigma _{i}\Sigma _{\text{sum}}^{-1}\left(y_{\text{sum}}-\mu _{\text{sum}}\right)}$
${\displaystyle \Sigma _{{\text{post,}}i}=\Sigma _{i}-\Sigma _{i}\Sigma _{\text{sum}}^{-1}\Sigma _{i}^{\top }}$
and cross-covariances between the signals ${\displaystyle \Sigma _{{\text{post,}}i,j}=-\Sigma _{i}\Sigma _{\text{sum}}^{-1}\Sigma _{j}^{\top }}$ .

Apart from very specific cases, this decomposition is ambiguous. Therefore, the components are coupled probability distributions of possible solutions around the most likely solutions (see also Example: Signal Decomposition).

## Gaussian process regression

### Introduction

Gaussian processes can be used to interpolate, extrapolate, or smooth discrete measurement data of a mapping ${\displaystyle \mathbb {R} ^{n}\to \mathbb {R} }$ . This application of Gaussian processes is called Gaussian process regression. The method is often called kriging for historical reasons, especially in the spatial domain. It is particularly suitable for problems for which no specific model function is known. Its property as a machine learning method allows automatic model building based on observations. In this application, a Gaussian process captures the typical behavior of the system, which can be used to derive the optimal interpolation for the problem. The result is a probability distribution of possible interpolation functions and the solution with the highest probability.

### Overview of the individual steps

The calculation of a Gaussian process regression can be performed by the following steps:

1. Prior mean function: If there is a consistent trend in the measured values, a prior mean function is constructed to equalize the trend.
2. Prior covariance function: The covariance function is selected according to certain qualitative properties of the system or composed from covariance functions of different properties according to certain rules.
3. Fine-tuning of parameters: to obtain quantitatively correct covariances, the selected covariance function is adjusted to the available measured values either targeted or by an optimization procedure until the covariance function reflects the empirical covariances.
4. Conditional distribution: By considering known measured values, the conditional posterior Gaussian process is calculated from the prior Gaussian process for new support points with still unknown values.
5. Interpretation: Finally, from the posterior Gaussian process, the mean function is taken as the best possible interpolation and, if required, the diagonal of the covariance function is taken as the location-dependent variance.

### Step 2: Prior covariance function

In practical applications, a Gaussian process must be determined from finitely many discrete measured values or finitely many sample curves. In analogy to the one-dimensional Gaussian distribution, which is completely determined by the mean and standard deviation of discrete measured values, one would expect several single but complete functions ${\displaystyle f_{i}(t)}$  in order to calculate the mean function

${\displaystyle m(t)={\frac {1}{N}}\sum _{i=1}^{N}f_{i}(t)}$

and the (empirical) covariance function

${\displaystyle k(t,t')={\frac {1}{N-1}}\sum _{i=1}^{N}\left[f_{i}(t)-m(t)\right]\cdot \left[f_{i}(t')-m(t')\right]}$ .

#### Regression problem and stationary covariance

Often, however, no such distribution of exemplary functions is available. In the regression problem instead only discrete interpolation points of a single function are known, which are to be interpolated or smoothed. Also in such a case a Gaussian process can be determined. For this purpose, instead of this single function, a set of many copies of the function shifted to each other is considered. This distribution can now be described with the help of a covariance function. Usually it can be expressed as a relative function of this shift by ${\displaystyle k(t,t')=k(t'-t)}$ . It is then called stationary covariance function and applies equally to all locations of the function and describes the everywhere equal (thus stationary) correlation of each point to its neighborhood, as well as the correlation of neighboring points among each other.

The covariance function is represented analytically and determined heuristically or looked up in the literature. The free parameters of the analytical covariance functions are fitted to the measured values. Many physical systems have a similar form of the stationary covariance function, so that with a few tabulated analytical covariance functions most applications can be described. For example, there are covariance functions for abstract properties such as smoothness, roughness (lack of differentiability), periodicity or noise, which can be combined and fitted according to certain rules to reproduce the properties of the measured values.

#### Examples of stationary covariance

The following table shows examples of covariance functions with such abstract properties. The example curves are random samples of the respective Gaussian process and represent typical function shapes. They were generated with the corresponding covariance matrix ${\displaystyle \Sigma _{ij}=k(t_{i},t_{j})}$  and a random generator for multidimensional normal distributions as correlated random vector. The stationary covariance functions ${\displaystyle k(t,t')}$  are abbreviated here as one-dimensional functions ${\displaystyle k(r)}$  with ${\displaystyle r:=|t-t'|}$ .

Properties Examples of stationary covariance functions Random functions ${\displaystyle f(t)}$
Constant ${\displaystyle k(r)=1}$
Smooth ${\displaystyle k(r)=\exp \left(-r^{2}/5\right)}$
Rough ${\displaystyle k(r)=\exp \left(-r/15\right)}$
Periodic ${\displaystyle k(r)=\exp \left(-\left|\sin \left(0{,}4\pi r\right)\right|/2.5\right)}$
Noise {\displaystyle {\begin{aligned}k(r)={\begin{cases}0.2:&r=0\\0:&r\neq 0\end{cases}}\end{aligned}}}
Mixed
(periodic,
smooth and
noisy)
{\displaystyle {\begin{aligned}k(r)=&\,\exp \left(-\sin ^{2}({\tfrac {\pi }{2}}r)/4\right.\\&\left.-r^{2}/40\right)+{\begin{cases}0.005:&r=0\\0:&r\neq 0\end{cases}}\end{aligned}}}

#### Construction of new covariance functions

The properties can be combined according to certain computational rules. The basic goal in constructing a covariance function is to reproduce the true covariances as precisely as possible, while at the same time satisfying the condition of positive definiteness. The examples shown, except for the constant, have the latter property, and the additions and multiplications of such functions also remain positive definite. The constant covariance function is only positive semidefinite and must be combined with at least one positive definite function. The lowest covariance function in the table shows a possible mixture of different properties. The functions in this example are periodic over a certain distance, have a relatively smooth behavior and are overlaid with a certain measurement noise.

For mixed properties, the following rules applies:[8]

• In the case of additive effects, the covariances are added, as for example in the superposition of measurement noise.
• For reinforcing or mitigating effects to each other, the covariances are multiplied, such as in case of the slow decay of periodicity.

#### Multidimensional functions

What is shown here with one-dimensional functions can be transferred analogously also to multi-dimensional systems, by simply replacing the distance ${\displaystyle r}$  by a corresponding n-dimensional distance norm. The support points in the higher dimensions are unrolled in an arbitrary order and represented by vectors, so that they can be processed in the same way as in the one-dimensional case. The following two figures show two examples with two-dimensional Gaussian processes and different stationary and radial covariance functions. In the respective right figure a random draw of the Gaussian process is shown.

#### Non-stationary covariance functions

Gaussian processes can also have non-stationary properties of the covariance function, that is, relative covariance functions that change as a function of location. The literature describes how nonstationary covariance functions can be constructed so that positive definiteness is ensured here as well. A simple possibility is, for example, an interpolation of different covariance functions over the location with the inverse distance weighting.

### Step 3: Fine tuning of parameters

The qualitatively constructed covariance functions contain parameters, called hyperparameters, which must be tuned to the system in order to obtain quantitatively correct results. This can be done by direct knowledge about the system, e.g., the known value of the standard deviation of the measurement noise or the prior standard deviation of the overall system (sigma prior, the square corresponds to the diagonal elements of the covariance matrix).

However, the parameters can also be adjusted automatically. For this purpose, one uses the marginal likelihood, i.e., the probability density for a given measured curve as a metric for the agreement between the assumed Gaussian process and the existing measured curve. The parameters are then optimized to maximize this agreement. Since the exponential function is strictly monotone, it is sufficient to maximize the exponent of the probability density function, the so-called log-marginal likelihood function[9]

${\displaystyle \log p(\mathbf {y} )=-{\frac {1}{2}}\mathbf {y} ^{\top }\Sigma ^{-1}\mathbf {y} -{\frac {1}{2}}\log |\Sigma |-{\frac {n}{2}}\log(2\pi )}$

with the measurement vector ${\displaystyle \mathbf {y} }$  of length ${\displaystyle n}$  and the hyperparameter-dependent covariance matrix ${\displaystyle \Sigma }$ . Mathematically, maximizing the marginal likelihood causes an optimal tradeoff between accuracy (minimizing the residuals) and simplicity of the theory. A simple theory is characterized by large non-diagonal elements, describing a high correlation in the system. This means that there are few degrees of freedom in the system and thus, in some sense, the theory can cope with few rules to explain all correlations. If these rules are chosen too simple, the measurements would not be reproduced sufficiently well and the residual errors grow too much. At a maximum marginal likelihood, the equilibrium of an optimal theory is found, provided that sufficiently many measurement data were available for good conditioning. This implicit property of the maximum likelihood estimation can also be understood as Ockham's parsimony principle.

### Step 4: Conditional Gaussian process with known support points

If the Gaussian process of a system has been determined as described above, i.e. if the prior mean function and covariance function are known, a prediction of arbitrary interpolated intermediate values can be computed with the Gaussian process, when only a few support points of the desired function are known by measurements. The prediction is done by the conditional probability of a multidimensional Gaussian distribution given a partial information. The dimensions of the multidimensional Gaussian distribution

${\displaystyle {X}={\binom {{X}_{\text{U}}}{{X}_{\text{K}}}}\sim {\mathcal {N}}\left({\binom {{\mu }_{\text{U}}}{{\mu }_{\text{K}}}},{\begin{pmatrix}{\Sigma }_{\text{UU}}&{\Sigma }_{\text{UK}}\\{\Sigma }_{\text{KU}}&{\Sigma }_{\text{KK}}\end{pmatrix}}\right)}$

are divided into unknown values to be predicted (index U for unknown) and known measured values (index K for known). Vectors thereby decompose into two parts. The covariance matrix is accordingly divided into four blocks: Covariances within the unknown values (UU), within the known measured values (KK) and covariances between the unknown and known values (UK and KU). The values of the covariance matrix are taken at discrete points of the covariance function and the mean vector at corresponding points of the mean function: ${\displaystyle \Sigma _{ij}=k(t_{i},t_{j})}$  or ${\displaystyle \mu _{i}=m(t_{i})}$ .

By considering the known measured values ${\displaystyle X_{\text{K}}}$ , the distribution changes to the conditional or posterior normal distribution

${\displaystyle X_{\text{U}}\mid X_{\text{K}}\sim {\mathcal {N}}\left(\mu _{\text{U}}+\Sigma _{\text{UK}}\Sigma _{\text{KK}}^{-1}(X_{\text{K}}-\mu _{\text{K}}),\Sigma _{\text{UU}}-\Sigma _{\text{UK}}\Sigma _{\text{KK}}^{-1}\Sigma _{\text{KU}}\right)}$ ,

where ${\displaystyle X_{\text{U}}}$  are the unknown variables to be determined. The notation ${\displaystyle \mid X_{\text{K}}}$  reads as "given ${\displaystyle X_{\text{K}}}$ ", which means under the condition that ${\displaystyle X_{\text{K}}}$  is given.

The first parameter of the resulting Gaussian distribution describes the new mean vector we are looking for, which now corresponds to the most likely function values of the interpolation. In addition, the entire predicted new covariance matrix is given in the second parameter. In particular, this contains the confidence intervals of the predicted mean values, given by the root of the main diagonal elements.

#### Measurement noise and other interfering signals

White measurement noise of variance ${\displaystyle \sigma _{\text{noise}}^{2}}$  can be modeled as part of the prior covariance model by adding appropriate terms to the diagonal of ${\displaystyle \Sigma _{\text{KK}}}$ . If the same covariance function is also used to form the matrix ${\displaystyle \Sigma _{\text{UU}}}$ , the predicted distribution would also describe a white noise of variance ${\displaystyle \sigma _{\text{noise}}^{2}}$ . To obtain a prediction of an noise free signal, in the posterior distribution

${\displaystyle X_{\text{U}}\mid X_{\text{K}}\sim {\mathcal {N}}\left(\mu _{\text{U}}+\Sigma _{\text{UK}}\left[\Sigma _{\text{KK}}+\mathbb {I} \sigma _{\text{noise}}^{2}\right]^{-1}(X_{\text{K}}-\mu _{\text{K}}),\Sigma _{\text{UU}}-\Sigma _{\text{UK}}\left[\Sigma _{\text{KK}}+\mathbb {I} \sigma _{\text{noise}}^{2}\right]^{-1}\Sigma _{\text{KU}}\right)}$

the corresponding terms are omitted at ${\displaystyle \Sigma _{\text{UU}}}$  and if applicable in ${\displaystyle \Sigma _{\text{UK}}}$  and ${\displaystyle \Sigma _{\text{KU}}}$ . This averages out the measurement noise as good as possible, which is also correctly accounted for in the predicted confidence interval. In the same way, any unwanted additive noise signal can be removed from the measurement data (see also arithmetic operation decomposition), provided that it can be described by a covariance function and is sufficiently well distinguishable from the desired signal component. For this purpose, instead of the diagonal matrix ${\displaystyle \mathbb {I} \sigma _{\text{noise}}^{2}}$ , the corresponding covariance matrix of the interference ${\displaystyle \Sigma _{\text{noise}}}$  is used. Measurements with noisy signals thus require two covariance models: ${\displaystyle k(t,t')}$  for the desired signal component to be estimated and ${\displaystyle k(t,t')+k_{\text{noise}}(t,t')}$  for the raw signal.

#### Derivation of the conditional distribution

The derivation can be done via the Bayes formula by substituting the two probability densities for known and unknown support points and the composite probability density. The resulting conditional posterior normal distribution corresponds to the overlap or intersection of the Gaussian distribution with the subvector space spanned by the known values.

For noisy measurements that are themselves a multidimensional normal distribution, the overlap to the prior distribution is obtained by multiplying the two probability densities. The product of the probability densities of two multidimensional normal distributions corresponds to the arithmetic operation Fusion, which can be used to derive the distribution where the noise is suppressed.

#### Posterior Gaussian process

In the full notation as a Gaussian process, the posterior Gaussian process yields

${\displaystyle (X_{t})\sim {\mathcal {GP}}(m,k)}$

and the ${\displaystyle n}$  known measurements ${\displaystyle \mathbf {x} =(x_{1},x_{2},\ldots ,x_{n})}$  at the coordinates ${\displaystyle \mathbf {t} =(t_{1},t_{2},\ldots ,t_{n})}$  a new distribution, given by the conditional posterior Gaussian process

${\displaystyle (X_{t}\mid \mathbf {t} ,\mathbf {x} )\sim {\mathcal {GP}}\left(m_{\mathrm {post} },k_{\mathrm {post} }\right)}$

Here, ${\displaystyle K}$  is a covariance matrix obtained by evaluating the covariance function ${\displaystyle k}$  at discrete rows ${\displaystyle t_{i}}$  and columns ${\displaystyle t_{j}}$ . Moreover, ${\displaystyle \mathbf {k} }$  was appropriately formed as a vector of functions by evaluating ${\displaystyle k}$  only at discrete rows or only at discrete columns.

In practical numerical calculations with finite numbers of support points, only the equation of the conditional multivariate normal distribution is used. The notation of the posterior Gaussian process serves here only the theoretical understanding, in order to describe the limit towards the continuum in the form of functions and thus to depict the assignment of the values to the coordinates.

### Step 5: Interpretation

From the prior Gaussian process, the measured values are used to obtain a posterior Gaussian process, which takes into account the known partial information. This result of the Gaussian process regression represents not only one solution, but the entirety of all possible solution functions of the interpolation weighted with different probabilities. The indecision expressed in this way is not a weakness of the method. It does perfect justice to the problem, since in the case of a theory which is not completely known or in the case of noisy measurements, the solution, in principle, cannot be determined unambiguously. Mostly, however, we are specifically interested in at least the solution with the highest probability. This is given by the mean function ${\displaystyle m_{\mathrm {post} }(t)}$  in the first parameter of the posterior Gaussian process. From the conditional covariance function in the second parameter, it is possible to obtain the scatter around this solution. The diagonal ${\displaystyle k_{\mathrm {post} }(t,t)}$  of the covariance function gives a function with the variances of the predicted most likely function. The confidence interval is then given by the bounds ${\displaystyle m_{\mathrm {post} }(t)\pm {\sqrt {k_{\mathrm {post} }(t,t)}}}$ .

The Python code for the examples can be found on the respective image description page.

## Special cases

### Underdetermined measurments

In some cases of conditional Gaussian processes, groups of linearly related measured values are completely indeterminate. E.g., this is the case for indirect measurements following from underdetermined equations, such as with a noninvertible positive semidefinite matrix of the form ${\displaystyle A^{\top }\Sigma ^{-1}A}$ . The grid points then cannot be easily partitioned into known and unknown values, and the associated covariance matrix would be singular due to infinite uncertainties. This would correspond to a normal distribution that is infinitely stretched in certain spatial directions transverse to the coordinate axes. To account for the relationships between the undetermined variables, in such a case, the inverse matrix ${\displaystyle \Sigma _{\text{2}}^{-1}}$ , called the precision matrix, must be used. This can describe completely undetermined measurements, which is expressed by zeros in the diagonal. For such a singular distribution ${\displaystyle {\mathcal {N}}\left(\mu _{\text{2}},\Sigma _{\text{2}}\right)}$  with partially unknown measurements ${\displaystyle \mu _{\text{2}}}$  and singular measurement uncertainties ${\displaystyle \Sigma _{\text{2}}}$ , the wanted posterior distribution is obtained by the overlap to the prior Gaussian process model ${\displaystyle {\mathcal {N}}\left(\mu _{\text{1}},\Sigma _{\text{1}}\right)}$  calculated by multiplying the probability densities. The union of the two normal distributions

${\displaystyle \Sigma _{\text{Fusion}}=\left(\mathbb {I} +\Sigma _{\text{1}}\Sigma _{\text{2}}^{-1}\right)^{-1}\Sigma _{\text{1}}}$
${\displaystyle \mu _{\text{Fusion}}=\left(\mathbb {I} +\Sigma _{\text{1}}\Sigma _{\text{2}}^{-1}\right)^{-1}\mu _{\text{1}}+\Sigma _{\text{Fusion}}\Sigma _{\text{2}}^{-1}\mu _{\text{2}}}$

is obtained by the Fusion operation after appropriate transformation, so that the singular of the two matrices remains inverse. The result is always a finite distribution, since the finite matrix dominates. If both are finite, the equation can be put into the form of the posterior Gaussian process as in the section on the conditional distribution.

### Linear combination to a Gaussian process

From given basis functions ${\displaystyle \varphi _{j}(t)}$  a linear combination is to be formed, which has maximum overlap with the distribution ${\displaystyle {\mathcal {N}}(\mu ,\Sigma )}$  of an associated Gaussian process ${\displaystyle {\mathcal {GP}}(m,k)}$ . Or measured values ${\displaystyle \mu }$  are to be approximated, while the interfering signal ${\displaystyle {\mathcal {N}}(0,\Sigma )}$ , contained within, is ignored as far as possible. In both cases, the wanted coefficients can be calculated using generalized least squares estimation

${\displaystyle c=\left(A^{\top }\Sigma ^{-1}A\right)^{-1}A^{\top }\Sigma ^{-1}\mu }$
${\displaystyle \Sigma _{c}=\left(A^{\top }\Sigma ^{-1}A\right)^{-1}}$

The matrix ${\displaystyle A_{ij}=\varphi _{j}(t_{i})}$  contains the function values of the basis functions ${\displaystyle \varphi _{j}(t)}$  at the interpolation points ${\displaystyle t_{i}}$ . The resulting coefficients c with the associated covariance matrix ${\displaystyle \Sigma _{c}}$  describe the linear combination with the largest possible probability density in the distribution ${\displaystyle {\mathcal {N}}(\mu ,\Sigma )}$ . The linear combination thereby approximates the mean function or the measured values ${\displaystyle \mu }$  in such a way that the residuals are best described by the covariance matrix ${\displaystyle \Sigma }$ . The method is used, for example, in the program library Scikit-learn to empirically estimate a polynomial mean function of a Gaussian process.

### Approximation of an empirical Gaussian process

An empirically determined Gaussian process

${\displaystyle m(t)={\frac {1}{N}}\sum _{p=1}^{N}f_{p}(t)}$
${\displaystyle k(t,t')={\frac {1}{N-1}}\sum _{p=1}^{N}\left[f_{p}(t)-m(t)\right]\cdot \left[f_{p}(t')-m(t')\right]}$

from example functions ${\displaystyle f_{p}(t)}$  with few distinct degrees of freedom can be approximated and simplified by means of the eigenvalue decomposition or singular value decomposition

${\displaystyle \Sigma =VSV^{\top }}$

of the covariance matrix ${\displaystyle \Sigma _{ij}=k(t_{i},t_{j})}$ . This is done by choosing the ${\displaystyle n}$  largest eigenvalues or singular values ${\displaystyle \lambda _{p}=\sigma _{p}^{2}}$  from the diagonal matrix ${\displaystyle S}$ . The corresponding columns ${\displaystyle v_{p}}$  of ${\displaystyle V}$  are the principal components of the Gaussian process (see Principal Component Analysis). If the columns are represented as functions ${\displaystyle v_{p}(t)}$ , then the original Gaussian process is represented by the mean function ${\displaystyle m(t)}$  and the covariance function

${\displaystyle k(t,t')\approx \sum _{p=1}^{n}\sigma _{p}^{2}v_{p}(t)v_{p}(t')}$

This Gaussian process describes exclusively functions of the linear combination

${\displaystyle f(t)=m(t)+\sum _{p}c_{p}v_{p}(t)}$ ,

where each coefficient ${\displaystyle c_{p}}$  is scattered around zero mean as an independent random variable of variance ${\displaystyle \sigma _{p}^{2}=\lambda _{p}}$ .

Such a simplification is positively semidefinite and it usually lacks the properties to describe small-scale variations. These properties can be added to the covariance function in the form of a stationary covariance function fitted to the residuals:

${\displaystyle k(t,t')\approx \sum _{p=1}^{n}\sigma _{p}^{2}v_{p}(t)v_{p}(t')+k_{\text{stat}}(t'-t)}$

## Application examples

### Example: Trend prediction

In a hypothetical application example from market research, the future demand for the topic "snowboard" is to be predicted. For this purpose, an extrapolation of the number of Google searches[10] on this term is to be calculated.

In the past data, one can see a periodic, but non-sinusoidal seasonal dependence, which can be explained by the winter in the northern hemisphere. Moreover, the trend decreased continuously over the last decade. In addition, one recognizes a recurring increase in search queries during the Olympic Games every four years. The covariance function was therefore modeled with a slow trend and a one- and four-year period:

 ${\displaystyle k(r)=0{,}8\exp \left(-{\tfrac {1}{2}}\left|\sin(\pi r)\right|-|r/25|^{2}-2{,}5\right)+(0{,}2-0{,}01)\exp \left(-\left|\sin \left({\tfrac {1}{4}}\pi r\right)\right|/0{,}2\right)+0{,}01\exp \left(-r/45\right)}$

The trend also appears to have a significant asymmetry. This can be the case if the underlying random effects do not add up but reinforce each other, resulting in a Log-Normal Distribution. However, the logarithm of such values describes a normal distribution, to which Gaussian process regression can be applied.

The figure shows an extrapolation of the curve (to the right of the dashed line). Since the results here were transformed back from the logarithmic plot using an exponential function, the predicted confidence intervals are correspondingly asymmetrical (gray area). The extrapolation plausibly shows the seasonal patterns and also the increase in searches for the Olympic Games every four years. The example with mixed properties demonstrates very well the versatile modeling possibilities of the Gaussian process regression, which are unified in a single interpolation method.

### Example: Sensor calibration

In an application example from industry, sensors are to be calibrated using Gaussian processes. Due to tolerances during manufacturing, the characteristic curves ${\displaystyle f(x)}$  of the sensors show large individual differences. This causes high costs in calibration, since a complete characteristic curve would have to be measured for each sensor. However, the effort can be minimized by learning the exact behavior of the scattering by a Gaussian process. For this purpose, the complete characteristic curves ${\displaystyle f_{i}(x)}$  of ${\displaystyle N}$  randomly selected representative sensors are measured and thus the Gaussian process ${\displaystyle {\mathcal {GP}}(m,k)}$  of the scattering is calculated by

${\displaystyle m(x)={\frac {1}{N}}\sum _{i=1}^{N}f_{i}(x)}$
${\displaystyle k(x,x')={\frac {1}{N-1}}\sum _{i=1}^{N}\left[f_{i}(x)-m(x)\right]\cdot \left[f_{i}(x')-m(x')\right]}$

In the example shown, 15 representative characteristic curves are given. The resulting Gaussian process is represented by the mean function ${\displaystyle m(x)}$  and the confidence interval ${\displaystyle m(x)\pm {\sqrt {k(x,x)}}}$ .

With the conditional Gaussian process ${\displaystyle {\mathcal {GP}}(m_{\text{post}},k_{\text{post}})}$  with

${\displaystyle m_{\mathrm {post} }(x)=m(x)+\mathbf {k} ^{\top }(x,\mathbf {x} )K(\mathbf {x} ,\mathbf {x} )^{-1}(\mathbf {y} -m(\mathbf {x} ))}$
${\displaystyle k_{\mathrm {post} }(x,x')=k(x,x')-\mathbf {k} ^{\top }(x,\mathbf {x} )K(\mathbf {x} ,\mathbf {x} )^{-1}\mathbf {k} (\mathbf {x} ,x')}$

the complete characteristic map can now be reconstructed for each new sensor with a few individual measured values ${\displaystyle \mathbf {y} }$  at the coordinates ${\displaystyle \mathbf {x} }$ . The number of measured values must correspond at least to the number of degrees of freedom of the tolerances, which have an independent linear influence on the shape of the characteristic curve.

In the example shown, a single measured value is not yet sufficient to determine the characteristic curve unambiguously and precisely. The confidence interval shows the region of the curve which is not yet sufficiently accurate. With another measured value in this range, the remaining uncertainty can finally be completely eliminated. The exemplary fluctuations of the very differently acting sensors in this example thus seem to be caused by the tolerances of only two relevant inner degrees of freedom.

### Example: Signal decomposition

In a signal processing application example, a temporal signal is to be decomposed into its components. Let it be known about the system that the signal consists of three components following the three covariance functions

${\displaystyle k_{1}(r)=2{,}7^{2}\exp(-r^{2})}$
${\displaystyle k_{2}(r)=2{,}7^{2}\exp(-0{,}4|\sin(r/2{,}5)|)}$
${\displaystyle k_{3}(r)=0{,}6^{2}\delta _{r}}$

The sum signal then follows the addition rule of the covariance function

${\displaystyle k_{\text{sum}}(r)=k_{1}(r)+k_{2}(r)+k_{3}(r)}$ .

The following two figures show three random signals which were generated and added for demonstration with these covariance functions. In the sum of the signals one can hardly recognize the periodic signal hidden in it with the naked eye, since its spectral range overlaps with that of the two other components.

With the help of the operation decomposition the sum ${\displaystyle y_{\text{sum}}}$  can be decomposed again into the three components

${\displaystyle y_{1}=\Sigma _{1}\Sigma _{\text{sum}}^{-1}y_{\text{sum}}+3}$
${\displaystyle y_{2}=\Sigma _{2}\Sigma _{\text{sum}}^{-1}y_{\text{sum}}-3}$
${\displaystyle y_{3}=\Sigma _{3}\Sigma _{\text{sum}}^{-1}y_{\text{sum}}}$

where ${\displaystyle (\Sigma _{x})_{ij}=k_{x}(|t_{j}-t_{i}|)}$ . The estimate of the most likely decomposition shows how well the separation is possible in this case and how close the signals are to the original signals. The estimated uncertainties taking into account the cross-correlations are shown in the animation by random fluctuations.

The example shows how this method can be used to separate very different signals in one step. In contrast, other filtering methods such as moving averaging, Fourier filtering, polynomial regression, or spline approximation are optimized for specific signal characteristics and provide neither accurate error estimates nor cross-correlations.

If the Gaussian processes of the individual components for a given signal are not precisely known, then in some cases hypothesis testing can be performed using the log-marginal likelihood function, provided sufficient data are available to well-condition the function. Via its maximization, the parameters of the conjectured covariance functions can be fitted to the measured data.

## Literature

• C. E. Rasmussen. "Gaussian Processes in Machine Learning". doi:10.1007/978-3-540-28650-9_4. {{cite journal}}: Cite journal requires |journal= (help) In: Olivier Bousquet (publisher): Advanced Lectures on Machine Learning. ML 2003. (= Lecture Notes in Computer Science. vol. 3176). Springer, Berlin/Heidelberg 2004.([1], pdf)
• C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006, ISBN 0-262-18253-X. (gaussianprocess.org, pdf)
• R. M. Dudley: Real Analysis and Probability. Wadsworth and Brooks/Cole, 1989.
• B. Simon: Functional Integration and Quantum Physics. Academic Press, 1979.
• M. L. Stein: Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.

## References

1. Kanagawa, M., Hennig, P., Sejdinovic, D., Sriperumbudur, B. K. (2018), Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences, arXiv{{citation}}: CS1 maint: uses authors parameter (link)
2. C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning MIT Press, 2006, ISBN 0-262-18253-X, Chapter 4.2.2 Dot Product Covariance Functions. p. 89 and Table 4.1, p. 94.
3. C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006, ISBN 0-262-18253-X, Chapter 4 "Covariance Functions", valid covariance functions are listed as "ND" in Table 4.1 on page 94.
4. The derivation of the general linear transformation is based on the equation ${\displaystyle F\cdot {\mathcal {N}}\left(\mu ,\Sigma \right)={\mathcal {N}}\left(F\mu ,F\Sigma F^{\top }\right)}$ , by choosing the matrix F as [A B], ${\displaystyle \mu }$  as a vector (${\displaystyle \mu _{1}}$  ${\displaystyle \mu _{2}}$ ) and ${\displaystyle \Sigma }$  from corresponding four blocks.
5. The derivation is based on the covariance rule for multiplication ${\displaystyle {\text{cov}}(Ax,By)=A{\text{cov}}(x,y)B^{\top }}$  and associativity ${\displaystyle {\text{cov}}(x,y+z)={\text{cov}}(x,y)+{\text{cov}}(x,z)}$ .
6. The transformation involves, for example, multiplying by 1 = Σ11 or adding 0 = Σ11 and truncating the inverse matrices accordingly.
7. Bar-Shalom, Yaakov; Campo, Leon (November 1986). "The Effect of the Common Process Noise on the Two-Sensor Fused-Track Covariance". IEEE Transactions on Aerospace and Electronic Systems. AES-22 (6): 803–805. doi:10.1109/TAES.1986.310815.
8. C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning, Chapter 4.2.4 Making New Kernels from Old. S. 94.
9. C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning, Chapter 5.2 Bayesian Model Selection. S. 94.
10. The data is available at Google trends for the search term "snowboard".