Basically, all the material found here can also be found in a linear algebra book. However, Gram-Schmidt orthogonalization is used in statistical algorithms and in the solution of statistical problems. Therefore, we briefly review the linear algebra theory which is necessary to understand Gram-Schmidt orthogonalization.
The following subsections also contain examples. It is very important for further understanding that the concepts presented here are valid not only for typical vectors, i.e. tuples of real numbers, but also for functions, which can be considered vectors as well.
It can easily be proven that the real numbers with the well-known addition and multiplication form a field. The same holds for the complex numbers with their addition and multiplication. Actually, there are not many other sets with two operations which fulfill all of these conditions.
For statistics, only the real and complex numbers with the usual addition and multiplication are important.
The dimension of a vector space is the number of vectors which are necessary for a basis. A vector space has infinitely many bases, but the dimension is uniquely determined. Note that a vector space may have infinite dimension, e.g. consider the space of continuous functions.
Two vectors $v$ and $w$ are orthogonal to each other if $\langle v, w\rangle = 0$. In $\mathbb{R}^n$ it holds that the cosine of the angle between two vectors can be expressed as
$$\cos(\angle(v, w)) = \frac{\langle v, w\rangle}{\|v\|\,\|w\|}.$$
If the angle between $v$ and $w$ is ninety degrees (orthogonal), then the cosine is zero and it follows that $\langle v, w\rangle = 0$.
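As a small illustration (our own sketch, not part of the original text), the cosine formula can be checked in Python for two concrete vectors with the standard scalar product; a cosine of zero signals orthogonality:

```python
# Minimal sketch: cosine of the angle between two vectors in R^n,
# computed as <u, v> / (||u|| * ||v||) with the standard scalar product.
import numpy as np

u = np.array([1.0, 1.0, 0.0])
v = np.array([1.0, -1.0, 0.0])

cos_angle = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_angle)  # 0.0, so u and v are orthogonal
```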
A set of vectors $v_1, \dots, v_p$ is called orthonormal if
$$\langle v_i, v_j\rangle = \begin{cases} 0 & \text{if } i \neq j, \\ 1 & \text{if } i = j. \end{cases}$$
If we consider a basis $e_1, \dots, e_n$ of a vector space, then we would like to have an orthonormal basis. Why?
Since we have a basis, any two vectors $v$ and $w$ can be expressed as $v = \alpha_1 e_1 + \dots + \alpha_n e_n$ and $w = \beta_1 e_1 + \dots + \beta_n e_n$. Therefore, if the basis is orthonormal, the scalar product of $v$ and $w$ reduces to
$$\langle v, w\rangle = \Big\langle \sum_{i=1}^{n} \alpha_i e_i,\ \sum_{j=1}^{n} \beta_j e_j\Big\rangle = \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \beta_j \langle e_i, e_j\rangle = \sum_{i=1}^{n} \alpha_i \beta_i = \alpha_1\beta_1 + \dots + \alpha_n\beta_n.$$
Consequently, the computation of a scalar product is reduced to simple multiplication and addition if the coefficients are known.
Remember that for our polynomials we would have to solve an integral!
The aim of the Gram-Schmidt orthogonalization is to find, for a set of vectors $v_1, \dots, v_p$, an equivalent set of orthonormal vectors $o_1, \dots, o_p$ such that any vector which can be expressed as a linear combination of $v_1, \dots, v_p$ can also be expressed as a linear combination of $o_1, \dots, o_p$:
1. Set $b_1 = v_1$ and $o_1 = \frac{b_1}{\|b_1\|}$
2. For each $i > 1$ set $b_i = v_i - \sum_{j=1}^{i-1} \langle v_i, o_j\rangle\, o_j$ and $o_i = \frac{b_i}{\|b_i\|}$; in each step the vector $v_i$ is projected onto the $o_j$ and the result is subtracted from $v_i$.
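As an illustration (our own sketch, not from the original text), these two steps can be written directly in Python with NumPy; the function name gram_schmidt and the list-based interface are our choices:

```python
# Sketch of the classical Gram-Schmidt procedure described above.
import numpy as np

def gram_schmidt(vectors):
    """Return orthonormal vectors o_1, ..., o_p spanning the same space as v_1, ..., v_p."""
    ortho = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        b = v.copy()
        for o in ortho:
            b -= np.dot(v, o) * o        # subtract the projection of the original v_i onto o_j
        ortho.append(b / np.linalg.norm(b))
    return ortho

# Small usage example:
o1, o2 = gram_schmidt([np.array([3.0, 1.0]), np.array([2.0, 2.0])])
print(np.dot(o1, o2))                    # ~0: the two vectors are orthogonal
```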
Consider the polynomials of degree at most two on the interval $[-1, 1]$ with the scalar product $\langle f, g\rangle = \int_{-1}^{1} f(x) g(x)\,dx$ and the norm $\|f\| = \sqrt{\langle f, f\rangle}$. We know that $v_1 = 1$, $v_2 = x$ and $v_3 = x^2$ are a basis for this vector space. Let us now construct an orthonormal basis:
Step 1a: $b_1 = v_1 = 1$
Step 1b: $o_1 = \frac{b_1}{\|b_1\|} = \frac{1}{\sqrt{\int_{-1}^{1} 1\,dx}} = \frac{1}{\sqrt{2}}$
Step 2a: $b_2 = v_2 - \langle v_2, o_1\rangle o_1 = x - \left(\int_{-1}^{1} x\,\frac{1}{\sqrt{2}}\,dx\right)\frac{1}{\sqrt{2}} = x - 0 = x$
Step 2b: $o_2 = \frac{b_2}{\|b_2\|} = \frac{x}{\sqrt{\int_{-1}^{1} x^2\,dx}} = \frac{x}{\sqrt{2/3}} = \sqrt{\tfrac{3}{2}}\,x$
Step 3a: $b_3 = v_3 - \langle v_3, o_1\rangle o_1 - \langle v_3, o_2\rangle o_2 = x^2 - \left(\int_{-1}^{1} x^2\,\frac{1}{\sqrt{2}}\,dx\right)\frac{1}{\sqrt{2}} - \left(\int_{-1}^{1} x^2\,\sqrt{\tfrac{3}{2}}\,x\,dx\right)\sqrt{\tfrac{3}{2}}\,x = x^2 - \frac{1}{3} - 0 = x^2 - \frac{1}{3}$
Step 3b: $o_3 = \frac{b_3}{\|b_3\|} = \frac{x^2 - \frac{1}{3}}{\sqrt{\int_{-1}^{1}\left(x^2 - \frac{1}{3}\right)^2 dx}} = \frac{x^2 - \frac{1}{3}}{\sqrt{8/45}} = \sqrt{\tfrac{5}{8}}\,(3x^2 - 1)$
It can be proven that $\frac{1}{\sqrt{2}}$, $\sqrt{\tfrac{3}{2}}\,x$ and $\sqrt{\tfrac{5}{8}}\,(3x^2 - 1)$ form an orthonormal basis with the above scalar product and norm.
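As a numerical sanity check (our own illustration, using SciPy's quad routine for the integrals), one can verify the orthonormality of these three polynomials:

```python
# Check <o_i, o_j> = integral_{-1}^{1} o_i(x) o_j(x) dx for the three polynomials above.
import numpy as np
from scipy.integrate import quad

o = [lambda x: 1.0 / np.sqrt(2.0),
     lambda x: np.sqrt(3.0 / 2.0) * x,
     lambda x: np.sqrt(5.0 / 8.0) * (3.0 * x**2 - 1.0)]

for i in range(3):
    for j in range(3):
        value, _ = quad(lambda x: o[i](x) * o[j](x), -1.0, 1.0)
        print(i, j, round(value, 10))   # ~1 on the diagonal, ~0 otherwise
```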
Consider the vectors $v_1 = (1, \epsilon, 0, 0)^\top$, $v_2 = (1, 0, \epsilon, 0)^\top$ and $v_3 = (1, 0, 0, \epsilon)^\top$. Assume that $\epsilon$ is so small that computing $1 + \epsilon^2 = 1$ holds on a computer (see http://en.wikipedia.org/wiki/Machine_epsilon). Let us compute an orthonormal basis for these vectors in $\mathbb{R}^4$ with the standard scalar product $\langle v, w\rangle = v_1 w_1 + v_2 w_2 + v_3 w_3 + v_4 w_4$ and the norm $\|v\| = \sqrt{v_1^2 + v_2^2 + v_3^2 + v_4^2}$.
Step 1a. $b_1 = v_1 = (1, \epsilon, 0, 0)^\top$
Step 1b. $o_1 = \frac{b_1}{\|b_1\|} = (1, \epsilon, 0, 0)^\top$ with $\|b_1\| = \sqrt{1 + \epsilon^2} = 1$ on the computer
Step 2a. $b_2 = v_2 - \langle v_2, o_1\rangle o_1 = (1, 0, \epsilon, 0)^\top - 1\cdot(1, \epsilon, 0, 0)^\top = (0, -\epsilon, \epsilon, 0)^\top$
Step 2b. $o_2 = \frac{b_2}{\|b_2\|} = \frac{1}{\sqrt{2}\,\epsilon}(0, -\epsilon, \epsilon, 0)^\top = \left(0, -\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}, 0\right)^\top$
Step 3a. $b_3 = v_3 - \langle v_3, o_1\rangle o_1 - \langle v_3, o_2\rangle o_2 = (1, 0, 0, \epsilon)^\top - 1\cdot(1, \epsilon, 0, 0)^\top - 0\cdot\left(0, -\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}, 0\right)^\top = (0, -\epsilon, 0, \epsilon)^\top$
Step 3b. $o_3 = \frac{b_3}{\|b_3\|} = \frac{1}{\sqrt{2}\,\epsilon}(0, -\epsilon, 0, \epsilon)^\top = \left(0, -\tfrac{1}{\sqrt{2}}, 0, \tfrac{1}{\sqrt{2}}\right)^\top$
It is obvious that for the vectors
- $o_1 = (1, \epsilon, 0, 0)^\top$
- $o_2 = \left(0, -\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}, 0\right)^\top$
- $o_3 = \left(0, -\tfrac{1}{\sqrt{2}}, 0, \tfrac{1}{\sqrt{2}}\right)^\top$
the scalar product $\langle o_2, o_3\rangle = \tfrac{1}{2} \neq 0$. All other pairs are also not zero, but they are multiplied by $\epsilon$ such that we get a result near zero.
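This effect is easy to reproduce; the following sketch (our own, with $\epsilon = 10^{-10}$ so that $1 + \epsilon^2 = 1$ in double precision) runs the classical procedure on the three vectors and prints the offending scalar product:

```python
# Classical Gram-Schmidt on the epsilon vectors: orthogonality is lost.
import numpy as np

eps = 1e-10                               # eps**2 = 1e-20 vanishes next to 1 in double precision
V = [np.array([1.0, eps, 0.0, 0.0]),
     np.array([1.0, 0.0, eps, 0.0]),
     np.array([1.0, 0.0, 0.0, eps])]

O = []
for v in V:
    b = v.copy()
    for o in O:
        b -= np.dot(v, o) * o             # projections always use the original v_i
    O.append(b / np.linalg.norm(b))

print(np.dot(O[1], O[2]))                 # ~0.5, although it should be 0
```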
To solve the problem a modified Gram-Schmidt algorithm is used:
Set $b_i = v_i$ for all $i$
for each $i$ from $1$ to $n$ compute $o_i = \frac{b_i}{\|b_i\|}$ and
for each $j$ from $i+1$ to $n$ compute $b_j = b_j - \langle b_j, o_i\rangle\, o_i$
The difference is that we compute each new $o_i$ first and immediately subtract its component from all remaining $b_j$. Even a wrongly computed (rounded) $o_i$ is thus removed consistently from all remaining vectors, instead of projecting each original $v_i$ separately onto the already computed $o_j$.
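A sketch of this modified procedure in Python (our own implementation; the names are ours) shows that orthogonality is preserved for the epsilon vectors from above:

```python
# Sketch of the modified Gram-Schmidt algorithm described above.
import numpy as np

def modified_gram_schmidt(vectors):
    b = [np.asarray(v, dtype=float).copy() for v in vectors]   # set b_i = v_i for all i
    ortho = []
    for i in range(len(b)):
        o_i = b[i] / np.linalg.norm(b[i])                       # o_i = b_i / ||b_i||
        ortho.append(o_i)
        for j in range(i + 1, len(b)):                          # immediately remove the o_i
            b[j] -= np.dot(b[j], o_i) * o_i                     # component from all remaining b_j
    return ortho

eps = 1e-10
V = [np.array([1.0, eps, 0.0, 0.0]),
     np.array([1.0, 0.0, eps, 0.0]),
     np.array([1.0, 0.0, 0.0, eps])]
o1, o2, o3 = modified_gram_schmidt(V)
print(np.dot(o2, o3))   # ~0: orthogonality is preserved
```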
In the analysis of high-dimensional data we usually analyze projections of the data. The approach results from the Cramér-Wold theorem, which states that a multidimensional distribution is determined if we know all of its one-dimensional projections. Another theorem states that most (one-dimensional) projections of multivariate data look normal, even if the multivariate distribution of the data is highly non-normal.
Therefore, in Exploratory Projection Pursuit we judge the interestingness of a projection by comparison with a (standard) normal distribution. If we assume that the one-dimensional projected data $x$ are standard normally distributed, then after the transformation $z = 2\Phi(x) - 1$, with $\Phi(x)$ the cumulative distribution function of the standard normal distribution, $z$ is uniformly distributed in the interval $[-1, 1]$.
Thus the interestingness can be measured by $\int_{-1}^{1}\left(f(z) - \tfrac{1}{2}\right)^2 dz$ with $f(z)$ a density estimated from the data. If the density $f(z)$ is equal to $\tfrac{1}{2}$ in the interval $[-1, 1]$, then the integral becomes zero and we have found that our projected data are normally distributed. A value larger than zero indicates a deviation from the normal distribution of the projected data and, hopefully, an interesting distribution.
Let $L_0, \dots, L_I$ be a set of orthonormal polynomials with the scalar product $\langle L_i, L_j\rangle = \int_{-1}^{1} L_i(z) L_j(z)\,dz$ and the norm $\|L_i\| = \sqrt{\langle L_i, L_i\rangle}$. What can we derive about a density $f(z)$ in the interval $[-1, 1]$?
If $f(z) = \sum_{i=0}^{I} a_i L_i(z)$ for some maximal degree $I$, then it holds that
$$\int_{-1}^{1} f(z) L_j(z)\,dz = \int_{-1}^{1} \sum_{i=0}^{I} a_i L_i(z) L_j(z)\,dz = a_j.$$
We can also write $a_j = \int_{-1}^{1} f(z) L_j(z)\,dz = E\big(L_j(z)\big)$, or empirically we get the estimator $\hat a_j = \frac{1}{n}\sum_{k=1}^{n} L_j(z_k)$.
We expand the constant term as $\tfrac{1}{2} = \sum_{i=0}^{I} b_i L_i(z)$ and get for our integral
$$\int_{-1}^{1}\left(f(z) - \tfrac{1}{2}\right)^2 dz = \int_{-1}^{1}\left(\sum_{i=0}^{I}(a_i - b_i) L_i(z)\right)^2 dz = \sum_{i=0}^{I}(a_i - b_i)^2.$$
So using an orthonormal function set allows us to reduce the integral to a sum over coefficients, which can be estimated from the data by plugging $\hat a_j$ into the formula above. The coefficients $b_i$ can be precomputed in advance.
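To make this concrete, here is a small sketch (our own, restricted to degree two and using the orthonormal polynomials from the first example) that estimates the coefficients $\hat a_j$ from projected data and evaluates the sum above; for standard normal data the resulting value should be close to zero:

```python
# Estimate a_j ~ (1/n) sum_k L_j(z_k) and the index sum_j (a_j - b_j)^2 for degree <= 2,
# with L_0, L_1, L_2 the orthonormal polynomials from the first example.
import numpy as np
from scipy.stats import norm

L = [lambda z: np.full_like(z, 1.0 / np.sqrt(2.0)),
     lambda z: np.sqrt(3.0 / 2.0) * z,
     lambda z: np.sqrt(5.0 / 8.0) * (3.0 * z**2 - 1.0)]
b = np.array([1.0 / np.sqrt(2.0), 0.0, 0.0])      # expansion of the constant 1/2

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                        # a (simulated) one-dimensional projection
z = 2.0 * norm.cdf(x) - 1.0                        # transform to the interval [-1, 1]

a_hat = np.array([np.mean(Lj(z)) for Lj in L])
print(np.sum((a_hat - b) ** 2))                    # close to 0 for normally distributed data
```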
The only problem left is to find the set of orthonormal polynomials $L_i$ up to degree $I$. We know that $1, x, x^2, \dots, x^I$ form a basis for this space. We have to apply the Gram-Schmidt orthogonalization to find the orthonormal polynomials. This has been started in the first example.
The resulting polynomials are called normalized Legendre polynomials. Up to a scaling factor the normalized Legendre polynomials are identical to the Legendre polynomials $P_i$, which have a recursive expression of the form
$$P_i(z) = \frac{(2i - 1)\,z\,P_{i-1}(z) - (i - 1)\,P_{i-2}(z)}{i}.$$
So computing our integral reduces to computing $P_0(z) = 1$ and $P_1(z) = z$ and using the recursive relationship to compute the higher-degree $P_i$'s. Please note that the recursion can be numerically unstable!
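For higher degrees, the recursion can be implemented directly; the following sketch (our own) evaluates $P_0, \dots, P_I$ at given points and rescales them by $\sqrt{(2i+1)/2}$ to obtain the orthonormal polynomials used above:

```python
# Evaluate the Legendre polynomials via the recursion
# P_i(z) = ((2i - 1) z P_{i-1}(z) - (i - 1) P_{i-2}(z)) / i, with P_0 = 1 and P_1 = z.
import numpy as np

def legendre_values(z, max_degree):
    z = np.asarray(z, dtype=float)
    P = [np.ones_like(z), z.copy()]
    for i in range(2, max_degree + 1):
        P.append(((2 * i - 1) * z * P[i - 1] - (i - 1) * P[i - 2]) / i)
    return P[:max_degree + 1]

def normalized_legendre_values(z, max_degree):
    # L_i(z) = sqrt((2i + 1) / 2) * P_i(z) are orthonormal on [-1, 1]
    return [np.sqrt((2 * i + 1) / 2.0) * p
            for i, p in enumerate(legendre_values(z, max_degree))]

# Example: degree 2 reproduces sqrt(5/8) * (3z^2 - 1) from the first example.
z = np.linspace(-1.0, 1.0, 5)
print(normalized_legendre_values(z, 2)[2])
print(np.sqrt(5.0 / 8.0) * (3.0 * z**2 - 1.0))
```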