Lecture 25
Orthogonal Projection and Least Squares Approximation

Orthogonal projection can be applied to one of the most important problems in the analysis of real-world data - finding approximate linear relationships among related but non-deterministic variables. (Salary, education level, years of employment, and gender are examples of such variables.) The simplest example is to consider the relationship between a pair of numerical variables, based on a discrete set of data, perhaps a sample of a large population.

If the data pairs are of the form (x1, y1), . . . , (xn, yn), we might consider the plot of this data in the x-y plane and ask for the "best fit" straight line representing the data. The result would be an approximate empirical relationship of the form y = mx + b. The usual definition of "best fit" treats the x-coordinates as given inputs and the y-coordinates as approximate outputs, and seeks to minimize the vertical distances from the data points to the corresponding y-values on the to-be-determined line. The square of each of those distances is of the form (yj - (mxj + b))^2, and one could take the total distance measured in this fashion to be the square root of the sum of these squares. Minimizing that quantity is equivalent to minimizing the sum of the squares, that is, minimizing

E = (y1 - (mx1 + b))^2 + · · · + (yn - (mxn + b))^2.

The technique is called the method of least squares; the quantity E is called the error sum of squares, and the line for which the number E is minimized is called the least-squares line.
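As a concrete check on the definition, the following short sketch (in Python with NumPy, using made-up data values and an arbitrary trial line) evaluates the error sum of squares E for a candidate slope m and intercept b.

```python
import numpy as np

# Made-up sample data (x_j, y_j), purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 2.9, 4.2, 4.8])

def error_sum_of_squares(m, b, x, y):
    """E = sum over j of (y_j - (m*x_j + b))^2, for the trial line y = mx + b."""
    residuals = y - (m * x + b)
    return np.sum(residuals ** 2)

# E for the trial line y = x + 1; a different choice of (m, b) gives a different E.
print(error_sum_of_squares(1.0, 1.0, x, y))
```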

Let v1 be the vector in R^n with all coordinates "1," let v2 be the vector in R^n with the given coordinates xj, and let y be the vector in R^n with the given coordinates yj. Then

E = ||y - (b v1 + m v2)||^2,

and E is the square of the distance from y to the vector (b v1 + m v2) in the subspace W = span{v1, v2}. The problem of minimizing E is thus the same as that of finding the orthogonal projection of y onto W, and the scalars b and m are the coefficients in that projection.
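The following sketch, again with hypothetical data and an arbitrary trial line, verifies numerically that the coordinate-wise sum of squares and the squared norm ||y - (b v1 + m v2)||^2 are the same quantity.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical x_j
y = np.array([2.1, 2.9, 4.2, 4.8])   # hypothetical y_j
m, b = 1.0, 1.0                      # an arbitrary trial line y = mx + b

v1 = np.ones_like(x)                 # the all-ones vector v1
v2 = x                               # the vector v2 of x-coordinates

# The coordinate-wise sum of squares and the squared norm agree.
E_sum  = np.sum((y - (m * x + b)) ** 2)
E_norm = np.linalg.norm(y - (b * v1 + m * v2)) ** 2
print(np.isclose(E_sum, E_norm))     # True
```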

Since in a typical data set v2 will not be a multiple of v1 (in that exceptional case there would be no natural relationship giving y as a function of x), the vectors v1 and v2 are independent, and we already know a method for constructing the orthogonal projection. We can express this known result in a simple matrix fashion, using the matrix A = [v1 v2].

Theorem. Let A be an n × k matrix whose columns form a basis for a subspace W of R^n. For any vector y in R^n, the orthogonal projection of y onto W is w = A(A^T A)^(-1) A^T y.

Indeed, we know that the orthogonal projection w is in the span of the columns of A, and hence is of the form w = Au for some vector u in R^k. Then (y - w) is perpendicular to W, which is the column space of A and hence the row space of A^T. But the perpendicular space to the row space of A^T is the null space of A^T, and therefore 0 = A^T(y - w) = A^T y - A^T w = A^T y - A^T Au. Consequently,

A^T A u = A^T y,

and we can conclude that the orthogonal projection is w = Au = A(A^T A)^(-1) A^T y, as long as we know that the k × k matrix A^T A is invertible.
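As a computational sketch of the theorem, one can solve the normal equations A^T A u = A^T y directly rather than forming the inverse; the matrix A and the vector y below are hypothetical, chosen only so that the columns of A are independent.

```python
import numpy as np

# Hypothetical matrix A whose (independent) columns span W, and a vector y.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.1, 2.9, 4.2, 4.8])

# Solve the normal equations A^T A u = A^T y, then form the projection w = A u.
u = np.linalg.solve(A.T @ A, A.T @ y)
w = A @ u

# y - w should be perpendicular to W, i.e., A^T (y - w) = 0 up to roundoff.
print(A.T @ (y - w))
```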

To see this final claim, notice that, for any v in R^k,

||Av||^2 = (Av)^T (Av) = v^T A^T A v = 0 if A^T A v = 0.

Hence Av = 0 if A^T A v = 0. But the columns of A are independent, and therefore Av = 0 if and only if v = 0. Consequently, A^T A v = 0 if and only if v = 0, and therefore the square matrix A^T A is invertible.

Example: In the least squares problem above, we find that the orthogonal projection is

(b v1 + m v2) = w = A(A^T A)^(-1) A^T y.

If we let u be the column vector in R^2 with components b and m, then (b v1 + m v2) = Au. From the proof above, A^T A u = A^T y, and therefore u = (A^T A)^(-1) A^T y. Thus the slope and intercept of the least squares line are obtained from the input data A = [v1 v2] and y in this fashion.
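Here is a brief sketch of this recipe for the least-squares line, with hypothetical data; as an added convenience the result is cross-checked against NumPy's built-in polynomial fit.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical inputs x_j
y = np.array([2.1, 2.9, 4.2, 4.8])   # hypothetical outputs y_j

A = np.column_stack([np.ones_like(x), x])   # A = [v1 v2]
b, m = np.linalg.solve(A.T @ A, A.T @ y)    # u = (b, m) solves A^T A u = A^T y
print("intercept b =", b, " slope m =", m)

# Cross-check against numpy's built-in least-squares line fit.
m_np, b_np = np.polyfit(x, y, 1)
print(np.isclose(m, m_np), np.isclose(b, b_np))   # True True
```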


In our simple least squares problem only two data variables are considered, the problem is to find a line, and the coefficient matrix in question has two columns, since the desired projection is onto a subspace of R^n of dimension 2. Notice, however, that the general result we proved yields the projection onto a given subspace of any dimension k.

Orthogonal Projection Matrix for a Subspace
Let W be a subspace of R^n, and let {v1, . . ., vk} be a basis for W. If A = [v1 . . . vk], then A^T A is invertible, and the orthogonal projection matrix for W is given by P_W = A(A^T A)^(-1) A^T. In other words, P_W y is the orthogonal projection of y onto W for any vector y in R^n.
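A short sketch, with a hypothetical basis for a 2-dimensional subspace W of R^4, forms the projection matrix P_W and checks the properties one expects of an orthogonal projection matrix, namely symmetry and idempotence (P_W^2 = P_W).

```python
import numpy as np

# Hypothetical basis vectors for a 2-dimensional subspace W of R^4,
# placed as the columns of A.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])

P_W = A @ np.linalg.inv(A.T @ A) @ A.T      # P_W = A (A^T A)^(-1) A^T

# An orthogonal projection matrix is symmetric and idempotent.
print(np.allclose(P_W, P_W.T), np.allclose(P_W @ P_W, P_W))   # True True

y = np.array([2.1, 2.9, 4.2, 4.8])
w = P_W @ y                                  # orthogonal projection of y onto W
print(np.allclose(A.T @ (y - w), 0))         # residual is perpendicular to W
```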

In particular, suppose we have four interrelated data variables, for instance salary, education level, years of employment, and gender, with some sort of reasonable numerical coding for gender. We might consider looking for a linear relationship of the form "salary as an approximate function of the other three variables" and try to discover a best fit. Then we have n data points (lj, ej, gj, yj), standing for level of education, years of employment, gender, and salary respectively, and can look for the best fitting approximate relationship of the form y = m1 l + m2 e + m3 g + b. We let v1 be the vector with all coordinates "1," let v2, v3, v4 be the vectors in R^n with the given coordinates lj, ej, and gj respectively, and let y be the vector in R^n with the given coordinates yj. Then the least squares error becomes

E = ||y - (b v1 + m1 v2 + m2 v3 + m3 v4)||^2,

the square of the distance from y to the vector (b v1 + m1 v2 + m2 v3 + m3 v4) in the subspace W = span{v1, v2, v3, v4}. Again we need the projection of y onto W, and the scalars b, m1, m2, and m3 are given by the 4-vector (A^T A)^(-1) A^T y, with A = [v1 v2 v3 v4].
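A sketch of this multi-variable fit follows; the records for the four variables are entirely made up for illustration. It sets up A = [v1 v2 v3 v4] and solves the normal equations for the 4-vector of coefficients.

```python
import numpy as np

# Made-up records (l_j, e_j, g_j, y_j): education level, years of employment,
# gender code, and salary; the numbers are purely illustrative.
l = np.array([12.0, 16.0, 16.0, 18.0, 12.0, 14.0])
e = np.array([ 5.0,  2.0, 10.0,  7.0, 15.0,  3.0])
g = np.array([ 0.0,  1.0,  0.0,  1.0,  1.0,  0.0])
y = np.array([40.0, 48.0, 65.0, 70.0, 55.0, 38.0])

A = np.column_stack([np.ones_like(l), l, e, g])     # A = [v1 v2 v3 v4]
b, m1, m2, m3 = np.linalg.solve(A.T @ A, A.T @ y)   # the 4-vector (A^T A)^(-1) A^T y
print(b, m1, m2, m3)   # coefficients in y ~ m1*l + m2*e + m3*g + b
```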
