Lecture 25
Orthogonal Projection and Least Squares Approximation

Orthogonal projection can be applied to one of the most important problems in the analysis of real-world data - finding approximate linear relationships among related but non-deterministic variables. (Salary, education level, years of employment, and gender are examples of such variables.) The simplest example is to consider the relationship between a pair of numerical variables, based on a discrete set of data, perhaps a sample of a large population.

If the data pairs are of the form (x1, y1), . . . , (xn, yn), we might consider the plot of this data in the x-y plane and ask for the "best fit" straight line representing the data. The result would be an approximate empirical relationship of the form y = mx + b. The usual definition of "best fit" treats the x-coordinates as given inputs and the y-coordinates as approximate outputs, and seeks to minimize the vertical distances from the data points to the corresponding y-values on the to-be-determined line. The square of each of those distances is of the form (yj - (mxj + b))^2, and one could take the total distance measured in this fashion to be the square root of the sum of these squares. Minimizing that quantity is equivalent to minimizing the sum of the squares, that is, minimizing

E = (y1 - (mx1 + b))^2 + · · · + (yn - (mxn + b))^2.

The technique is called the method of least squares; the quantity E is called the error sum of squares, and the line for which the number E is minimized is called the least-squares line.
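As a concrete check on the definition, the following short sketch (in Python with NumPy, using made-up data values and an arbitrary trial line) evaluates the error sum of squares E for a candidate slope m and intercept b.

```python
import numpy as np

# Made-up sample data (x_j, y_j), purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 2.9, 4.2, 4.8])

def error_sum_of_squares(m, b, x, y):
    """E = sum over j of (y_j - (m*x_j + b))^2, for the trial line y = mx + b."""
    residuals = y - (m * x + b)
    return np.sum(residuals ** 2)

# E for the trial line y = x + 1; a different choice of (m, b) gives a different E.
print(error_sum_of_squares(1.0, 1.0, x, y))
```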

Let v1 be the vector in R^n with all coordinates "1," let v2 be the vector in R^n with the given coordinates xj, and let y be the vector in R^n with the given coordinates yj. Then

E = ||y - (b v1 + m v2)||^2,

and E is the square of the distance from y to the vector (b v1 + m v2) in the subspace W = span{v1, v2}. The problem of minimizing E is thus the same as that of finding the orthogonal projection of y onto W, and the scalars b and m are the coefficients in that projection.
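The following sketch, again with hypothetical data and an arbitrary trial line, verifies numerically that the coordinate-wise sum of squares and the squared norm ||y - (b v1 + m v2)||^2 are the same quantity.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical x_j
y = np.array([2.1, 2.9, 4.2, 4.8])   # hypothetical y_j
m, b = 1.0, 1.0                      # an arbitrary trial line y = mx + b

v1 = np.ones_like(x)                 # the all-ones vector v1
v2 = x                               # the vector v2 of x-coordinates

# The coordinate-wise sum of squares and the squared norm agree.
E_sum  = np.sum((y - (m * x + b)) ** 2)
E_norm = np.linalg.norm(y - (b * v1 + m * v2)) ** 2
print(np.isclose(E_sum, E_norm))     # True
```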

Since in a typical data set v2 will not be a multiple of v1 (in that exceptional case there would be no natural relationship giving y as a function of x), the vectors v1 and v2 are independent, and we already know a method for constructing the orthogonal projection. We can express this known result in a simple matrix fashion, using the matrix A = [v1 v2].

Theorem. Let A be an n × k matrix whose columns form a basis for a subspace W of R^n. For any vector y in R^n, the orthogonal projection of y onto W is w = A(A^T A)^(-1) A^T y.

Indeed, we know that the orthogonal projection w is in the span of the columns of A, and hence is of the form w = Au for some vector u in R^k. Then (y - w) is perpendicular to W, which is the column space of A and hence the row space of A^T. But the perpendicular space to the row space of A^T is the null space of A^T, and therefore 0 = A^T(y - w) = A^T y - A^T w = A^T y - A^T Au. Consequently,

A^T A u = A^T y,

and we can conclude that the orthogonal projection is w = Au = A(A^T A)^(-1) A^T y, as long as we know that the k × k matrix A^T A is invertible.
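As a computational sketch of the theorem, one can solve the normal equations A^T A u = A^T y directly rather than forming the inverse; the matrix A and the vector y below are hypothetical, chosen only so that the columns of A are independent.

```python
import numpy as np

# Hypothetical matrix A whose (independent) columns span W, and a vector y.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.1, 2.9, 4.2, 4.8])

# Solve the normal equations A^T A u = A^T y, then form the projection w = A u.
u = np.linalg.solve(A.T @ A, A.T @ y)
w = A @ u

# y - w should be perpendicular to W, i.e., A^T (y - w) = 0 up to roundoff.
print(A.T @ (y - w))
```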

To see this final claim, notice that, for any v in R^k,

||Av||^2 = (Av)^T (Av) = v^T A^T A v = 0 if A^T A v = 0.

Hence Av = 0 if A^T A v = 0. But the columns of A are independent, and therefore Av = 0 if and only if v = 0. Consequently, A^T A v = 0 if and only if v = 0, and therefore the square matrix A^T A is invertible.

Example: In the least squares problem above, we find that the orthogonal projection is

(b v1 + m v2) = w = A(A^T A)^(-1) A^T y.

If we let u be the column vector in R^2 with components b and m, then (b v1 + m v2) = Au. From the proof above, A^T A u = A^T y, and therefore u = (A^T A)^(-1) A^T y. Thus the slope and intercept of the least squares line are obtained from the input data A = [v1 v2] and y in this fashion.
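Here is a brief sketch of this recipe for the least-squares line, with hypothetical data; as an added convenience the result is cross-checked against NumPy's built-in polynomial fit.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical inputs x_j
y = np.array([2.1, 2.9, 4.2, 4.8])   # hypothetical outputs y_j

A = np.column_stack([np.ones_like(x), x])   # A = [v1 v2]
b, m = np.linalg.solve(A.T @ A, A.T @ y)    # u = (b, m) solves A^T A u = A^T y
print("intercept b =", b, " slope m =", m)

# Cross-check against numpy's built-in least-squares line fit.
m_np, b_np = np.polyfit(x, y, 1)
print(np.isclose(m, m_np), np.isclose(b, b_np))   # True True
```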


In our simple least squares problem only two data variables are considered, the problem is to find a line, and the coefficient matrix in question has two columns, since the desired projection is onto a subspace of R^n of dimension 2. Notice, however, that the general result we proved yields the projection onto a given subspace of any dimension k.

Orthogonal Projection Matrix for a Subspace
Let W be a subspace of R^n, and let {v1, . . ., vk} be a basis for W. If A = [v1 . . . vk], then A^T A is invertible, and the orthogonal projection matrix for W is given by P_W = A(A^T A)^(-1) A^T. In other words, P_W y is the orthogonal projection of y onto W for any vector y in R^n.
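A short sketch, with a hypothetical basis for a 2-dimensional subspace W of R^4, forms the projection matrix P_W and checks the properties one expects of an orthogonal projection matrix, namely symmetry and idempotence (P_W^2 = P_W).

```python
import numpy as np

# Hypothetical basis vectors for a 2-dimensional subspace W of R^4,
# placed as the columns of A.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])

P_W = A @ np.linalg.inv(A.T @ A) @ A.T      # P_W = A (A^T A)^(-1) A^T

# An orthogonal projection matrix is symmetric and idempotent.
print(np.allclose(P_W, P_W.T), np.allclose(P_W @ P_W, P_W))   # True True

y = np.array([2.1, 2.9, 4.2, 4.8])
w = P_W @ y                                  # orthogonal projection of y onto W
print(np.allclose(A.T @ (y - w), 0))         # residual is perpendicular to W
```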

In particular, suppose we have four interrelated data variables, for instance salary, education level, years of employment, and gender, with some sort of reasonable numerical coding for gender. We might consider looking for a linear relationship of the form "salary as an approximate function of the other three variables" and try to discover a best fit. Then we have n data points (lj, ej, gj, yj), standing for level of education, years of employment, gender, and salary respectively, and can look for the best fitting approximate relationship of the form y = m1 l + m2 e + m3 g + b. We let v1 be the vector with all coordinates "1," let v2, v3, v4 be the vectors in R^n with the given coordinates lj, ej, and gj respectively, and let y be the vector in R^n with the given coordinates yj. Then the least squares error becomes

E = ||y - (b v1 + m1 v2 + m2 v3 + m3 v4)||^2,

the square of the distance from y to the vector (b v1 + m1 v2 + m2 v3 + m3 v4) in the subspace W = span{v1, v2, v3, v4}. Again we need the projection of y onto W, and the scalars b, m1, m2, and m3 are given by the 4-vector (A^T A)^(-1) A^T y, with A = [v1 v2 v3 v4].
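A sketch of this multi-variable fit follows; the records for the four variables are entirely made up for illustration. It sets up A = [v1 v2 v3 v4] and solves the normal equations for the 4-vector of coefficients.

```python
import numpy as np

# Made-up records (l_j, e_j, g_j, y_j): education level, years of employment,
# gender code, and salary; the numbers are purely illustrative.
l = np.array([12.0, 16.0, 16.0, 18.0, 12.0, 14.0])
e = np.array([ 5.0,  2.0, 10.0,  7.0, 15.0,  3.0])
g = np.array([ 0.0,  1.0,  0.0,  1.0,  1.0,  0.0])
y = np.array([40.0, 48.0, 65.0, 70.0, 55.0, 38.0])

A = np.column_stack([np.ones_like(l), l, e, g])     # A = [v1 v2 v3 v4]
b, m1, m2, m3 = np.linalg.solve(A.T @ A, A.T @ y)   # the 4-vector (A^T A)^(-1) A^T y
print(b, m1, m2, m3)   # coefficients in y ~ m1*l + m2*e + m3*g + b
```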
