Geometric View of Linear Regression

14 Mar 2024
  • # Linear Regression
  • # Linear Algebra

I assume the reader knows about vectors, span, vector spaces, norms, inner (dot) products, and linear independence.

Our destination is the well-known basic linear regression solution for the optimal weights $\mathbf{w}^*$:

$$\mathbf{w}^* = (X^\top X)^{-1}X^\top \mathbf{y}$$

where $X \in \mathbb{R}^{N \times d}$ is our matrix of training data features and $\mathbf{y} \in \mathbb{R}^N$ is a vector of the corresponding training data labels.

The route we’ll use to get there is vector geometry. Our checkpoints will be:

  1. We’ll think about finding $\mathbf{w}^*$ by solving the familiar system of linear equations $X\mathbf{w}^* = \mathbf{y}$, but we will see that this often has no solution.
  2. We’ll therefore change paths and solve for $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$, where $\hat{\mathbf{y}}$ is the best approximation of $\mathbf{y}$ for which a solution for $\mathbf{w}^*$ exists.
  3. This will lead us to vector geometry, which will define $\hat{\mathbf{y}}$ as the orthogonal projection of $\mathbf{y}$ onto the span of $X$.
    1. Here, we’ll see what an orthogonal projection is and prove that it gives the best approximation of $\mathbf{y}$.
    2. We’ll then see how to get the optimal weights $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$.

A System of Linear Equations

The goal of regression is to estimate the function $f$ of a real-world data-generating phenomenon. The estimation is done using sample data drawn from the phenomenon. This phenomenon could be the pricing of houses in Rwanda, and the sampled data could be each house’s price, size, number of rooms, location, and seller. If the goal is to predict the house price, then we will estimate a function $f$ that maps any house’s other features to its price.

Let house $i$’s price be the scalar $y_i$ and its other features, like size and number of rooms, be represented as a vector $\mathbf{x}_i = \begin{bmatrix}x_{i1} & x_{i2} & \cdots & x_{id}\end{bmatrix}^\top \in \mathbb{R}^d$. Therefore, $f: \mathbf{x}_i \to y_i$.

Linear regression assumes $f$ is a linear function,

$$f(\mathbf{x}_i) = w_1x_{i1} + w_2x_{i2} + \cdots + w_dx_{id}$$

Since $f(\mathbf{x}_i) = y_i$, let’s make this

$$y_i = w_1x_{i1} + w_2x_{i2} + \cdots + w_dx_{id}$$

Since we have the $y_i$s and $x_{ij}$s, to find the linear function $f$, we must find a vector $\mathbf{w} = \begin{bmatrix}w_1 & w_2 & \cdots & w_d\end{bmatrix}^\top \in \mathbb{R}^d$ that satisfies the system of linear equations

$$\begin{align*} y_1 &= w_1x_{11} + w_2x_{12} + \cdots + w_dx_{1d} \\ y_2 &= w_1x_{21} + w_2x_{22} + \cdots + w_dx_{2d} \\ &\vdots \\ y_N &= w_1x_{N1} + w_2x_{N2} + \cdots + w_dx_{Nd} \end{align*}$$

where $N$ is the number of houses in our sample data.

We can condense this system of linear equations into the matrix form

$$X\mathbf{w} = \mathbf{y}$$

where

$$X = \begin{bmatrix} \mathbf{x}_1^\top \\ \mathbf{x}_2^\top \\ \vdots \\ \mathbf{x}_N^\top \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
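
To make the shapes concrete, here is a minimal NumPy sketch; the house numbers are made up purely for illustration and are not from any real dataset.

```python
import numpy as np

# Each row of X is one house's feature vector x_i^T
# (say: size in square metres, number of rooms, age in years).
# These numbers are made up purely for illustration.
X = np.array([
    [120.0, 3.0, 10.0],
    [ 85.0, 2.0,  4.0],
    [200.0, 5.0, 25.0],
    [ 60.0, 1.0,  2.0],
])                                        # shape (N, d) = (4, 3)

# y holds the corresponding labels (prices), one per row of X.
y = np.array([95.0, 70.0, 180.0, 40.0])   # shape (N,) = (4,)

# For a candidate weight vector w, X @ w computes all N predictions at once,
# which is exactly the matrix form Xw of the system above.
w = np.array([0.5, 10.0, -1.0])
print(X @ w)
```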

The problem is that, with real-world data, the system will often have no solution. That is, there is no $\mathbf{w}$ that satisfies all the equations in the system, or, from a linear algebra view, $\mathbf{y}$ cannot be expressed as a linear combination of the columns of $X$.

So, since the perfect $\mathbf{w}$ does not exist, we instead look for $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$, where $\hat{\mathbf{y}}$ is (1) the closest thing to $\mathbf{y}$, i.e. the best approximation of $\mathbf{y}$, (2) that can be expressed as a linear combination of the columns of $X$, so that a solution for $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$ exists.
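
To see this “no solution” situation numerically, note that $\mathbf{y}$ is a linear combination of the columns of $X$ exactly when appending $\mathbf{y}$ to $X$ as an extra column does not increase the rank. Here is a small sketch with randomly generated, purely illustrative data (more rows than columns, plus noise, so an exact solution essentially never exists):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))
# Labels come from a linear rule plus noise, mimicking real-world measurements.
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

rank_X  = np.linalg.matrix_rank(X)
rank_Xy = np.linalg.matrix_rank(np.column_stack([X, y]))

# If the ranks differ, y is NOT a linear combination of the columns of X,
# so the system Xw = y has no exact solution.
print(rank_X, rank_Xy)   # typically 3 and 4
```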

The Closest Thing to $\mathbf{y}$ that can be expressed as a Linear Combination of $X$

We’ve seen that there is often no solution for $\mathbf{w}$ in the system $X\mathbf{w} = \mathbf{y}$. Since we needed $\mathbf{w}$ to get $f$, this means there is often no linear function that perfectly maps all training data features $\mathbf{x}_i$ to their respective labels $y_i$.

Since there is often no linear $f$, we have to settle for an optimal approximation. Remember, the reason we can’t find $f$ is that $\mathbf{y}$ cannot be expressed as a linear combination of the columns of $X$. So, instead of $\mathbf{y}$, we use a vector $\hat{\mathbf{y}}$ that is the closest thing to $\mathbf{y}$ while also being a linear combination of the columns of $X$. Our problem therefore changes to solving for $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$. With $\mathbf{w}^*$ we will have all we need to get the optimal linear approximation of $f$.

But this leads to a new question: how do we determine $\hat{\mathbf{y}}$, the vector that is closest to $\mathbf{y}$ but is also a linear combination of the columns of $X$? We will use vector geometry, and specifically the concept of orthogonal projection, to answer this question.

Orthogonal Projection

We are looking for $\hat{\mathbf{y}}$, the vector that is closest to $\mathbf{y}$ but is also a linear combination of the columns of $X$.

For now, let’s assume $\hat{\mathbf{y}}$ and $\mathbf{y}$ are points and $X$ is a line on a 2D plane.

[Figure: a 2D plane with a line labelled $X$, points $\mathbf{b}$ and $\hat{\mathbf{y}}$ on the line, and a point $\mathbf{y}$ off the line; dashed segments join $\mathbf{y}$ to $\mathbf{b}$ and $\mathbf{y}$ to $\hat{\mathbf{y}}$.]

Using basic geometry and the Pythagorean theorem, we can show that the point $\hat{\mathbf{y}}$ on line $X$ that is closest to point $\mathbf{y}$ is the foot of the perpendicular dropped from point $\mathbf{y}$ to line $X$. Let $d(\mathbf{u}, \mathbf{v})$ be the distance from point $\mathbf{u}$ to point $\mathbf{v}$. For any point $\mathbf{b}$ on line $X$,

$$d(\mathbf{y}, \mathbf{b})^2 = d(\mathbf{y}, \hat{\mathbf{y}})^2 + d(\hat{\mathbf{y}}, \mathbf{b})^2 \ge d(\mathbf{y}, \hat{\mathbf{y}})^2$$

We get $d(\mathbf{y}, \mathbf{b}) \ge d(\mathbf{y}, \hat{\mathbf{y}})$ by taking square roots.

Think about this proof for two minutes.

And then think about how this geometry problem relates to our linear algebra problem for two minutes.
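
If you’d like to check the geometric claim numerically, here is a small sketch. It uses the standard formula for the foot of the perpendicular from a point $\mathbf{y}$ onto a line through the origin with direction $\mathbf{u}$, namely $\hat{\mathbf{y}} = \frac{\mathbf{y}^\top\mathbf{u}}{\mathbf{u}^\top\mathbf{u}}\mathbf{u}$; the particular numbers are arbitrary.

```python
import numpy as np

u = np.array([2.0, 1.0])    # direction of the line X (taken through the origin)
y = np.array([1.0, 3.0])    # a point off the line

# Foot of the perpendicular dropped from y onto the line.
y_hat = (y @ u) / (u @ u) * u

# y - y_hat is perpendicular to the line's direction ...
print(np.isclose((y - y_hat) @ u, 0.0))            # True

# ... and y_hat is at least as close to y as any other point b = t*u on the line.
for t in np.linspace(-3.0, 3.0, 13):
    b = t * u
    assert np.linalg.norm(y - b) >= np.linalg.norm(y - y_hat) - 1e-12
```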

For the geometry problem, we have found that $\hat{\mathbf{y}}$, the closest point to $\mathbf{y}$ that is on line $X$, is the foot of the perpendicular dropped from point $\mathbf{y}$ onto line $X$. In the linear algebra problem, similarly, $\hat{\mathbf{y}}$, the closest vector to the vector $\mathbf{y}$ that is in the span of $X$, is the foot of the perpendicular dropped from $\mathbf{y}$ onto the span of $X$. The linear algebra term for $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the span of $X$.

Since the Pythagorean theorem generalizes to inner product spaces, we can prove that the vector $\hat{\mathbf{y}}$ closest to $\mathbf{y}$ that can also be expressed as a linear combination of the columns of $X$ (or equivalently, is in the span of $X$) is the orthogonal projection of $\mathbf{y}$ onto the span of $X$, as follows. For any vector $\mathbf{b}$ in the span of $X$,

$$\|\mathbf{y}-\mathbf{b}\|^2 = \|\mathbf{y}-\hat{\mathbf{y}}\|^2 + \|\hat{\mathbf{y}}-\mathbf{b}\|^2 \ge \|\mathbf{y}-\hat{\mathbf{y}}\|^2$$

The first equality is the Pythagorean theorem: it applies because $\mathbf{y}-\hat{\mathbf{y}}$ is, by the definition of the orthogonal projection, perpendicular to the span of $X$, while $\hat{\mathbf{y}}-\mathbf{b}$ lies in that span. We get $\|\mathbf{y}-\mathbf{b}\| \ge \|\mathbf{y}-\hat{\mathbf{y}}\|$ by taking square roots, with equality only when $\mathbf{b} = \hat{\mathbf{y}}$.

Think about this proof for two minutes.

It tells us that the distance between a vector $\mathbf{y} \notin \text{span}(X)$ and any vector $\mathbf{b} \in \text{span}(X)$ is always greater than or equal to the distance between $\mathbf{y}$ and the orthogonal projection of $\mathbf{y}$ onto the span of $X$, with equality only when $\mathbf{b}$ is that orthogonal projection. Or, turning that around, and more to the point: the orthogonal projection of $\mathbf{y}$ onto the span of $X$ is the vector in the span of $X$ that is closest to $\mathbf{y}$.

Therefore, $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the span of $X$.
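
As a quick numerical sanity check (not part of the derivation that follows), we can compute this orthogonal projection directly: a QR decomposition gives an orthonormal basis $Q$ for the span of $X$’s columns, and $QQ^\top\mathbf{y}$ is then the projection of $\mathbf{y}$ onto that span. The random data below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = rng.normal(size=(N, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

# Columns of Q are an orthonormal basis of span(X), so Q @ Q.T @ y
# is the orthogonal projection y_hat of y onto span(X).
Q, _ = np.linalg.qr(X)
y_hat = Q @ (Q.T @ y)

# y_hat should be at least as close to y as any other vector X @ c in the span.
dist_to_y_hat = np.linalg.norm(y - y_hat)
for _ in range(1000):
    b = X @ rng.normal(size=d)        # a random vector in the span of X
    assert np.linalg.norm(y - b) >= dist_to_y_hat - 1e-12
```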

Getting $\mathbf{w}^*$

Remember, we want to find $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$. $X$ is given, and we have just found that $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the span of $X$. How do we find $\mathbf{w}^*$?

To find $\mathbf{w}^*$, it is important to first note that the vector $\hat{\mathbf{y}} - \mathbf{y}$ is orthogonal (“perpendicular”) to every vector in the span of $X$, and therefore to all the column vectors of $X$ (which are themselves in the span of $X$).

[Figure: the vector $\hat{\mathbf{y}} - \mathbf{y}$ is orthogonal to the vectors in the span of $X$, illustrated for the case where $X$ is $N \times 1$ and the span of $X$ is therefore 1-dimensional (a line).]

Let $\mathbf{x}_1, \cdots, \mathbf{x}_d$ now denote the column vectors of $X$ (the columns, not the feature-vector rows from earlier). Since, as we have just said, $\hat{\mathbf{y}} - \mathbf{y}$ is orthogonal to all the column vectors of $X$, and the inner product of orthogonal vectors is $0$, we have the following conditions

$$\begin{align*} \langle \mathbf{x}_1, \hat{\mathbf{y}} - \mathbf{y} \rangle &= \mathbf{x}_1^\top(\hat{\mathbf{y}} - \mathbf{y}) = 0 \\ \langle \mathbf{x}_2, \hat{\mathbf{y}} - \mathbf{y} \rangle &= \mathbf{x}_2^\top(\hat{\mathbf{y}} - \mathbf{y}) = 0 \\ &\vdots \\ \langle \mathbf{x}_d, \hat{\mathbf{y}} - \mathbf{y} \rangle &= \mathbf{x}_d^\top(\hat{\mathbf{y}} - \mathbf{y}) = 0 \end{align*}$$
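
Continuing the QR-based sanity check from earlier (again with illustrative random data), these $d$ inner products are indeed all zero, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

Q, _ = np.linalg.qr(X)
y_hat = Q @ (Q.T @ y)        # orthogonal projection of y onto span(X)

# Inner product of each column of X with (y_hat - y); all entries are ~0.
print(X.T @ (y_hat - y))
```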

Remember $\hat{\mathbf{y}} = X\mathbf{w}^*$; therefore, we can write the conditions as

$$\begin{align*} \mathbf{x}_1^\top(X\mathbf{w}^* - \mathbf{y}) &= 0 \\ \mathbf{x}_2^\top(X\mathbf{w}^* - \mathbf{y}) &= 0 \\ &\vdots \\ \mathbf{x}_d^\top(X\mathbf{w}^* - \mathbf{y}) &= 0 \end{align*}$$

We write these conditions in matrix form

$$X^\top(X\mathbf{w}^* - \mathbf{y}) = \mathbf{0}_d$$

and from this we get

$$X^\top X\mathbf{w}^* = X^\top \mathbf{y}$$

Assuming the columns of $X$ are linearly independent, $X^\top X$ is positive definite and therefore invertible (I won’t provide the proof for this here), so

$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$$
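
To close, here is a numerical sketch (with randomly generated, purely illustrative data) checking that the closed-form formula agrees with solving the normal equations via `np.linalg.solve` and with NumPy’s least-squares routine, and that the residual is orthogonal to the columns of $X$ as derived above. In practice you would prefer `np.linalg.solve` or `np.linalg.lstsq` over explicitly forming $(X^\top X)^{-1}$, since that is cheaper and numerically more stable.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 100, 4
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

# 1. The closed-form solution derived above.
w_formula = np.linalg.inv(X.T @ X) @ X.T @ y

# 2. Solving the normal equations X^T X w = X^T y without an explicit inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3. NumPy's built-in least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_formula, w_solve), np.allclose(w_formula, w_lstsq))  # True True

# The residual Xw* - y is orthogonal to the columns of X, as derived above.
print(X.T @ (X @ w_formula - y))   # entries are ~0 (floating-point noise)
```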