Geometric View of Linear Regression

14 Mar 2024
  • # Linear Regression
  • # Linear Algebra

I assume the reader knows about vectors, span, vector spaces, norms, inner (dot) products, and linear independence.

Our destination is the well-known basic linear regression solution for the optimal weights $\mathbf{w}^*$:

$$\mathbf{w}^* = (X^\top X)^{-1}X^\top \mathbf{y}$$

where $X \in \mathbb{R}^{N \times d}$ is our matrix of training data features and $\mathbf{y} \in \mathbb{R}^N$ is a vector of the corresponding training data labels.

The route we’ll use to get there is vector geometry. Our checkpoints will be:

  1. We’ll think about finding $\mathbf{w}^*$ by solving the familiar system of linear equations $X\mathbf{w}^* = \mathbf{y}$, but we will see that this often has no solution.
  2. We’ll therefore change paths and solve for $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$, where $\hat{\mathbf{y}}$ is the best approximation of $\mathbf{y}$ for which a solution for $\mathbf{w}^*$ exists.
  3. This will lead us to vector geometry, which will define $\hat{\mathbf{y}}$ as the orthogonal projection of $\mathbf{y}$ onto the span of $X$.
    1. Here, we’ll see what an orthogonal projection is and prove that it gives the best approximation of $\mathbf{y}$.
    2. We’ll then see how to get the optimal weights $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$.

A System of Linear Equations

The goal of regression is to estimate the function $f$ of a real-world data-generating phenomenon. The estimation is done using sample data drawn from the phenomenon. This phenomenon could be the pricing of houses in Rwanda, and the sampled data could be each house’s price, size, number of rooms, location, and seller. If the goal is to predict the house price, then we will estimate a function $f$ that maps any house’s other features to its price.

Let house $i$’s price be the scalar $y_i$ and its other features, like size and number of rooms, be represented as a vector $\mathbf{x}_i = \begin{bmatrix}x_{i1} & x_{i2} & \cdots & x_{id}\end{bmatrix}^\top \in \mathbb{R}^d$. Therefore, $f: \mathbf{x}_i \to y_i$.

Linear regression assumes $f$ is a linear function,

$$f(\mathbf{x}_i) = w_1x_{i1} + w_2x_{i2} + \cdots + w_dx_{id}$$

Since $f(\mathbf{x}_i) = y_i$, let’s make this

$$y_i = w_1x_{i1} + w_2x_{i2} + \cdots + w_dx_{id}$$

Since we have the $y_i$s and $x_{ij}$s, to find the linear function $f$, we must find a vector $\mathbf{w} = \begin{bmatrix}w_1 & w_2 & \cdots & w_d\end{bmatrix}^\top \in \mathbb{R}^d$ that satisfies the system of linear equations

$$\begin{align*} y_1 &= w_1x_{11} + w_2x_{12} + \cdots + w_dx_{1d} \\ y_2 &= w_1x_{21} + w_2x_{22} + \cdots + w_dx_{2d} \\ &\vdots \\ y_N &= w_1x_{N1} + w_2x_{N2} + \cdots + w_dx_{Nd} \end{align*}$$

where $N$ is the number of houses in our sample data.

We can condense this system of linear equations into the matrix form

$$X\mathbf{w} = \mathbf{y}$$

where

$$X = \begin{bmatrix} \mathbf{x}_1^\top \\ \mathbf{x}_2^\top \\ \vdots \\ \mathbf{x}_N^\top \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
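
To make the shapes concrete, here is a minimal NumPy sketch; the house numbers are made up purely for illustration and are not from any real dataset.

```python
import numpy as np

# Each row of X is one house's feature vector x_i^T
# (say: size in square metres, number of rooms, age in years).
# These numbers are made up purely for illustration.
X = np.array([
    [120.0, 3.0, 10.0],
    [ 85.0, 2.0,  4.0],
    [200.0, 5.0, 25.0],
    [ 60.0, 1.0,  2.0],
])                                        # shape (N, d) = (4, 3)

# y holds the corresponding labels (prices), one per row of X.
y = np.array([95.0, 70.0, 180.0, 40.0])   # shape (N,) = (4,)

# For a candidate weight vector w, X @ w computes all N predictions at once,
# which is exactly the matrix form Xw of the system above.
w = np.array([0.5, 10.0, -1.0])
print(X @ w)
```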

The problem is that, with real-world data, the system will often have no solution. That is, there is no $\mathbf{w}$ that satisfies all the equations in the system, or, from a linear algebra view, $\mathbf{y}$ cannot be expressed as a linear combination of the columns of $X$.

So, since the perfect $\mathbf{w}$ does not exist, we instead look for $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$, where $\hat{\mathbf{y}}$ is (1) the closest thing to $\mathbf{y}$, i.e. the best approximation of $\mathbf{y}$, (2) that can be expressed as a linear combination of the columns of $X$, so that a solution for $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$ exists.
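
To see this “no solution” situation numerically, note that $\mathbf{y}$ is a linear combination of the columns of $X$ exactly when appending $\mathbf{y}$ to $X$ as an extra column does not increase the rank. Here is a small sketch with randomly generated, purely illustrative data (more rows than columns, plus noise, so an exact solution essentially never exists):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))
# Labels come from a linear rule plus noise, mimicking real-world measurements.
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

rank_X  = np.linalg.matrix_rank(X)
rank_Xy = np.linalg.matrix_rank(np.column_stack([X, y]))

# If the ranks differ, y is NOT a linear combination of the columns of X,
# so the system Xw = y has no exact solution.
print(rank_X, rank_Xy)   # typically 3 and 4
```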

The Closest Thing to $\mathbf{y}$ that can be expressed as a Linear Combination of $X$

We’ve seen that there is often no solution for $\mathbf{w}$ in the system $X\mathbf{w} = \mathbf{y}$. Since we needed $\mathbf{w}$ to get $f$, this means there is often no linear function that perfectly maps all training data features $\mathbf{x}_i$ to their respective labels $y_i$.

Since there is often no linear $f$, we have to settle for an optimal approximation. Remember, the reason we can’t find $f$ is that $\mathbf{y}$ cannot be expressed as a linear combination of the columns of $X$. So, instead of $\mathbf{y}$, we use a vector $\hat{\mathbf{y}}$ that is the closest thing to $\mathbf{y}$ while also being a linear combination of the columns of $X$. Our problem therefore changes to solving for $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$. With $\mathbf{w}^*$ we will have all we need to get the optimal linear approximation of $f$.

But this leads to a new question: how do we determine $\hat{\mathbf{y}}$, the vector that is closest to $\mathbf{y}$ but is also a linear combination of the columns of $X$? We will use vector geometry, and specifically the concept of orthogonal projection, to answer this question.

Orthogonal Projection

We are looking for $\hat{\mathbf{y}}$, the vector that is closest to $\mathbf{y}$ but is also a linear combination of the columns of $X$.

For now, let’s assume $\hat{\mathbf{y}}$ and $\mathbf{y}$ are points and $X$ is a line on a 2D plane.

[Figure: a 2D plane with a line labelled $X$, points $\mathbf{b}$ and $\hat{\mathbf{y}}$ on the line, and a point $\mathbf{y}$ off the line; dashed segments join $\mathbf{y}$ to $\mathbf{b}$ and $\mathbf{y}$ to $\hat{\mathbf{y}}$.]

Using basic geometry and the Pythagorean theorem, we can show that the point $\hat{\mathbf{y}}$ on line $X$ that is closest to point $\mathbf{y}$ is the foot of the perpendicular dropped from point $\mathbf{y}$ to line $X$. Let $d(\mathbf{u}, \mathbf{v})$ be the distance from point $\mathbf{u}$ to point $\mathbf{v}$. For any point $\mathbf{b}$ on line $X$,

$$d(\mathbf{y}, \mathbf{b})^2 = d(\mathbf{y}, \hat{\mathbf{y}})^2 + d(\hat{\mathbf{y}}, \mathbf{b})^2 \ge d(\mathbf{y}, \hat{\mathbf{y}})^2$$

We get $d(\mathbf{y}, \mathbf{b}) \ge d(\mathbf{y}, \hat{\mathbf{y}})$ by taking square roots.

Think about this proof for two minutes.

And then think about how this geometry problem relates to our linear algebra problem for two minutes.
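
If you’d like to check the geometric claim numerically, here is a small sketch. It uses the standard formula for the foot of the perpendicular from a point $\mathbf{y}$ onto a line through the origin with direction $\mathbf{u}$, namely $\hat{\mathbf{y}} = \frac{\mathbf{y}^\top\mathbf{u}}{\mathbf{u}^\top\mathbf{u}}\mathbf{u}$; the particular numbers are arbitrary.

```python
import numpy as np

u = np.array([2.0, 1.0])    # direction of the line X (taken through the origin)
y = np.array([1.0, 3.0])    # a point off the line

# Foot of the perpendicular dropped from y onto the line.
y_hat = (y @ u) / (u @ u) * u

# y - y_hat is perpendicular to the line's direction ...
print(np.isclose((y - y_hat) @ u, 0.0))            # True

# ... and y_hat is at least as close to y as any other point b = t*u on the line.
for t in np.linspace(-3.0, 3.0, 13):
    b = t * u
    assert np.linalg.norm(y - b) >= np.linalg.norm(y - y_hat) - 1e-12
```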

For the geometry problem, we have found that $\hat{\mathbf{y}}$, the closest point to $\mathbf{y}$ that is on line $X$, is the foot of the perpendicular dropped from point $\mathbf{y}$ onto line $X$. In the linear algebra problem, similarly, $\hat{\mathbf{y}}$, the closest vector to the vector $\mathbf{y}$ that is in the span of $X$, is the foot of the perpendicular dropped from $\mathbf{y}$ onto the span of $X$. The linear algebra term for $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the span of $X$.

Since the Pythagorean theorem generalizes to inner product spaces, we can prove that the vector $\hat{\mathbf{y}}$ closest to $\mathbf{y}$ that can also be expressed as a linear combination of the columns of $X$ (or equivalently, is in the span of $X$) is the orthogonal projection of $\mathbf{y}$ onto the span of $X$, as follows. For any vector $\mathbf{b}$ in the span of $X$,

$$\|\mathbf{y}-\mathbf{b}\|^2 = \|\mathbf{y}-\hat{\mathbf{y}}\|^2 + \|\hat{\mathbf{y}}-\mathbf{b}\|^2 \ge \|\mathbf{y}-\hat{\mathbf{y}}\|^2$$

The first equality is the Pythagorean theorem: it applies because $\mathbf{y}-\hat{\mathbf{y}}$ is, by the definition of the orthogonal projection, perpendicular to the span of $X$, while $\hat{\mathbf{y}}-\mathbf{b}$ lies in that span. We get $\|\mathbf{y}-\mathbf{b}\| \ge \|\mathbf{y}-\hat{\mathbf{y}}\|$ by taking square roots, with equality only when $\mathbf{b} = \hat{\mathbf{y}}$.

Think about this proof for two minutes.

It tells us that the distance between a vector $\mathbf{y} \notin \text{span}(X)$ and any vector $\mathbf{b} \in \text{span}(X)$ is always greater than or equal to the distance between $\mathbf{y}$ and the orthogonal projection of $\mathbf{y}$ onto the span of $X$, with equality only when $\mathbf{b}$ is that orthogonal projection. Or, turning that around, and more to the point: the orthogonal projection of $\mathbf{y}$ onto the span of $X$ is the vector in the span of $X$ that is closest to $\mathbf{y}$.

Therefore, $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the span of $X$.
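
As a quick numerical sanity check (not part of the derivation that follows), we can compute this orthogonal projection directly: a QR decomposition gives an orthonormal basis $Q$ for the span of $X$’s columns, and $QQ^\top\mathbf{y}$ is then the projection of $\mathbf{y}$ onto that span. The random data below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = rng.normal(size=(N, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

# Columns of Q are an orthonormal basis of span(X), so Q @ Q.T @ y
# is the orthogonal projection y_hat of y onto span(X).
Q, _ = np.linalg.qr(X)
y_hat = Q @ (Q.T @ y)

# y_hat should be at least as close to y as any other vector X @ c in the span.
dist_to_y_hat = np.linalg.norm(y - y_hat)
for _ in range(1000):
    b = X @ rng.normal(size=d)        # a random vector in the span of X
    assert np.linalg.norm(y - b) >= dist_to_y_hat - 1e-12
```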

Getting $\mathbf{w}^*$

Remember, we want to find $\mathbf{w}^*$ in $X\mathbf{w}^* = \hat{\mathbf{y}}$. $X$ is given, and we have just found that $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the span of $X$. How do we find $\mathbf{w}^*$?

To find $\mathbf{w}^*$, it is important to first note that the vector $\hat{\mathbf{y}} - \mathbf{y}$ is orthogonal (“perpendicular”) to every vector in the span of $X$, and therefore to all the column vectors of $X$ (which are themselves in the span of $X$).

[Figure: the vector $\hat{\mathbf{y}} - \mathbf{y}$ is orthogonal to the vectors in the span of $X$, illustrated for the case where $X$ is $N \times 1$ and the span of $X$ is therefore 1-dimensional (a line).]

Let $\mathbf{x}_1, \cdots, \mathbf{x}_d$ now denote the column vectors of $X$ (the columns, not the feature-vector rows from earlier). Since, as we have just said, $\hat{\mathbf{y}} - \mathbf{y}$ is orthogonal to all the column vectors of $X$, and the inner product of orthogonal vectors is $0$, we have the following conditions

$$\begin{align*} \langle \mathbf{x}_1, \hat{\mathbf{y}} - \mathbf{y} \rangle &= \mathbf{x}_1^\top(\hat{\mathbf{y}} - \mathbf{y}) = 0 \\ \langle \mathbf{x}_2, \hat{\mathbf{y}} - \mathbf{y} \rangle &= \mathbf{x}_2^\top(\hat{\mathbf{y}} - \mathbf{y}) = 0 \\ &\vdots \\ \langle \mathbf{x}_d, \hat{\mathbf{y}} - \mathbf{y} \rangle &= \mathbf{x}_d^\top(\hat{\mathbf{y}} - \mathbf{y}) = 0 \end{align*}$$
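
Continuing the QR-based sanity check from earlier (again with illustrative random data), these $d$ inner products are indeed all zero, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

Q, _ = np.linalg.qr(X)
y_hat = Q @ (Q.T @ y)        # orthogonal projection of y onto span(X)

# Inner product of each column of X with (y_hat - y); all entries are ~0.
print(X.T @ (y_hat - y))
```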

Remember $\hat{\mathbf{y}} = X\mathbf{w}^*$; therefore, we can write the conditions as

$$\begin{align*} \mathbf{x}_1^\top(X\mathbf{w}^* - \mathbf{y}) &= 0 \\ \mathbf{x}_2^\top(X\mathbf{w}^* - \mathbf{y}) &= 0 \\ &\vdots \\ \mathbf{x}_d^\top(X\mathbf{w}^* - \mathbf{y}) &= 0 \end{align*}$$

We write these conditions in matrix form

$$X^\top(X\mathbf{w}^* - \mathbf{y}) = \mathbf{0}_d$$

and from this we get

$$X^\top X\mathbf{w}^* = X^\top \mathbf{y}$$

Assuming the columns of $X$ are linearly independent, $X^\top X$ is positive definite and therefore invertible (I won’t provide the proof for this here), so

$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$$
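
To close, here is a numerical sketch (with randomly generated, purely illustrative data) checking that the closed-form formula agrees with solving the normal equations via `np.linalg.solve` and with NumPy’s least-squares routine, and that the residual is orthogonal to the columns of $X$ as derived above. In practice you would prefer `np.linalg.solve` or `np.linalg.lstsq` over explicitly forming $(X^\top X)^{-1}$, since that is cheaper and numerically more stable.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 100, 4
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

# 1. The closed-form solution derived above.
w_formula = np.linalg.inv(X.T @ X) @ X.T @ y

# 2. Solving the normal equations X^T X w = X^T y without an explicit inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3. NumPy's built-in least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_formula, w_solve), np.allclose(w_formula, w_lstsq))  # True True

# The residual Xw* - y is orthogonal to the columns of X, as derived above.
print(X.T @ (X @ w_formula - y))   # entries are ~0 (floating-point noise)
```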