Our destination is the well-known basic linear regression solution for the optimal weights

$$\hat{w} = (X^\top X)^{-1} X^\top y,$$

where $X$ is the matrix of our training data features and $y$ is a vector of the corresponding training data labels.
The route we’ll use to get there is vector geometry. Our checkpoints will be:
- We’ll think about finding $w$ by solving the familiar system of linear equations $Xw = y$, but we will see that this often has no solution.
- We’ll therefore change paths and solve for $w$ in $Xw = \hat{y}$, where $\hat{y}$ is the best approximation of $y$ for which a solution for $w$ exists.
- This will lead us to vector geometry, which will define $\hat{y}$ as the orthogonal projection of $y$ onto the span of the columns of $X$.
- Here, we’ll see what an orthogonal projection is and prove that it will give the best approximation of $y$.
- We’ll then see how to get the optimal weights $\hat{w}$ in $X\hat{w} = \hat{y}$.
A System of Linear Equations
The goal of regression is to estimate the function of a real-world data-generating phenomenon. The estimation is done using sample data drawn from the phenomenon. This phenomenon could be the pricing of houses in Rwanda and the sampled data could be the price, size, number of rooms, location and seller. If the goal is to predict the house price, then we will estimate a function that maps any house’s other features to its price.
Let house $i$’s price be the scalar $y_i$ and its other features like size and number of rooms be represented as a vector $x_i \in \mathbb{R}^d$. Therefore, $y_i = f(x_i)$.
Linear regression assumes $f$ is a linear function,

$$f(x_i) = w^\top x_i = w_1 x_{i1} + w_2 x_{i2} + \dots + w_d x_{id},$$

for some weight vector $w \in \mathbb{R}^d$. Since $f(x_i) = y_i$, let’s make this

$$w^\top x_i = y_i.$$
Since we have the $x_i$’s and $y_i$’s, to find the linear function $f$, we must find a vector $w$ that satisfies the system of linear equations

$$\begin{aligned}
w^\top x_1 &= y_1 \\
w^\top x_2 &= y_2 \\
&\;\;\vdots \\
w^\top x_n &= y_n,
\end{aligned}$$

where $n$ is the number of houses in our sample data.
We can condense this system of linear equations into the matrix form

$$Xw = y,$$

where

$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times d}
\qquad \text{and} \qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n.$$
The problem with this is, with real-world data, the system will often have no solution. That is, there is no $w$ that satisfies all the equations in the system, or, from a linear algebra view, $y$ cannot be expressed as a linear combination of the columns of $X$.
So, since the perfect $w$ does not exist, we instead look for $w$ in $Xw = \hat{y}$, where $\hat{y}$ is (1) the closest thing to $y$, or the best approximation of $y$, (2) that can be expressed as a linear combination of the columns of $X$, so that a solution for $w$ in $Xw = \hat{y}$ exists.
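To make this concrete, here is a minimal numpy sketch of an overdetermined system $Xw = y$ with no exact solution. The house numbers are made up purely for illustration; the nonzero least-squares residual is what tells us no exact $w$ exists.

```python
import numpy as np

# Toy "house" data: 4 houses, 2 features each (size, number of rooms).
# These numbers are made up purely for illustration.
X = np.array([[50.0, 2.0],
              [80.0, 3.0],
              [120.0, 4.0],
              [200.0, 5.0]])
y = np.array([60.0, 95.0, 150.0, 230.0])  # prices

# With 4 equations and only 2 unknowns, an exact solution for w in Xw = y
# usually does not exist. np.linalg.lstsq returns the least-squares w instead,
# and the nonzero residual confirms that no exact solution was found.
w, residual, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("w:", w)
print("sum of squared residuals:", residual)  # > 0, so Xw = y has no exact solution
print("Xw:", X @ w)                           # close to y, but not equal to it
```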
The Closest Thing to $y$ that can be expressed as a Linear Combination of the Columns of $X$
We’ve seen that there is often no solution for $w$ in the system $Xw = y$. Since we needed $w$ to get $f$, this means there is often no linear function $f$ that perfectly maps all training data features $x_i$ to their respective labels $y_i$.
Since there is often no linear $f$, we have to settle for an optimal approximation. Remember, the reason we can’t find $w$ is that $y$ cannot be expressed as a linear combination of the columns of $X$. So, instead of $y$, we use a vector $\hat{y}$ that is the closest thing to $y$ but also a linear combination of the columns of $X$. Our problem therefore changes to solving for $w$ in $Xw = \hat{y}$. With $w$ we will have all we need to get the optimal linear approximation of $f$.
But this leads to a new question. How do you determine $\hat{y}$, the vector that is closest to $y$ but also a linear combination of the columns of $X$? We will use vector geometry and, specifically, the concept of orthogonal projection to answer this question.
Orthogonal Projection
We are looking for $\hat{y}$, the vector that is closest to $y$ but is also a linear combination of the columns of $X$.
For now, let’s assume $y$ and $\hat{y}$ are points and the span of the columns of $X$ is a line $L$ on a 2D plane.
Using basic geometry and the Pythagorean theorem, we show that the point on line $L$ that is closest to point $y$ is the foot of the perpendicular dropped from point $y$ to line $L$; call this foot $\hat{y}$. Let $\mathrm{dist}(y, \hat{y})$ be the distance from point $y$ to $\hat{y}$. For any point $p$ on line $L$, the triangle with vertices $y$, $\hat{y}$ and $p$ has its right angle at $\hat{y}$, so

$$\mathrm{dist}(y, p)^2 = \mathrm{dist}(y, \hat{y})^2 + \mathrm{dist}(\hat{y}, p)^2 \geq \mathrm{dist}(y, \hat{y})^2.$$

We get $\mathrm{dist}(y, p) \geq \mathrm{dist}(y, \hat{y})$ by taking square roots.
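If you want to see this claim numerically, here is a small numpy sketch (the line and the point are arbitrary choices of mine) that projects a point onto a line through the origin and checks that the foot of the perpendicular is at least as close as other points on the line.

```python
import numpy as np

# A line through the origin in direction d, and a point p not on the line.
# Both are arbitrary choices for illustration.
d = np.array([2.0, 1.0])
p = np.array([1.0, 3.0])

# Foot of the perpendicular from p onto the line: the orthogonal projection of p onto d.
foot = (p @ d) / (d @ d) * d

# The segment from p to the foot is perpendicular to the line ...
print("dot with line direction:", (p - foot) @ d)  # ~ 0

# ... and the foot is closer to p than any other point on the line we try.
for t in np.linspace(-3, 3, 7):
    q = t * d  # some other point on the line
    print(t, np.linalg.norm(p - q) >= np.linalg.norm(p - foot))  # always True
```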
Think about this proof for two minutes.
And then think about how this geometry problem relates to our linear algebra problem for two minutes.
For the geometry problem, we have found that $\hat{y}$, the closest point to $y$ that is on line $L$, is the foot of the perpendicular dropped from point $y$ onto line $L$. In the linear algebra problem, similarly, $\hat{y}$, the closest vector to vector $y$ that is in the span of the columns of $X$, is the foot of a perpendicular dropped from $y$ onto the span of the columns of $X$. The linear algebra term for $\hat{y}$ is the orthogonal projection of $y$ onto the span of the columns of $X$.
Since the Pythagorean theorem generalizes to vector spaces, we can prove that the vector closest to $y$ that can also be expressed as a linear combination of the columns of $X$ (or equivalently, is in the span of the columns of $X$) is the orthogonal projection of $y$ onto that span, as follows. Let $\hat{y}$ be this orthogonal projection. For any vector $v$ in the span of the columns of $X$, the vector $\hat{y} - v$ also lies in the span, so it is orthogonal to $y - \hat{y}$, and

$$\|y - v\|^2 = \|(y - \hat{y}) + (\hat{y} - v)\|^2 = \|y - \hat{y}\|^2 + \|\hat{y} - v\|^2 \geq \|y - \hat{y}\|^2.$$

We get $\|y - v\| \geq \|y - \hat{y}\|$ by taking square roots, with equality only when $v = \hat{y}$.
Think about this proof for two minutes.
It tells us that the distance between the vector $y$ and any vector $v$ in the span of the columns of $X$ is always greater than or equal to the distance between $y$ and the orthogonal projection of $y$ onto that span, with equality only when $v$ is the orthogonal projection itself. Or, turning that around, and more to the point, the orthogonal projection of $y$ onto the span of the columns of $X$ is the vector in that span that is closest to $y$.
Therefore, $\hat{y}$ is the orthogonal projection of $y$ onto the span of the columns of $X$.
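As a sanity check, here is a numpy sketch (same made-up house data as before) that computes $\hat{y}$ without the closed-form formula we have not derived yet, by letting np.linalg.lstsq find the best $w$, and then confirms that $X w$ is at least as close to $y$ as other vectors in the span of the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same made-up house data as before.
X = np.array([[50.0, 2.0],
              [80.0, 3.0],
              [120.0, 4.0],
              [200.0, 5.0]])
y = np.array([60.0, 95.0, 150.0, 230.0])

# Let least squares find w, and take y_hat = Xw: the orthogonal projection of y
# onto the span of the columns of X.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w

# y - y_hat should be orthogonal to every column of X ...
print("X^T (y - y_hat):", X.T @ (y - y_hat))  # ~ [0, 0]

# ... and y_hat should be at least as close to y as any other vector in the span.
for _ in range(5):
    v = X @ rng.normal(size=2)                                  # random vector in the span
    print(np.linalg.norm(y - v) >= np.linalg.norm(y - y_hat))   # always True
```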
Getting $\hat{w}$
Remember we want to find the optimal weights $\hat{w}$ in $X\hat{w} = \hat{y}$. $X$ is given and we have just found that $\hat{y}$ is the orthogonal projection of $y$ onto the span of the columns of $X$. How do we find $\hat{w}$?
To find $\hat{w}$, it is important to first know that the vector $y - \hat{y}$ is orthogonal (”perpendicular”) to all the vectors in the span of the columns of $X$, and therefore to all the column vectors in $X$ (which are in the span themselves).
Let $c_1, c_2, \dots, c_d$ be the column vectors in $X$. Since, as we have just said, $y - \hat{y}$ is orthogonal to all the column vectors in $X$ and the inner product of orthogonal vectors is $0$, we have the following conditions

$$c_1^\top (y - \hat{y}) = 0,\quad c_2^\top (y - \hat{y}) = 0,\quad \dots,\quad c_d^\top (y - \hat{y}) = 0.$$

Remember $\hat{y} = X\hat{w}$, therefore we can write the conditions as

$$c_1^\top (y - X\hat{w}) = 0,\quad c_2^\top (y - X\hat{w}) = 0,\quad \dots,\quad c_d^\top (y - X\hat{w}) = 0.$$

We write these conditions in matrix form

$$X^\top (y - X\hat{w}) = 0,$$

and from this we get

$$X^\top X \hat{w} = X^\top y.$$
We know $X^\top X$ is invertible whenever the columns of $X$ are linearly independent, in which case it is positive definite (I won’t provide the proof for this here). Therefore

$$\hat{w} = (X^\top X)^{-1} X^\top y,$$

which is exactly the solution we set out to reach.
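Putting the destination formula to work on the same made-up data, here is a minimal sketch that solves the normal equations directly and checks the result against numpy’s least-squares solver.

```python
import numpy as np

# Same made-up house data again.
X = np.array([[50.0, 2.0],
              [80.0, 3.0],
              [120.0, 4.0],
              [200.0, 5.0]])
y = np.array([60.0, 95.0, 150.0, 230.0])

# w_hat = (X^T X)^{-1} X^T y, computed by solving X^T X w = X^T y
# (np.linalg.solve is preferred over forming the inverse explicitly).
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# It agrees with numpy's least-squares solver ...
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))  # True

# ... and the residual y - X w_hat is orthogonal to the columns of X,
# exactly the condition X^T (y - X w_hat) = 0 we derived above.
print(X.T @ (y - X @ w_hat))  # ~ [0, 0]
```

In practice, solving the normal equations (or calling a least-squares routine, which typically works via QR or SVD) is preferred to computing $(X^\top X)^{-1}$ explicitly, since forming the inverse is both slower and less numerically stable.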