4 The Geometry of OLS
In the last section we saw how the OLS estimator can, more generally, be described as a linear transformation of the \(Y\) vector.
\[ \hat{\beta} = (X'X)^{-1}X'Y \]
We also saw that in order for there to be a (unique) solution to the least squares problem, the \(X\) matrix must be full rank. This rules out any perfect collinearity between columns (i.e. regressors) in the \(X\) matrix, including the constant.
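To see the rank requirement in action, here is a small numerical sketch (NumPy with made-up data, not anything from the text): adding a column that is an exact linear combination of the others leaves the design matrix rank deficient, so \(X'X\) is singular and no unique OLS solution exists.

```python
# Minimal check: perfect collinearity destroys full column rank,
# so X'X cannot be inverted and OLS has no unique solution.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)

# Full-rank design: constant plus one regressor.
X_ok = np.column_stack([np.ones(n), x1])
print(np.linalg.matrix_rank(X_ok))   # 2: full column rank, X'X is invertible

# Rank-deficient design: third column is an exact linear combination of the others.
X_bad = np.column_stack([np.ones(n), x1, 2 * x1 - 3])
print(np.linalg.matrix_rank(X_bad))  # 2 < 3: rank deficient, X'X is singular
```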
Given the vector of OLS coefficients, we can also estimate the residual,
\[ \begin{aligned} \hat{\varepsilon} =& Y - X\hat{\beta} \\ =&Y-X(X'X)^{-1}X'Y \\ =&(I_n-X(X'X)^{-1}X')Y \end{aligned} \]
by plugging in the definition of \(\hat{\beta}\). Thus, the OLS estimator separates the vector \(Y\) into two components:
\[ \begin{aligned} Y =& X\hat{\beta} + \hat{\varepsilon} \\ =&\underbrace{X(X'X)^{-1}X'}_{P_X}Y + (\underbrace{I_n-X(X'X)^{-1}X'}_{I_n-P_X = M_X})Y \\ =&P_XY + M_XY \end{aligned} \]
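Before unpacking these two matrices, here is a quick NumPy sketch (simulated data) confirming that \(P_XY\) and \(M_XY\) are exactly the fitted values and residuals, and that they add back up to \(Y\).

```python
# Decomposing Y into P_X Y (fitted values) and M_X Y (residuals).
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

P_X = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto S(X)
M_X = np.eye(n) - P_X                    # projection onto the orthogonal complement

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(P_X @ Y, X @ beta_hat))      # P_X Y are the fitted values
print(np.allclose(M_X @ Y, Y - X @ beta_hat))  # M_X Y are the residuals
print(np.allclose(Y, P_X @ Y + M_X @ Y))       # the two pieces add back to Y
```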
The matrix \(P_X = X(X'X)^{-1}X'\) is an \(n\times n\) projection matrix. It is a linear transformation that projects any vector onto the span of \(X\), denoted \(S(X)\). (See Chapter 6 for more information on these terms.) \(S(X)\) is the vector space spanned by the columns of \(X\). The dimension of this vector space depends on the rank of \(P_X\),
\[ \dim(S(X)) = r(P_X) = r(X) = k \]
The matrix \(M_X = I_n-X(X'X)^{-1}X'\) is also an \(n\times n\) projection matrix. It projects any vector onto the orthogonal complement of \(S(X)\), denoted \(S^{\perp}(X)\). Any vector \(z\in S^{\perp}(X)\) is orthogonal to \(X\). This includes the estimated residual, which is by definition orthogonal to the predicted values and, indeed, to any column of \(X\) (i.e. any regressor). The dimension of this orthogonal vector space depends on the rank of \(M_X\),
\[ \dim(S^{\perp}(X)) = r(M_X) = r(I_n)-r(X) = n-k \]
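The two rank claims are easy to check numerically; the sketch below (NumPy, simulated \(X\) with \(k=4\) columns) is one way to do so.

```python
# Rank of the two projection matrices: r(P_X) = k, r(M_X) = n - k.
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

P_X = X @ np.linalg.inv(X.T @ X) @ X.T
M_X = np.eye(n) - P_X

# Explicit tolerance guards against floating-point noise in the zero singular values.
print(np.linalg.matrix_rank(P_X, tol=1e-8))   # k = 4
print(np.linalg.matrix_rank(M_X, tol=1e-8))   # n - k = 26
print(np.trace(P_X), np.trace(M_X))           # 4.0, 26.0: for idempotent matrices, trace = rank
```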
The orthogonality of these two projections is easily shown, since projection matrices are idempotent (\(P_XP_X = P_X\)) and symmetric (\(P_X' = P_X\)). Consider the product of the two projection matrices,
\[ P_X'M_X = P_X(I_n-P_X) = P_X-P_XP_X = P_X-P_X = 0 \]
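A quick numerical confirmation of the idempotency, symmetry, and orthogonality properties (again a NumPy sketch with simulated data):

```python
# P_X is idempotent and symmetric, and P_X M_X = 0.
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

P_X = X @ np.linalg.inv(X.T @ X) @ X.T
M_X = np.eye(n) - P_X

print(np.allclose(P_X @ P_X, P_X))               # idempotent
print(np.allclose(P_X.T, P_X))                   # symmetric
print(np.allclose(P_X @ M_X, np.zeros((n, n))))  # the two projections are orthogonal
```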
The least squares estimator is a projection of \(Y\) onto two vector spaces: one the span of the columns of \(X\) and the other a space orthogonal to \(X\).
Why is this useful? Well, it helps us understand the “mechanics” (technically, the geometry) of OLS. When we work with linear regression models, we typically assume either strict exogeneity, \(E[\varepsilon|X]=0\), or uncorrelatedness, \(E[X'\varepsilon]=0\), where the former implies the latter (but not the other way around).
When we use OLS, we estimate the vector \(\hat{\beta}\) such that,
\[ X'(Y-X\hat{\beta})=X'\hat{\varepsilon}=0 \quad \text{always} \]
This is true, not just in expectation, but by definition. The relationship is “mechanical”: the covariates and estimated residual are perfectly uncorrelated. This can be easily shown:
\[ \begin{aligned} X'\hat{\varepsilon} =& X'M_XY \\ =& X'(I_n-P_X)Y \\ =&X'I_nY-X'X(X'X)^{-1}X'Y \\ =&X'Y-X'Y \\ =&0 \end{aligned} \]
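This mechanical orthogonality holds in any fitted regression, whatever the data. A minimal NumPy sketch (simulated data):

```python
# X' e_hat = 0 by construction, not by assumption.
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e_hat = Y - X @ beta_hat

print(X.T @ e_hat)                 # numerically zero (up to floating point)
print(np.allclose(X.T @ e_hat, 0)) # True
```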
You are essentially imposing the assumption of uncorrelatedness between the explained and unexplained components of \(Y\) on the data. This means that if the assumption is wrong, so is the projection.
4.1 Partitioned regression
The tools of linear algebra can help us better understand partitioned regression. Indeed, I would go as far as to say that it is quite difficult to understand partitioned regression without an understanding of projection matrices. Moreover, we need to understand partitioned regression to really understand multivariate regression.
Let us divide the set of regressors into two groups: \(X_1\), a single regressor, and \(X_2\), an \(n\times (k-1)\) matrix. We can rewrite the linear model as,
\[ Y = X\beta+ \varepsilon =\beta_1X_1+X_2\beta_2+\varepsilon \]
Partitioned regression is typically taught as follows. The OLS estimator for \(\beta_1\) can be obtained by first regressing \(X_1\) on \(X_2\),
\[ X_1 = X_2\gamma_2+\upsilon_1 \]
Next, you regress \(Y\) on the residual from the above model,
\[ Y = \gamma_1 \hat{\upsilon}_1+\xi \]
The partitioned regression result states that the OLS estimates satisfy \(\hat{\gamma}_1=\hat{\beta}_1\). This aids in our understanding of \(\beta_1\) as the partial effect of \(X_1\) on \(Y\), holding \(X_2\) constant.
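Here is the two-step recipe carried out numerically (a NumPy sketch with simulated data in which \(X_1\) is deliberately correlated with \(X_2\)); the slope from step two matches \(\hat{\beta}_1\) from the full regression.

```python
# Partitioned regression, step by step.
import numpy as np

rng = np.random.default_rng(5)
n = 500
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # k - 1 = 3 columns
X1 = X2 @ np.array([0.3, 1.0, -0.5]) + rng.normal(size=n)     # correlated with X2
Y = 2.0 * X1 + X2 @ np.array([1.0, 0.5, 0.25]) + rng.normal(size=n)

# Full regression of Y on [X1, X2].
X = np.column_stack([X1, X2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Step 1: regress X1 on X2 and keep the residual.
gamma2_hat = np.linalg.solve(X2.T @ X2, X2.T @ X1)
v1_hat = X1 - X2 @ gamma2_hat

# Step 2: regress Y on that residual.
gamma1_hat = (v1_hat @ Y) / (v1_hat @ v1_hat)

print(beta_hat[0], gamma1_hat)   # the two estimates coincide (up to floating point)
```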
We can show this using projection matrices. Let us begin by applying our existing knowledge. From above, we know that the residual from the regression of \(X_1\) on \(X_2\) is,
\[ \hat{\upsilon}_1 = M_2X_1 \]
where \(M_2 = I_n-X_2(X_2'X_2)^{-1}X_2'\). Thus, the model we estimate in the second step is
\[ Y = \gamma_1M_2X_1+\xi \]
We know that \(\hat{\gamma}_1 = (\hat{\upsilon}_1'\hat{\upsilon}_1)^{-1}\hat{\upsilon}_1'Y\). Replacing the residual with its definition, and using both the symmetry and idempotency of \(M_2\), we get
\[ \begin{aligned} \hat{\gamma}_1 =& (\hat{\upsilon}_1'\hat{\upsilon}_1)^{-1}\hat{\upsilon}_1'Y \\ =&(X_1'M_2M_2X_1)^{-1}X_1'M_2Y \\ =&(X_1'M_2X_1)^{-1}X_1'M_2Y \end{aligned} \]
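The formula can also be computed directly; the NumPy sketch below (simulated data) checks that \((X_1'M_2X_1)^{-1}X_1'M_2Y\) matches the two-step estimate.

```python
# The same coefficient written with the annihilator matrix M_2.
import numpy as np

rng = np.random.default_rng(6)
n = 300
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X1 = X2 @ np.array([0.5, 1.0]) + rng.normal(size=n)
Y = 1.5 * X1 + X2 @ np.array([2.0, -1.0]) + rng.normal(size=n)

M2 = np.eye(n) - X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
gamma1_hat = (X1 @ M2 @ Y) / (X1 @ M2 @ X1)   # (X1' M2 X1)^{-1} X1' M2 Y

v1_hat = M2 @ X1                              # residual from the first-step regression
print(np.isclose(gamma1_hat, (v1_hat @ Y) / (v1_hat @ v1_hat)))  # True
```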
Next we want to show that \(\hat{\beta}_1\) is given by the same value. This part is more complicated. Let’s start by reminding ourselves of the following:
\[ \begin{aligned} X'X\hat{\beta} &= X'Y \\ \begin{bmatrix}X_1 & X_2\end{bmatrix}'\begin{bmatrix}X_1 & X_2\end{bmatrix}\begin{bmatrix}\hat{\beta}_1 \\ \hat{\beta}_2\end{bmatrix} &= \begin{bmatrix}X_1 & X_2\end{bmatrix}'Y \\ \begin{bmatrix}X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}\begin{bmatrix}\hat{\beta}_1 \\ \hat{\beta}_2\end{bmatrix} &= \begin{bmatrix}X_1'Y \\ X_2'Y\end{bmatrix} \end{aligned} \]
We could solve for \(\hat{\beta}_1\) by inverting the partitioned \(X'X\) matrix directly; however, that is tedious. An easier approach is to simply verify that \(\hat{\beta}_1=(X_1'M_2X_1)^{-1}X_1'M_2Y\). Recall, \(\hat{\beta}\) splits \(Y\) into two components:
\[ Y = \hat{\beta}_1X_1+X_2\hat{\beta}_2 + \hat{\varepsilon} \]
If we plug this definition of \(Y\) into the above expression we get,
\[ \begin{aligned} &(X_1'M_2X_1)^{-1}X_1'M_2(\hat{\beta}_1X_1+X_2\hat{\beta}_2 + \hat{\varepsilon}) \\ =&\hat{\beta}_1\underbrace{(X_1'M_2X_1)^{-1}X_1'M_2X_1}_{=1} \\ &+\underbrace{(X_1'M_2X_1)^{-1}X_1'M_2X_2\hat{\beta}_2}_{=0} \\ &+\underbrace{(X_1'M_2X_1)^{-1}X_1'M_2\hat{\varepsilon}}_{=0} \\ =&\hat{\beta}_1 \end{aligned} \]
In line 2, I use the fact that \(\hat{\beta}_1\) is a scalar and can be moved to the front (the order of multiplication does not matter with a scalar); since \(X_1\) is a single regressor, \(X_1'M_2X_1\) is also a scalar, so the bracketed term reduces to 1. In line 3, I use the fact that \(M_2X_2=0\) by definition. Line 4 uses the fact that \(M_2\hat{\varepsilon}=\hat{\varepsilon}\), which means that \(X_1'M_2\hat{\varepsilon}=X_1'\hat{\varepsilon}=0\).
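The two facts used in lines 3 and 4, \(M_2X_2=0\) and \(M_2\hat{\varepsilon}=\hat{\varepsilon}\), can also be confirmed numerically (NumPy sketch, simulated data):

```python
# M_2 annihilates X_2, and leaves the full-regression residual unchanged.
import numpy as np

rng = np.random.default_rng(7)
n = 200
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X1 = X2 @ np.array([0.5, 1.0]) + rng.normal(size=n)
X = np.column_stack([X1, X2])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e_hat = Y - X @ beta_hat

M2 = np.eye(n) - X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
print(np.allclose(M2 @ X2, 0))         # M_2 X_2 = 0
print(np.allclose(M2 @ e_hat, e_hat))  # e_hat is already orthogonal to X_2
```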
The OLS estimator solves for \(\beta_1\) using the variation in \(X_1\) that is orthogonal to \(X_2\). This is the manner in which we “hold \(X_2\) constant”: the variation in \(M_2X_1\) is orthogonal to \(X_2\). Changes in \(M_2X_1\) are uncorrelated with changes in \(X_2\), as if the variation in \(M_2X_1\) arose independently of \(X_2\). However, uncorrelatedness does NOT imply independence.