3  Ordinary Least Squares

You are, no doubt, familiar with the ordinary least squares (OLS) estimator from your previous studies in Econometrics (EC226 or otherwise). OLS is an estimator for \(\beta\), but it is not the only one. Indeed, you could use maximum likelihood methods to estimate \(\beta\) instead.

The OLS estimator is the solution to,

\[ \min_b\;\sum_{i=1}^n(Y_i-X_i'b)^2 \]

Using vector notation, we can rewrite this as

\[ \begin{aligned} &\min_b\;(Y-Xb)'(Y-Xb)\\ =&\min_b\;Y'Y-Y'Xb-b'X'Y+b'X'Xb \\ =&\min_b\;Y'Y-2b'X'Y+b'X'Xb \end{aligned} \] From line 2 to 3 we use the fact that \(Y'Xb\) is a scalar and therefore equal to its own transpose: \(Y'Xb=(Y'Xb)'=b'X'Y\).

[note] When working with vectors and matrices it is important to keep track of their size. You can only multiply two matrices/vectors if their column and row dimensions match. For example, if \(A\) and \(B\) are both \(n\times k\) matrices (\(n\neq k\)), then \(AB\) is not defined since \(A\) has \(k\) columns and \(B\) has \(n\) rows. For the same reason \(BA\) is also not defined. However, you can pre-multiply \(B\) with \(A'\) as \(A'\) is a \(k\times n\) matrix: \(A'B\) is therefore a \((k\times n)\cdot (n\times k)=k\times k\) matrix. Similarly, \(BA'\) is defined, but it is an \(n\times n\) matrix.

Order matters when working with matrices and vectors. Pre-multiplication and post-multiplication are not the same thing.

Keep track of the size of each term to ensure they correspond to one another. In this instance, each term should be a scalar. For example, \(-2b'X'Y\) is the multiplication of a scalar (\(-2\): size \(1\times 1\)), a row vector (\(b'\): size \(1\times k\)), a matrix (\(X'\): size \(k\times n\)), and a column vector (\(Y\): size \(n\times 1\)). Thus we have a \((1\times 1)\cdot (1\times k)\cdot (k\times n)\cdot (n\times 1)=1\times 1\) scalar.
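If you ever lose track of dimensions, one quick way to check conformability is to inspect array shapes in code. Below is a minimal NumPy sketch (the arrays and their contents are purely illustrative, not from the text):

```python
import numpy as np

n, k = 100, 3
rng = np.random.default_rng(0)
A = rng.normal(size=(n, k))   # n x k
B = rng.normal(size=(n, k))   # n x k

print((A.T @ B).shape)        # (3, 3): A' is k x n, so A'B is k x k
print((B @ A.T).shape)        # (100, 100): BA' is n x n

try:
    A @ B                     # inner dimensions (k and n) do not match
except ValueError as err:
    print("AB is not defined:", err)
```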

Differentiating the above expression with respect to the vector \(b\) and setting the first-order conditions to \(0\), we find that the solution \(\hat{\beta}\) must satisfy the following condition.

\[ \begin{aligned} &0=-2X'Y+2X'X\hat{\beta} \\ \Rightarrow& X'X\hat{\beta} = X'Y \end{aligned} \]


How did we get this result? Deriving the first-order conditions requires knowing how to take the derivative of a scalar with respect to a column vector (in this case \(b\in R^k\)). Chapter 6 has some notes on vector differentiation.

We can ignore the first term \(Y'Y\) as it does not depend on \(b\). The second term is \(-2b'X'Y\). Here we can use the rule that, \[ \frac{\partial z'a}{\partial z} = \frac{\partial a'z}{\partial z} = a \] In this instance, \(a = X'Y \in R^k\). Thus, \[ \frac{\partial -2b'X'Y}{\partial b} = -2\frac{\partial b'X'Y}{\partial b} = -2X'Y \] The third term is \(b'X'Xb\). This is what is commonly referred to as a quadratic form: \(z'Az\). We know that the derivative of this form is, \[ \frac{\partial z'Az}{\partial z} = Az + A'z \] and if \(A\) is symmetric, the result simplifies to \(2Az\). In this instance, \(A = X'X\) is symmetric and the derivative is given by, \[ \frac{\partial b'X'Xb}{\partial b} = 2X'Xb \]
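If you would like to verify these differentiation rules without redoing the algebra, a numerical sanity check works well. The sketch below (simulated data, nothing from the text) compares the analytical gradient \(-2X'Y+2X'Xb\) with a finite-difference approximation of the objective:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 50, 3
X = rng.normal(size=(n, k))
Y = rng.normal(size=n)
b = rng.normal(size=k)

def ssr(b):
    """Objective (Y - Xb)'(Y - Xb)."""
    e = Y - X @ b
    return e @ e

# Analytical gradient from the rules above: -2X'Y + 2X'Xb
grad_analytic = -2 * X.T @ Y + 2 * X.T @ X @ b

# Central finite-difference approximation, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (ssr(b + eps * np.eye(k)[j]) - ssr(b - eps * np.eye(k)[j])) / (2 * eps)
    for j in range(k)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```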


In order to solve for \(\hat{\beta}\) we need to move the \(X'X\) term to the right-hand side. If these were scalars we would simply divide both sides by the same constant. However, as \(X'X\) is a matrix, division is not possible. Instead, we need to pre-multiply both sides by the inverse of \(X'X\): \((X'X)^{-1}\). Here’s the issue: the inverse of a matrix need not exist.

Given a square \(k\times k\) matrix \(A\), its inverse exists if and only if \(A\) is non-singular. For \(A\) to be non-singular it must have full rank: \(r(A)=k\), the number of rows/columns. This means that all \(k\) columns/rows must be linearly independent. (See Chapter 6 for a more detailed discussion of all these terms.)

In our application, \(A=X'X\) and

\[ r(X'X) = r(X) = \operatorname{colrank}(X)\leq k \]

To ensure that the inverse of \(X'X\) exists, \(X\) must have full column rank: all column vectors must be linearly independent. In practice, this means that no regressor can be a perfect linear combination of the others. This gives rise to one of the key linear regression model assumptions:

[assumption] rank condition: \(r(X)=k\)

You may know this assumption by another name: the absence of perfect collinearity between regressors.

[comment] The rank condition is the reason we exclude a base category when working with categorical variables. We will revisit this subject in more detail in Chapter 5.

Recall that most linear regression models are specified with a constant. Thus, the first column of \(X\) is

\[ X_1 = \begin{bmatrix}1 \\ 1 \\ \vdots \\ 1\end{bmatrix} \] an \(n\times 1\) vector of \(1\)’s, denoted here as \(\ell\). Suppose you have a categorical variable (for example, gender in an individual-level dataset) that splits the sample in two. The categories are assumed to be exhaustive and mutually exclusive. If you create two dummy variables, one for each category,

\[ X_2 = \begin{bmatrix}1 \\ \vdots \\1\\0\\ \vdots \\ 0\end{bmatrix}\qquad\text{and}\qquad X_3 = \begin{bmatrix}0 \\ \vdots \\0\\1\\ \vdots \\ 1\end{bmatrix} \]

it is evident that \(X_2+X_3 = \ell\). (Here I have depicted the sample as sorted along these two categories.) If \(X=[X_1\;X_2\;X_3]\), then it is rank-deficient: \(r(X) = 2<3\), since \(X_3=X_1-X_2\). Thus, we can only include two of these three regressors. We can even exclude the constant and have \(X=[X_2\;X_3]\). In Chapter 4 we will show that the projection remains the same regardless of which column we exclude, even the constant.
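A quick numerical illustration of this rank deficiency (a sketch with a tiny made-up sample, sorted by category as in the vectors above):

```python
import numpy as np

n = 6
ell = np.ones(n)                        # X1: the constant
x2 = np.array([1, 1, 1, 0, 0, 0.0])     # X2: dummy for the first category
x3 = ell - x2                           # X3: dummy for the second category

X = np.column_stack([ell, x2, x3])
print(np.linalg.matrix_rank(X))         # 2 < 3: rank deficient, so X'X is singular

# Dropping any one of the three columns restores full column rank
print(np.linalg.matrix_rank(np.column_stack([ell, x2])))  # 2
print(np.linalg.matrix_rank(np.column_stack([x2, x3])))   # 2
```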

If \(X\) is full rank, then \((X'X)^{-1}\) exists and,

\[ \hat{\beta} = (X'X)^{-1}X'Y \]

This relatively simple expression is the solution to the least squares minimization problem. Just think, it would take less than three lines of code to programme this. That is the power of knowing a little linear algebra.
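To make the point concrete, here is a minimal NumPy sketch with simulated data (the data-generating values \(\beta=(1,2)'\) and the sample size are arbitrary choices for illustration):

```python
import numpy as np

# Simulate Y = 1 + 2*X2 + noise
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# The OLS estimator: solve the normal equations X'X b = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)   # approximately [1, 2]
```

In practice, solving the normal equations with `np.linalg.solve` (or using a least-squares routine such as `np.linalg.lstsq`) is numerically preferable to forming \((X'X)^{-1}\) explicitly, but the estimator is the same.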

[comment] You may recognise the above expression from Chapter 1, where we used vectors to define the mean estimator. It turns out, the mean estimator is the simplest of OLS estimators. It is a regression of \(Y\) against a constant alone: \(X=\ell\).
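You can check this claim directly: regressing \(Y\) on a column of ones returns the sample mean. A tiny sketch with arbitrary numbers:

```python
import numpy as np

Y = np.array([2.0, 5.0, 7.0, 10.0])
ell = np.ones((len(Y), 1))              # X = l, a column of ones

beta_hat = np.linalg.solve(ell.T @ ell, ell.T @ Y)
print(beta_hat[0], Y.mean())            # both equal 6.0
```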

We can write the same expression in terms of summations over unit-level observations,

\[ \hat{\beta} = \Big(\sum_{i=1}^nX_iX_i'\Big)^{-1}\sum_{i=1}^nX_iY_i \]

Note, the change in position of the transpose: \(X_i\) is a column vector \(\Rightarrow\) \(X_i'X_i\) is a scalar and \(X_iX_i'\) is a \(k\times k\) matrix. To match the first expression, the term inside the parenthesis must be a \(k\times k\) matrix. Similarly, \(X'Y\) is a \(k\times 1\) vector, as is \(X_iY_i\).
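The matrix and summation forms are, of course, numerically identical. A short sketch (simulated data again, purely for illustration) confirms it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # k = 3 regressors
Y = rng.normal(size=n)

# Matrix form: (X'X)^{-1} X'Y
beta_matrix = np.linalg.solve(X.T @ X, X.T @ Y)

# Summation form: (sum_i X_i X_i')^{-1} sum_i X_i Y_i
XX = sum(np.outer(X[i], X[i]) for i in range(n))   # k x k
XY = sum(X[i] * Y[i] for i in range(n))            # k x 1
beta_sum = np.linalg.solve(XX, XY)

print(np.allclose(beta_matrix, beta_sum))  # True
```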

3.1 The uni-variate case

Undergraduate textbooks all teach a very similar expression for the OLS estimator of a uni-variate regression model (with a constant), such as

\[ Y_i = \beta_1+\beta_2X_{i2}+\varepsilon_i \]

[note] Once you are familiar with vector notation, it is relatively easy to tell whether a model is uni- or multi-variate. This is because the notation \(\beta_2 X_{i2}\) is not consistent with \(X_{i2}\) being a vector (row or column).

If \(X_{i2}\) is a \(k\times 1\) vector then so is \(\beta_2\). Thus, \(\beta_2 X_{i2}\) is \((k\times 1)\cdot (k\times1)\), which is not defined.

If \(X_{i2}\) is a row vector (as in Wooldridge, 2011), \(\beta_2 X_{i2}\) will then be \((k\times 1)\cdot (1\times k)\), a \(k\times k\) matrix. This cannot be correct since the model is defined at the unit level.

Thus, if you see a model written with the parameter in front of the regressor, you know that this must be a single regressor. This is a subtle, yet important, distinction that researchers often use to convey the structure of their model. Whenever \(X_{i2}\) is a vector, researchers will almost always use the notation \(X_{i2}'\beta\) or \(X_{i2}\beta\), depending on whether \(X_{i2}\) is assumed to be a column or row vector.

We know that,

\[ \begin{aligned} \tilde{\beta}_2 =& \frac{\sum_{i=1}^n(Y_i-\bar{Y})(X_{i2}-\bar{X}_2)}{\sum_{i=1}^n(X_{i2}-\bar{X}_2)^2} \\ \text{and}\qquad \tilde{\beta}_1 =& \bar{Y}-\tilde{\beta}_2\bar{X}_2 \end{aligned} \]

I am deliberately using the notation \(\tilde{\beta}\) to distinguish these two estimators from the expression below.

Let us see if we can replicate this result. When written in vector notation, the model is,

\[ \begin{aligned} Y =& X\beta+\varepsilon \\ =& \begin{bmatrix}1&X_{12} \\ 1 & X_{22} \\ \vdots & \vdots \\ 1 & X_{n2}\end{bmatrix}\begin{bmatrix}\beta_1 \\ \beta_2 \end{bmatrix} + \varepsilon \\ =& \begin{bmatrix}\ell &X_{2} \end{bmatrix}\begin{bmatrix}\beta_1 \\ \beta_2 \end{bmatrix} + \varepsilon \end{aligned} \]

Therefore,

\[ \begin{aligned} \hat{\beta} = \begin{bmatrix}\hat{\beta}_1 \\ \hat{\beta}_2\end{bmatrix}=&(X'X)^{-1}X'Y \\ =&\bigg(\begin{bmatrix}\ell' \\ X_{2}' \end{bmatrix}\begin{bmatrix}\ell &X_{2}\end{bmatrix}\bigg)^{-1}\begin{bmatrix}\ell' \\ X_{2}' \end{bmatrix}Y \\ =&\begin{bmatrix}\ell'\ell & \ell'X_2 \\ X_{2}'\ell & X_2'X_2 \end{bmatrix}^{-1}\begin{bmatrix}\ell'Y \\ X_{2}'Y \end{bmatrix} \end{aligned} \]

I went through this rather quickly, using a number of linear algebra rules that you may not be familiar with. Do not worry, the point of the exercise is not to become a linear algebra master, but instead to focus on the elements of each matrix/vector. Each element is a scalar (size \(1\times 1\)).

If we write each of them down as sums, they might look a little more familiar. (Take a look back at Chapter 1 to remind yourself of some of these steps.) First consider the \(2\times 2\) matrix:

  • element [1,1]: \(\ell'\ell = \sum_{i=1}^n 1 = n\)

  • element [1,2]: \(\ell'X_2 = \sum_{i=1}^nX_{i2} = n\bar{X}_2\)

  • element [2,1]: \(X_2'\ell = \sum_{i=1}^nX_{i2} = n\bar{X}_2\) (as above, since scalars are symmetric)

  • element [2,2]: \(X_2'X_2=\sum_{i=1}^nX_{i2}^2\)

Next, consider the final \(2\times 1\) vector,

  • element [1,1]: \(\ell'Y = \sum_{i=1}^n Y_i = n\bar{Y}\)

  • element [2,1]: \(X_2'Y = \sum_{i=1}^nY_iX_{i2}\)

Our OLS estimator is therefore,

\[ \hat{\beta} = \begin{bmatrix} n & n\bar{X}_2 \\ n\bar{X}_2 & \sum_{i=1}^nX_{i2}^2 \end{bmatrix}^{-1}\begin{bmatrix}n\bar{Y} \\ \sum_{i=1}^nY_iX_{i2} \end{bmatrix} \]

We now need to solve for the inverse of the \(2\times 2\) matrix. You can easily find notes on how to do this online. Here, I will just provide the solution.
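For reference, the inverse of a generic non-singular \(2\times 2\) matrix is

\[ \begin{bmatrix}a & b \\ c & d\end{bmatrix}^{-1} = \frac{1}{ad-bc}\begin{bmatrix}d & -b \\ -c & a\end{bmatrix},\qquad ad-bc\neq 0 \]

Applying this with \(a=n\), \(b=c=n\bar{X}_2\), and \(d=\sum_{i=1}^nX_{i2}^2\) yields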

\[ \hat{\beta} = \frac{1}{n\sum_{i=1}^nX_{i2}^2-n^2\bar{X}_2^2}\begin{bmatrix} \sum_{i=1}^nX_{i2}^2 & -n\bar{X}_2 \\ -n\bar{X}_2 & n\end{bmatrix}\begin{bmatrix}n\bar{Y} \\ \sum_{i=1}^nY_iX_{i2} \end{bmatrix} \]

Remember, this is still a \(2\times 1\) vector. We can now solve for the final solution:

\[ \begin{aligned} \hat{\beta} =& \frac{1}{n\sum_{i=1}^nX_{i2}^2-n^2\bar{X}_2^2}\begin{bmatrix} n\bar{Y}\sum_{i=1}^nX_{i2}^2 -n\bar{X}_2\sum_{i=1}^nY_iX_{i2} \\ n\sum_{i=1}^nY_iX_{i2}-n^2\bar{X}_2\bar{Y}\end{bmatrix} \\ =& \frac{1}{n\sum_{i=1}^n(X_{i2}-\bar{X}_2)^2}\begin{bmatrix} n\bar{Y}\sum_{i=1}^nX_{i2}^2 + n^2\bar{Y}\bar{X}_2^2 - n^2\bar{Y}\bar{X}_2^2 -n\bar{X}_2\sum_{i=1}^nY_iX_{i2} \\ n\sum_{i=1}^n(Y_i-\bar{Y})(X_{i2}-\bar{X}_2) \end{bmatrix} \\ =& \frac{1}{n\sum_{i=1}^n(X_{i2}-\bar{X}_2)^2}\begin{bmatrix} n\bar{Y}\sum_{i=1}^n(X_{i2}-\bar{X}_2)^2 -n\bar{X}_2\sum_{i=1}^n(Y_i-\bar{Y})(X_{i2}-\bar{X}_2) \\ n\sum_{i=1}^n(Y_i-\bar{Y})(X_{i2}-\bar{X}_2) \end{bmatrix} \\ =& \begin{bmatrix} \bar{Y} -\frac{n\sum_{i=1}^n(Y_i-\bar{Y})(X_{i2}-\bar{X}_2)}{n\sum_{i=1}^n(X_{i2}-\bar{X}_2)^2}\bar{X}_2 \\ \frac{n\sum_{i=1}^n(Y_i-\bar{Y})(X_{i2}-\bar{X}_2)}{n\sum_{i=1}^n(X_{i2}-\bar{X}_2)^2} \end{bmatrix} \\ =& \begin{bmatrix} \bar{Y} -\tilde{\beta}_2\bar{X}_2 \\ \tilde{\beta}_2 \end{bmatrix} \\ =& \begin{bmatrix}\tilde{\beta}_1 \\ \tilde{\beta}_2 \end{bmatrix} \end{aligned} \]

The math is a little involved, but it shows that these solutions are the same. Unfortunately, the working gets even more arduous in a multivariate context. However, there are useful tools to help us with that, which we will discuss next.
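If you prefer to check this equivalence numerically rather than algebraically, here is a small sketch with simulated data (the parameter values and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x2 = rng.normal(size=n)
Y = 1.5 + 0.8 * x2 + rng.normal(size=n)

# Textbook uni-variate formulas
beta2_tilde = ((Y - Y.mean()) @ (x2 - x2.mean())) / ((x2 - x2.mean()) @ (x2 - x2.mean()))
beta1_tilde = Y.mean() - beta2_tilde * x2.mean()

# Matrix OLS with a constant and x2
X = np.column_stack([np.ones(n), x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose([beta1_tilde, beta2_tilde], beta_hat))  # True
```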