5 Dummy Variables

Dummy variables are used extensively in Microeconometrics. They are used to model discrete public policy and experimental ‘treatments’ in Microeconomics and are the building blocks of “fixed effects” used widely in Microeconomics models.

Any dummy variable, \(D_i=\mathbf{1}\{true\}\), splits the sample into two groups: true (e.g. treated) and false (e.g. control/untreated). In a basic setting the use a of dummy variable might be relatively straight forward. For example, consider a linear model using to assess a single treatment from a randomized control trial,

\[ Y_i = \beta_{10}+\beta_{11} D_{1i} + \varepsilon_i \]

where \(D_{1i} = \mathbf{1}\{treated\}\) identifies those who are treated in the sample.

For ease of demonstration, let us assume that the data is sorted on \(D_i\): the first \(N_0\) observations are the untreated control group, and the next \(N_1 = N-N_0\) the treated. From previous chapters, we know that the matrix of regressors in this model is given by,

\[ X_1=[\ell,D_1]=\begin{bmatrix}1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \end{bmatrix} \]

The above matrix has rank 2, as the columns are linearly independent. Likewise, the rank of the \(2\times 2\) matrix \(X_1'X_1\) is also 2 (i.e. full rank) and is therefore invertible. If \(N=N_0\) - i.e., no units were treated - the matrix would be rank deficient. Similarly, if \(N_0=0\) - i.e., all units were treated - \(D_1\) would be a column of \(1\)’s, colinear with the constant.

Consider then the regressor \(D_{2i} = 1-D_{1i} = \mathbf{1}\{control\}\). What would happen if we included \(D_{2i}\) in the above model? The matrix of regressors would be,

\[ X=[\ell,D_1,D_2]=\begin{bmatrix}1 & 0 &1 \\ \vdots & \vdots & \vdots \\ 1 & 0 &1 \\ 1 & 1 &0\\ \vdots & \vdots & \vdots \\ 1 & 1 & 0 \end{bmatrix} \]

The rank of matrix \(X\) is not 3; it remains 2. This because the three columns of \(X\) are linearly dependent:

\[ \begin{bmatrix}1 \\ \vdots \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}=\begin{bmatrix}0 \\ \vdots \\ 0 \\ 1 \\ \vdots \\ 1 \end{bmatrix}+\begin{bmatrix}1 \\ \vdots \\ 1 \\ 0\\ \vdots \\ 0 \end{bmatrix} \]

You can have ANY two of the columns in the model, but not all three. Moreover, regardless which two you choose, the projection remains the same. Consider, the three possible models,

\[ \begin{aligned} Y_i =& \beta_{10}+\beta_{11} D_{1i}+\varepsilon_i=X_1\beta_1+\varepsilon_i & \qquad\text{where} \quad X_1 = [\ell\;D_1] \\ Y_i =& \beta_{20}+\beta_{22} D_{2i}+\varepsilon_i=X_2\beta_2+\varepsilon_i & \qquad\text{where} \quad X_2 = [\ell\;D_2] \\ Y_i =& \beta_{01}D_{1i} + \beta_{02}D_{2i}+\varepsilon_i=X_0\beta_0+\varepsilon_i & \qquad\text{where} \quad X_0 = [D_1\;D_2] \end{aligned} \]

It turns out that,

\[ P_1 = X_1(X_1'X_1)^{-1}X_1' = P_2 = P_0 \]

Similarly,

\[ M_1 = I_n-X_1(X_1'X_1)^{-1}X_1' = M_2 = M_0 \]

Thus, the vector of residuals estimated by OLS for the above three models are all the same.

Why does this matter? Consider, the case where you include a categorical covariate (‘control’ variable) in your model. We know from EC226 that a categorical variable must be included in the model as a set of dummy variables: one for each category.

\[ Y_i = \alpha + \beta Z_i + \sum_{j=2}^J \gamma_j D_{ji} + \upsilon_i \]

We must exclude 1 of the J categories since the model contains a constant and all the dummy variables (for the \(J\) categories) add up to 1 (or the vector \(\ell\)). Here, I exclude the first category, \(j=1\).

We know from the previous discussion on partitioned regression that we can estimate the scalar parameter \(\beta\) by first regressing \(Z_i\) on all other regressors,

\[ Z_i = \delta + \sum_{j=2}^J \psi_j D_{ji} + \nu_i \]

Next we regress the outcome, \(Y_i\), on the residuals from the above equation. Using vector notation, this model is given by,

\[ Y_i = \beta \hat{\nu_i}+\zeta_i = \beta M_{D_1}Z_i+\zeta_i \]

where \(M_{D_1}\) is the orthogonal projection matrix from the regression of \(Z_i\) on the full set of dummy variables (excluding category 1) and constant. The OLS estimator for \(\beta\) is given by,

\[ \hat{\beta} = (Z'M_{D_1}Z)^{-1}Z'M_{D_1}Y \]

There are two things to take from this equation. First, we get the same result whether or not we first orthogonalize \(Y_i\) on the full set of covariates, since \(M_D\) is idempotent: \(M_{D_1} = M_{D_1}M_{D_1}\). Thus,

\[ (Z'M_{D_1}Z)^{-1}Z'M_{D_1}M_{D_1}Y = (Z'M_{D_1}Z)^{-1}Z'M_{D_1}Y = \hat{\beta} \]

Second, the choice of base category among the control variables is irrelevant. Indeed, even if we dropped the constant and included a dummy variable for all \(J\) categories, we would achieve the same estimator. This is because,

\[ M_{D_0} = M_{D_1} = M_{D_2} ... = M_{D_J} \]

where \(M_{D_0}\) is the orthogonal projection matrix from a regression with all \(J\) categories (and no constant) and \(M_{D_j}\) is the orthogonal projection matrix from a regression with the \(j\)-th category excluded (and a constant).

5.1 Fixed Effects

One way of interpreting a model with an (exhaustive) set of dummy variables is as a model with a group-specific constant. Since, it makes no difference to parameter of interest whether you include the constant and \(J-1\) dummies or exclude the constant and include all \(J\) dummy variables, we can consider the model,¹

\[ Y_i = \beta Z_i + \sum_{j=1}^J \gamma_j D_{ji} + \upsilon_i \]

A common short-hand notation for such a set-up is,

\[ Y_i = \beta Z_i + \gamma_j + \upsilon_i \]

where \(\gamma_j\) signifies the presence of \(J\) group fixed effects. For each group the constant, denoted by \(\gamma\), changes. However, it is important to remember that this notation is a short-hand. The full set of regressors includes \(J\) dummy variables (or a constant and \(J-1\) dummy variables). When applying this shorthand it is standard to drop the constant term.

5.2 Multi-level Fixed Effects

Suppose you had two categorical variables, where one was a proper subset of the other. For example, an individual level dataset that contained information on the country in the UK where an individual lived (i.e. England, Scotland, Wales, Northern Ireland) and their county (i.e. Warwickshire, Oxfordshire, Cambridgeshire, etc.). Since county borders do not overlap country borders in the UK, an individuals county perfectly predicts their country.

Consider then the model,

\[ Y_i = \beta Z_i + \gamma_j + \delta_k + \upsilon_i \]

where \(j\) represents country fixed effects and \(k\) county fixed effects. Let’s assume the data is sorted by country and then county, and that there are two countries, each with 2 counties. (An obvious simplification.) If we consider the matrix of fixed effects, they can be written as,

\[ \begin{bmatrix}1 & 0 & 1 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \vdots & \vdots & 1 & 0 & 0 & 0 \\ \vdots & \vdots & 0 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ \vdots & \vdots & 0 & 0 & 1 & 0 \\ \vdots & \vdots & 0 & 0 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0& 1 & 0 & 0 & 0 & 1 \\ \end{bmatrix} \]

We can immediately see that the matrix is rank deficient. Columns 3 and 4 add up to give you column 1. Likewise, columns 5 and 6 add up to give you column 2. This implies that the matrix is not invertible and we cannot separately identify all fixed effects.

If we include column 1 (i.e. dummy variable for country \(j=1\)), we must exclude either column 3 or 4 (i.e. one of the counties in country \(j=1\)). For the same reasons discussed above (when selecting between the inclusion of the constant and dummy variables), it makes no difference whether we include column 1 and drop either column 3 or 4; or keep columns 3 and 4, and exclude column 1. Thus, we can only identify the model,²

\[ Y_i = \beta Z_i + \tilde{\delta}_k + \upsilon_i \] The general rule is, you retain the “highest” level of fixed effect. In this application, the model with county fixed effects.

5.3 Fixed Effects in Panel Data

Fixed effects are used extensively in panel data models. They are used to control for time-invariant, unit-level heterogeneity and flexible, aggregate time trends. Consider, the model

\[ Y_{it} = \beta Z_{it} + \alpha_i + \delta_t + \varepsilon_{it} \]

As before, this notation is used as a shorthand. The full set of unit fixed effects are given by,

\[ \alpha_i=\sum_{j=1}^N\alpha_j\mathbf{1}\{i=j\} \]

and the time fixed effects are given by,

\[ \delta_t=\sum_{j=1}^T\delta_k\mathbf{1}\{t=k\} \]

What if we wanted to include group-level controls in the model; e.g., an indicator for treatment group status? If group membership is stable over time, then for the reasons discussed above, group membership is perfectly co-linear with the unit fixed effects. The dummy variables for treated units (a subset of the unit fixed effects) add up to the dummy variable for the treated group. You will see this again in EC338.

A final point on this. What if you need to include a variable that identifies treatment-group status in the model for identification. For example, in a simple two-group-two-period difference-in-differences model you have,

\[ Y_{it} = \alpha + \psi D_i + \delta T_t + \beta D_i\cdot T_t + \varepsilon_{it} \]

where \(D_i = \mathbf{1}\{treated\}\) and \(T_t = \mathbf{1}\{t\geq t_0\}\) where \(t_0\) is the period of treatment. The coefficient on the interaction term is the parameter of interest. If we include unit fixed effects in this model,

\[ Y_{it} = \alpha_i + \delta T_t + \beta D_i\cdot T_t + \upsilon_{it} \]

then we drop the treatment-group indicator dummy variable: \(D_i\). And if we include include time fixed effects,

\[ Y_{it} = \alpha_i + \delta_t + \beta D_i\cdot T_t + \upsilon_{it} \]

we drop \(T_t\). Is this a problem for identification? No, the group indicator, \(D_i\), is “implied” by unit fixed effects. Because the group dummy is perfectly co-linear with the unit fixed effects, you could always have included it by excluding one of the unit fixed effects. In fact, with a balanced panel, replacing the group indicator with unit fixed effects will not actually change the estimated coefficient on the interaction term. It will, however, reduce the standard error of estimator.

For this reason, the most common notation used for (static) difference-in-differences is,

\[ Y_{it} = \alpha_i + \delta_t + \beta D_{it} + \upsilon_{it} \]

where \(D_{it} = \mathbf{1}\{treated\}\cdot\mathbf{1}\{t\geq t_0\}\). This notation is particularly useful because it does not need to be adapted when you add more time periods. However, it can only be used in applications with longitudinal data. If you are using repeated cross-sections then you cannot include unit fixed effects and must retain a treatment group indicator. We will discuss this further in EC338.

The choice of base category makes a difference when the variable of interest is a categorical variable. In this instance, the choice of base category changes the interpretation of the coefficient. This discussion relates more to instances where the categorical variable is included as a control variable.↩︎
I deliberately changed the parameters denoting the county fixed effects from \(\delta_k\) to \(\tilde{\delta}_k\) since the true model (data generating process) may contain both country and county fixed effects. However, we cannot separately identify these. We can identify \(\tilde{\delta}_k\): a combination of both sets of parameters.↩︎