Introduction to Stata

16 Oct 2024

This exercise sheet is designed as an introduction to Stata for those studying Econometrics A & B.

Why Stata?

Stata is a powerful and reliable statistical analysis software. It is not a language, like R or Python, but does have command-based functionalities. Those who are familiar with Stata’s commands, will effectively be able to programme in scripts called do-files (named for their file extension “.do”). Stata does have its own mathematical (compiled) language, called Mata, which is used in the programming of many commands. It also includes integration for C, C++, R, and Python.

Stata remains the dominant software/language in applied Economics research (see R-bloggers post). This is a product of history (i.e. luck) and some of Stata’s key strengths. It is a software designed primarily for the analysis of cross-sectional and panel datasets, and has grown its market share with the rise in availability of digitized household survey data (beginning in the 1980s). Stata is very efficient at analyzing small-to-medium sized datasets. The efficiency derives from the fact that it loads the data into memory. With modern computers, that typically have >8GB of RAM, you will have no problem analyzing a household survey.

However, this efficiency is also it’s “Achilles’ heel”. Stata is losing market share in the age of “big data”. While there are more efficient ways to analyse big-ger datasets in Stata, it remains bound by the amount of memory available.

Stata’s dominance in empirical research also relates to its peer reviewed nature. Commands (packages) that come pre-installed in Stata, along with those installed from Statistical Software Components (using the command ssc install), are heavily vetted. For this reason, Stata is used by consultancies that conduct research for legal cases.

As the dominant research software, it is important that you have a working knowledge of Stata, even if you are fluent in other programming languages. There might be a research assistant opportunity in your future that requires you to work with Stata. For those less confident with ‘programming’, Stata is an easier introduction to data analysis.

In the market today, a wider range of software and programming-language experience is extremely valuable. We encourage you to develop your knowledge of Stata while investing in languages like R and Python.

Problem Set 1

The small dataset (filename “data.dta”)1 is based on the Living Cost and Food Survey, 2013 (LCF). The LCF, conducted by the Office for National Statistics, collects information on spending patterns and the cost of living that reflects household budgets across the UK. This teaching dataset is a subset which has been subject to certain simplifications and additions, for learning and teaching by CMIST, Manchester.

This problem set will review a sample of helpful Stata commands. In addition, a do-file with some comments will be provided along with the exercises. Please work through this exercise in conjunction with the do-file. Some of the commands listed below may seem irrelevant for the analysis of this particular dataset. They have been included because of their usefulness in other settings.

Preamble

a. Create a folder on your computer where you intend to save this project. For example, “…/EC910/Seminars/Seminar-1”.

b. Download the dataset from Moodle - “data.dta” - and save/move it to the above folder.

c. Open Stata

d. Open a new do-file, either using the do-file icon (image of pen and paper) or “Window>Do-file Editor>New Do-file Editor”.

e. Save the new do-file to the project folder; e.g. “problem-set-1.do”.

f. Give the do file a title, author, and date. The treat text as a comment, you need to use the symbols *, //, or /* followed by */. For example,

. * ──────────────────────────────────────────────────────────────────────────── *
. * title: Problem Set 1
. * author: Neil Lloyd
. * date: 14 October 2024
. * ──────────────────────────────────────────────────────────────────────────── *

g. Open the dataset “data.dta”. You can go to “File>Open”, then locate the dataset on your computer and then select “Open”. However, doing so is not replicable. Instead, you can use the use command from within your do-file.

Add the following line to your do-file:

use "filepath/data.dta", clear

subsituting in the word “filepath” with the full filepath of your data. Alternatively, you can set a default directory for the project using cd. If set, this will become the default directory where Stata searches for a file or saves a file.

cd "filepath"

use data.dta, clear

Note, double quotations over the file path/name are only necessary when the path/name contains blank spaces; e.g. “problem set 1 data.dta”. This is one reason to avoid spaces in names.

Second, the option clear tells Stata to remove any existing dataset from memory before opening this one. You cannot open two datasets simultaneously. Any changes to the existing dataset will be lost unless saved. But this is not necessarily a bad thing. Read on.

h. Open a log file in which to record the output from Stata. If you have set the default directory using cd, you can do so as follows:

log using problem-set-1-log, replace

This will save the file using Stata’s own log-file format (file extension “.scml”). It’s better to save them as a simple text-file:

log using problem-set-1-log, replace text

or

log using problem-set-1-log.txt, replace

In this way, they open in a any text editor, like Notepad.

When you open a log-file, it will record all output (in the output window), until you tell it to stop: log close. This should appear at the end of your do-file.

As a suggestion, including the following before you open a log-file:

cap log close

log using problem-set-1-log.txt, replace

followed by,

log close

at the end. We won’t explain why, but it will help you avoid an annoying feature of how log-files and do-files interact.

Important principle

When you open a dataset in Stata, the software creates a copy in memory. Any changes you make to this copy will not affect the version you have saved on your harddrive, UNLESS you save those changes. It is a good principle to not save changes to the dataset, but rather to keep a copy of all changes made as a list of commands in a saved do-file. This ensures replicability.

If you want to create a new dataset, save it under a new name. The command below will do just this.

save new-data.dta, replace

When you close Stata it will ask you if you want to save changes to the opened dataset. Unless you have a clear reason to so, SAY NO. Ensure that your do-file is saved, well organized, and well commented. As a new feature of Stata 18, the software keeps a back-up of unsaved changes to the do-file (with file extension “.stswp”) in case the software or your computer crashes.

Part 1: Review the data

1.1. Review the data. You can either use the browse-icon (image of table and magnifying glass) or “Data>Data Editor>Data Editor (Browse)”.2 Else, in your do-file (or the main window’s command line) type,

. browse

or

. br

Stata does not mind if you shorten the command, so long as there is no ambiguity.

browse is not a great command to keep in a do-file. You want your do-file to run like a programme and browse means it will open up a visualization of the data each time you run the file, thereby slowing down future operations.

1.2. Use the describe (or simply de) command to learn about the variables in the dataset. Do you know what any of this information means?

. des

Contains data from C:\Users\neil_\OneDrive - University of Warwick\Documents\EC910\seminars\ps1\data.dta
 Observations:         5,144                  
    Variables:            14                  20 Sep 2020 13:54
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Variable      Storage   Display    Value
    name         type    format    label      Variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
casenew         int     %8.0g                 New random ID number
P550tpr         double  %12.0g                Total expenditure (top coded)
P344pr          double  %12.0g                Gross normal weekly household income (top coded)
P425r           byte    %8.0g      P425r      Main source of household income (recoded)
A172            byte    %8.0g      A172       Internet connection in household
A093r           byte    %8.0g      A093r      Economic position of Household Reference Person (recoded)
A094r           byte    %8.0g      A094r      NS-SEC 3 Class of Household Reference Person (recoded from NS-SEC 12)
A121r           byte    %8.0g      A121r      Tenure - type (recoded)
SexHRP          byte    %8.0g      SexHRP     Sex of Household Reference Person
A049r           byte    %8.0g      A049r      Household size; number of persons in HH (recoded)
G018r           byte    %8.0g      G018r      Number of adults (recoded)
G019r           byte    %8.0g      G019r      Number of children (recoded)
Gorx            byte    %8.0g      Gorx       Government Office Region modified
hhsize          byte    %8.0g                 Household size, number of people in household (recoded)formerly A049r
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 

1.3. Use the summarize (or simply su) to recover some basic summary statistics of the variable P550tpr.

. sum P550tpr

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
     P550tpr │      5,144    479.7584    292.3652     30.525       1175

1.4. Use the detail option to learn more about the variable P344pr. Are there any abnormal values or outliers?

. sum P344pr, det

      Gross normal weekly household income (top coded)
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%           56              0
 5%        138.1              0
10%       180.03              0       Obs               5,144
25%     314.0238              0       Sum of wgt.       5,144

50%       563.15                      Mean           620.4336
                        Largest       Std. dev.      359.1557
75%      928.087        1184.99
90%      1184.99        1184.99       Variance       128992.8
95%      1184.99        1184.99       Skewness       .2958553
99%      1184.99        1184.99       Kurtosis       1.802326

1.5. The summarize command doesn’t tell you anything about missing data. Check to see if the number of observations above matches the total observations in the dataset using count. Was there any missing data?

. count
  5,144

1.6. List the first 10 values of variables P550tpr and P344pr in the dataset using the command `list’.

. list P550tpr P344pr in 1/10

     ┌───────────────────────┐
     │     P550tpr    P344pr │
     ├───────────────────────┤
  1. │     255.635    534.39 │
  2. │ 305.9577537    781.14 │
  3. │ 240.2902021    445.62 │
  4. │ 501.4998985     778.4 │
  5. │ 465.2381126    643.84 │
     ├───────────────────────┤
  6. │ 524.1410067   1094.18 │
  7. │ 255.2775307    525.73 │
  8. │ 269.2049452       200 │
  9. │ 413.7603846     556.6 │
 10. │ 591.3247581    1105.6 │
     └───────────────────────┘

1.7. Make a simple frequency table of the values in A121r using the command tabulate.

. tab A121r

 Tenure - type │
     (recoded) │      Freq.     Percent        Cum.
───────────────┼───────────────────────────────────
 Public rented │        880       17.11       17.11
Private rented │        798       15.51       32.62
         Owned │      3,466       67.38      100.00
───────────────┼───────────────────────────────────
         Total │      5,144      100.00

Here are a list of other commands to help you navigate and learn about the dataset:

Part 2: Graphing Basics

2.1. Make a histogram depicting the frequency of values in the variable P550tpr. Try make it so that the y-axis is in percentage-points and not density (Stata’s default). To check the options of a command, type: help histogram.

. hist P550tpr, percent
(bin=37, start=30.525, width=30.931757)

2.2. Modify the above graph so that the width of each bin is £10. Which graph is more informative?

. hist P550tpr, percent width(10)
(bin=115, start=30.525, width=10)

2.3. Use Stata’s flexible twoway graph function to create a scatter plot of of the relationship bewteen P550tpr and P344pr.

. scatter P550tpr P344pr

2.4. Overlay the above scatter plot with a line graph of the linear fit between the same two variables. Hint: use the graphing command lfit. If you can make the colour of the fitted line red.

. twoway (scatter P550tpr P344pr) (lfit P550tpr P344pr, lc(red)), legend(pos(6) row(1))

2.5. Replicate 2.3., weighting each observation by household size. That is, give more weight to larger households. Add a note to the graph that explains this weighting. Play around with the marker fill and outline colours so that the graph is more readable. For example, you can adjust the opacity of the fill colour of each marker by including the option: mfc(blue%20).

. scatter P550tpr P344pr [w = hhsize], mfc(teal%10) mlc(teal)  note("Weighted by household size")
(analytic weights assumed)
(analytic weights assumed)
(analytic weights assumed)

Part 3: Generating variables

In Stata you the common command for creating a new variables is generate. There are other operators, like egen and recode, that have more specific functions.

3.1. Use the gen command to create a new variable equal to the consumption expenditure per pound of income variable apc=P550tpr/P344pr. Next, label the variable “Average Propensity to Consume”. Hint: type help label. To check that the new label has been applied, type describe apc to check that it has worked.

. gen apc = P550tpr/P344pr
(12 missing values generated)

. lab var apc "Average Propensity to Consume"

. des apc

Variable      Storage   Display    Value
    name         type    format    label      Variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
apc             float   %9.0g                 Average Propensity to Consume

3.2. The variable p550tpr contains some very strange values at the upper end of the distribution: type sum P550tpr, det. Use the command _pctile p550tpr, p(99) in order to find the value above which 1% of points of P550tpr lie.

. sum P550tpr, det

                Total expenditure (top coded)
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%        64.05         30.525
 5%     111.0905         31.952
10%     153.5125       32.13598       Obs               5,144
25%     254.0006         32.355       Sum of wgt.       5,144

50%     419.9035                      Mean           479.7584
                        Largest       Std. dev.      292.3652
75%     645.0512           1175
90%     934.4652           1175       Variance       85477.43
95%     1171.163           1175       Skewness       .8262061
99%         1175           1175       Kurtosis       2.937132

. _pctile P550tpr, p(99)

3.3. Use the above information to create a new variable (called exp) that is a duplicate of P55tpr, but replaces the top 1% of values to missing. In Stata missing values for numerical variables take the value “.”. You can also use {“.a”, “.b”,…} if you want to assign different categories of missing. For string variables, a missing value is just an empty string: ““.

Try to avoid using explicit numerical values in your code. For example, instead of copying the 99th percentile from question 14, use the Stata’s stored value. The command _pctile is an r-class command (many estimation commands are e-class). You can see the stored values with: return list.

. return list

scalars:
                 r(r1) =  1175

. gen exp = P550tpr if P550tpr < r(r1)
(256 missing values generated)

. 
. * alternatively, 
. gen exp2 = P550tpr

. replace exp2 = . if P550tpr >= r(r1)
(256 real changes made, 256 to missing)

. compare exp exp2

                                        ────────── Difference ──────────
                            Count       Minimum      Average     Maximum
────────────────────────────────────────────────────────────────────────
exp=exp2                     4888
                       ──────────
Jointly defined              4888             0            0           0
Jointly missing               256
                       ──────────
Total                        5144

. 
. * check:
. sum exp, det

                             exp
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%     63.52271         30.525
 5%     109.0927         31.952
10%     149.5541       32.13598       Obs               4,888
25%     245.8819         32.355       Sum of wgt.       4,888

50%      402.132                      Mean           443.3464
                        Largest       Std. dev.      251.6121
75%     594.9666       1168.479
90%     810.3289        1169.66       Variance       63308.64
95%     942.6381       1171.163       Skewness       .6960889
99%     1108.645       1172.356       Kurtosis       2.857423

3.4. Produce a suitably labelled histogram of exp.

. hist exp, xtitle(Expenditure) percent fc(maroon%50) lc(maroon) width(10) xlabel(0(100)1200) xmtick(0(10)1200)
(bin=115, start=30.525, width=10)

3.5. Using the approach from 3.3, replace the top 1% of values in apc to missing.

. _pctile apc, p(99)

. return list

scalars:
                 r(r1) =  3.661458730697632

. replace apc = . if apc>=r(r1)
(52 real changes made, 52 to missing)

3.6. Produce suitably labelled horizontal histogram of apc and export this graph as a .pdf file. Hint: help graph export.

. hist apc, xtitle(Propensity to consume) percent horizontal width(0.1) ymtick(0(0.1)4) fc(gold%50) lc(gold)
(bin=36, start=.1207175, width=.1)

. graph export hist-apc.pdf, replace
file hist-apc.pdf saved as PDF format

Part 4: Summary statistics

4.1. Use the tabulate and/or table commands to learn about the frequency distributions of P425r, A172, SexHRP, Gorx, and A049r. Note, if you you can also cross-tabulate two variables; for example, tab P435r A172.

. tab P425r

  Main source │
 of household │
       income │
    (recoded) │      Freq.     Percent        Cum.
──────────────┼───────────────────────────────────
Earned income │      2,841       55.23       55.23
 Other income │      2,303       44.77      100.00
──────────────┼───────────────────────────────────
        Total │      5,144      100.00

. tab A172

   Internet │
 connection │
         in │
  household │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
        Yes │      4,232       82.27       82.27
         No │        912       17.73      100.00
────────────┼───────────────────────────────────
      Total │      5,144      100.00

. tab SexHRP

     Sex of │
  Household │
  Reference │
     Person │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
       Male │      3,156       61.35       61.35
     Female │      1,988       38.65      100.00
────────────┼───────────────────────────────────
      Total │      5,144      100.00

. tab Gorx

 Government Office Region │
                 modified │      Freq.     Percent        Cum.
──────────────────────────┼───────────────────────────────────
               North East │        251        4.88        4.88
North West and Merseyside │        585       11.37       16.25
 Yorkshire and the Humber │        462        8.98       25.23
            East Midlands │        424        8.24       33.48
            West Midlands │        526       10.23       43.70
                  Eastern │        497        9.66       53.36
                   London │        480        9.33       62.69
               South East │        681       13.24       75.93
               South West │        429        8.34       84.27
                    Wales │        246        4.78       89.06
                 Scotland │        412        8.01       97.06
         Northern Ireland │        151        2.94      100.00
──────────────────────────┼───────────────────────────────────
                    Total │      5,144      100.00

. tab A049r

 Household size; │
       number of │
   persons in HH │
       (recoded) │      Freq.     Percent        Cum.
─────────────────┼───────────────────────────────────
        1 person │      1,434       27.88       27.88
        2 person │      1,926       37.44       65.32
        3 person │        779       15.14       80.46
        4 person │        670       13.02       93.49
5 person or more │        335        6.51      100.00
─────────────────┼───────────────────────────────────
           Total │      5,144      100.00

4.2. Report summary statistics for the variable exp separately by main source of income (P425r). Since, exp is a continuous variable you do not want to use tabulate. Instead, use the summarize command. This can be combined with a categorical in a few ways: (1) tab catvar, sum(continvar); bysort catvar: sum continvar, det; (3) table catvar, stat(mean continvar sd continvar count continvar). The table option is the most flexible, as there is a wider range of statistics available; including percentiles.

. tab P425r, sum(exp)

Main source │
         of │
  household │
     income │           Summary of exp
  (recoded) │        Mean   Std. dev.       Freq.
────────────┼────────────────────────────────────
  Earned in │   540.39831   237.40254       2,637
  Other inc │   329.65214   217.95931       2,251
────────────┼────────────────────────────────────
      Total │   443.34642   251.61209       4,888

. bysort P425r: sum exp, det

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Earned i

                             exp
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%     125.4465         38.945
 5%     211.6911       49.90686
10%     252.5987         62.945       Obs               2,637
25%      362.783       64.01518       Sum of wgt.       2,637

50%     503.3182                      Mean           540.3983
                        Largest       Std. dev.      237.4025
75%     688.3661       1168.479
90%     891.5864        1169.66       Variance       56359.97
95%     997.3179       1171.163       Skewness       .5331345
99%     1126.041       1172.356       Kurtosis       2.665763

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Other in

                             exp
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%        52.56         30.525
 5%       84.535         31.952
10%       107.92       32.13598       Obs               2,251
25%     169.7276         32.355       Sum of wgt.       2,251

50%     271.4606                      Mean           329.6521
                        Largest       Std. dev.      217.9593
75%     432.3063       1147.506
90%     646.2236       1154.613       Variance       47506.26
95%      780.106       1159.285       Skewness       1.265761
99%     1018.757       1161.778       Kurtosis       4.424679


. table P425r, stat(mean exp) stat(sd exp) stat(count exp)

──────────────────────────────────────────┬─────────────────────────────────────────────────────────────
                                          │      Mean   Standard deviation   Number of nonmissing values
──────────────────────────────────────────┼─────────────────────────────────────────────────────────────
Main source of household income (recoded) │                                                             
  Earned income                           │  540.3983             237.4025                         2,637
  Other income                            │  329.6521             217.9593                         2,251
  Total                                   │  443.3464             251.6121                         4,888
──────────────────────────────────────────┴─────────────────────────────────────────────────────────────

4.3. Produce a suitably labelled histogram of exp separately for households according to their main source of income (P425r).

. hist exp, by(P425r, note("") title("Distribution of expenditure by main source of income")) percent  xtitle(Expenditure) lc(eltgreen) fc(eltgreen%50)

4.4. Construct summary statistics for exp for households according to both the main source of income (P425r), and internet connection (A172). Here, it is best to use the table command.

. table P425r A172, stat(mean exp)

──────────────────────────────────────────┬──────────────────────────────────────
                                          │    Internet connection in household  
                                          │         Yes           No        Total
──────────────────────────────────────────┼──────────────────────────────────────
Main source of household income (recoded) │                                      
  Earned income                           │    554.3448     321.5789     540.3983
  Other income                            │    389.7849     209.0654     329.6521
  Total                                   │    492.2577     228.6654     443.3464
──────────────────────────────────────────┴──────────────────────────────────────

. 
. * alternatively, 
. bys P425r A172 : sum exp

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Earned i, A172 = Yes

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         exp │      2,479    554.3448    234.2653     38.945   1172.356

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Earned i, A172 = No

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         exp │        158    321.5789    169.7532   64.01518   1034.258

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Other in, A172 = Yes

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         exp │      1,502    389.7849    226.8636      45.58   1161.778

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Other in, A172 = No

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         exp │        749    209.0654    133.4172     30.525   982.8839

4.5. Export an the table from 4.2. to an Excel spreadsheet using dtable.

. dtable exp, export(sumstats, as(xlsx) replace) by(P425r)

─────────────────────────────────────────────────────────
          Main source of household income (recoded)      
      Earned income      Other income         Total      
─────────────────────────────────────────────────────────
N       2,841 (55.2%)     2,303 (44.8%)    5,144 (100.0%)
exp 540.398 (237.403) 329.652 (217.959) 443.346 (251.612)
─────────────────────────────────────────────────────────
(collection DTable exported to file sumstats.xlsx)

Part 5: Basic hypothesis tests

5.1. At the 1% significance level, test the hypothesis of no difference in exp between households with earnings and other sources as their main source of income (P425r). Hint: help ttest. Do consider the parameters of the test: one-side or two-sided, equal or unequal variance, significance level.

. ttest exp, by(P425r) unequal level(99)  

Two-sample t test with unequal variances
─────────┬────────────────────────────────────────────────────────────────────
   Group │     Obs        Mean    Std. err.   Std. dev.   [99% conf. interval]
─────────┼────────────────────────────────────────────────────────────────────
Earned i │   2,637    540.3983    4.623068    237.4025    528.4814    552.3152
Other in │   2,251    329.6521    4.593965    217.9593    317.8088    341.4955
─────────┼────────────────────────────────────────────────────────────────────
Combined │   4,888    443.3464    3.598868    251.6121    434.0727    452.6201
─────────┼────────────────────────────────────────────────────────────────────
    diff │            210.7462    6.517459                193.9517    227.5406
─────────┴────────────────────────────────────────────────────────────────────
    diff = mean(Earned i) - mean(Other in)                        t =  32.3356
H0: diff = 0                     Satterthwaite's degrees of freedom =  4860.16

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

5.2. At the 1% significance level, test the hypothesis of no differences in exp between households with earnings and other sources as their main source of income (P425r) by internet connection (A172).

. bys A172 : ttest exp, by(P425r) unequal level(99)

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> A172 = Yes

Two-sample t test with unequal variances
─────────┬────────────────────────────────────────────────────────────────────
   Group │     Obs        Mean    Std. err.   Std. dev.   [99% conf. interval]
─────────┼────────────────────────────────────────────────────────────────────
Earned i │   2,479    554.3448    4.705109    234.2653    542.2159    566.4737
Other in │   1,502    389.7849    5.853691    226.8636    374.6876    404.8822
─────────┼────────────────────────────────────────────────────────────────────
Combined │   3,981    492.2577    3.880372    244.8327    482.2577    502.2577
─────────┼────────────────────────────────────────────────────────────────────
    diff │            164.5599    7.510243                145.2034    183.9164
─────────┴────────────────────────────────────────────────────────────────────
    diff = mean(Earned i) - mean(Other in)                        t =  21.9114
H0: diff = 0                     Satterthwaite's degrees of freedom =  3246.26

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> A172 = No

Two-sample t test with unequal variances
─────────┬────────────────────────────────────────────────────────────────────
   Group │     Obs        Mean    Std. err.   Std. dev.   [99% conf. interval]
─────────┼────────────────────────────────────────────────────────────────────
Earned i │     158    321.5789    13.50484    169.7532    286.3649    356.7928
Other in │     749    209.0654     4.87496    133.4172    196.4763    221.6546
─────────┼────────────────────────────────────────────────────────────────────
Combined │     907    228.6654    4.870152    146.6717    216.0942    241.2365
─────────┼────────────────────────────────────────────────────────────────────
    diff │            112.5134    14.35779                75.17385     149.853
─────────┴────────────────────────────────────────────────────────────────────
    diff = mean(Earned i) - mean(Other in)                        t =   7.8364
H0: diff = 0                     Satterthwaite's degrees of freedom =  199.869

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

5.3. From the variable P344tp create a suitably labelled binary variable inc_m which is =1 if above median income, and =0 otherwise. Label the values 0 and 1 within this variable.

. _pctile P344pr, p(50)   /*50 says it is the median that is being saved in r(r1) */

. gen inc_m=(P344pr>`r(r1)')   /* creates a dummy for those observations above the median */

. lab var inc_m "Above median income"

. tab inc_m

      Above │
     median │
     income │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          0 │      2,572       50.00       50.00
          1 │      2,572       50.00      100.00
────────────┼───────────────────────────────────
      Total │      5,144      100.00

. lab def inc_m 0 "Below median" 1 "above median"

. lab val inc_m inc_m

. tab inc_m

Above median │
      income │      Freq.     Percent        Cum.
─────────────┼───────────────────────────────────
Below median │      2,572       50.00       50.00
above median │      2,572       50.00      100.00
─────────────┼───────────────────────────────────
       Total │      5,144      100.00

5.4. At the 1% significance level, test for a difference in the mean expenditure (exp) of those above and below median income.

. ttest exp, by(inc_m) unequal level(99)

Two-sample t test with unequal variances
─────────┬────────────────────────────────────────────────────────────────────
   Group │     Obs        Mean    Std. err.   Std. dev.   [99% conf. interval]
─────────┼────────────────────────────────────────────────────────────────────
Below me │   2,560    299.5183    3.491319    176.6483    290.5186    308.5181
above me │   2,328    601.5079    4.673948    225.5149    589.4587    613.5571
─────────┼────────────────────────────────────────────────────────────────────
Combined │   4,888    443.3464    3.598868    251.6121    434.0727    452.6201
─────────┼────────────────────────────────────────────────────────────────────
    diff │           -301.9895     5.83396               -317.0233   -286.9557
─────────┴────────────────────────────────────────────────────────────────────
    diff = mean(Below me) - mean(above me)                        t = -51.7641
H0: diff = 0                     Satterthwaite's degrees of freedom =  4402.02

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

5.5. From the variable P344tp create a categorical variable inc_cat based on the quintiles of P344tp. Label the variableinc_cat and the 5 values within this variable. You can either do this manually or explore help _pctile.

. * manual approach
. _pctile P344pr, p(20, 40, 60, 80)

. gen inc_cat=1 if( P344pr<`r(r1)')
(4,116 missing values generated)

. replace inc_cat=2 if(P344pr>=`r(r1)' & P344pr<`r(r2)')
(1,029 real changes made)

. replace inc_cat=3 if(P344pr>=`r(r2)' & P344pr<`r(r3)')
(1,029 real changes made)

. replace inc_cat=4 if(P344pr>=`r(r3)' & P344pr<`r(r4)')
(1,029 real changes made)

. replace inc_cat=5 if(P344pr>=`r(r4)')
(1,029 real changes made)

. tab inc_cat

    inc_cat │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          1 │      1,028       19.98       19.98
          2 │      1,029       20.00       39.99
          3 │      1,029       20.00       59.99
          4 │      1,029       20.00       80.00
          5 │      1,029       20.00      100.00
────────────┼───────────────────────────────────
      Total │      5,144      100.00

. lab var inc_cat "Income quintile"

. lab def inc_cat 1 "1st" 2 "2nd" 3 "3rd" 4 "4th" 5 "5th"

. lab val inc_cat inc_cat

. tab inc_cat

     Income │
   quintile │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
        1st │      1,028       19.98       19.98
        2nd │      1,029       20.00       39.99
        3rd │      1,029       20.00       59.99
        4th │      1,029       20.00       80.00
        5th │      1,029       20.00      100.00
────────────┼───────────────────────────────────
      Total │      5,144      100.00

. 
. * xtile option
. xtile inc_cat2 = P344pr, n(5)

. 
. * check they are the same
. tab inc_cat inc_cat2, m

    Income │                 5 quantiles of P344pr
  quintile │         1          2          3          4          5 │     Total
───────────┼───────────────────────────────────────────────────────┼──────────
       1st │     1,028          0          0          0          0 │     1,028 
       2nd │         1      1,028          0          0          0 │     1,029 
       3rd │         0          1      1,028          0          0 │     1,029 
       4th │         0          0          1      1,028          0 │     1,029 
       5th │         0          0          0          1      1,028 │     1,029 
───────────┼───────────────────────────────────────────────────────┼──────────
     Total │     1,029      1,029      1,029      1,029      1,028 │     5,144 

5.6. Using inc_cat, produce a suitably labelled plot of the mean of exp for each of the 5 income categories. Try to use the graph bar function.

. graph bar (mean) exp, over(inc_cat) blabel(bar)

5.7. Produce a plot the mean of exp against inc_cat, separately by internet connection (A172).

. label def internet 1 "Has internet" 2 "No internet"

. lab val A172 internet

. graph bar (mean) exp, over(inc_cat) blabel(bar, format(%3.0f)) by(A172, title("Average expenditure by income quintile" "and internet connection") note("")) ytitle(Expenditure)  inte
> nsity(*.5) bar(1,color(erose))

. * alternatively,
. graph bar (mean) exp, over(inc_cat) over(A172) blabel(bar, format(%3.0f)) ytitle(Expenditure) title("Average expenditure by income quintile" "and internet connection")  intensity(*.
> 5) bar(1,color(emidblue))

Part 6: Loops

A loop is an operator that repeats the same operation through a given sequence. The three main loops are foreach, forvalues, and while. Loops can be used to make code more efficient and also in the design of your own programme/function (see help program). Within a loop, you must refer to the running argument using local notation. For example, the following loop creates a variable that is equal to it’s row value for rows 1 through 10.

. gen number = .
(5,144 missing values generated)

. forvalues i = 1/10 {
  2. dis "Number `i'"
  3. replace number  = `i' in `i'
  4. }
Number 1
(1 real change made)
Number 2
(1 real change made)
Number 3
(1 real change made)
Number 4
(1 real change made)
Number 5
(1 real change made)
Number 6
(1 real change made)
Number 7
(1 real change made)
Number 8
(1 real change made)
Number 9
(1 real change made)
Number 10
(1 real change made)

. list number in 1/10

     ┌────────┐
     │ number │
     ├────────┤
  1. │      1 │
  2. │      2 │
  3. │      3 │
  4. │      4 │
  5. │      5 │
     ├────────┤
  6. │      6 │
  7. │      7 │
  8. │      8 │
  9. │      9 │
 10. │     10 │
     └────────┘

The curly brackets denote the beginning an end of the actions included in each loop. You can use the display command like print() from other programming languages. If you want to add conditions to a loop, you can do so using if, else if, and else statements. For example,

. gen str even = ""
(5,144 missing values generated)

. forvalues i = 1/10 {
  2. if mod(`i',2) == 0 { 
  3. dis "Number `i' is even"
  4. replace even  = "even" in `i'
  5. }
  6. else {
  7. dis "Number `i' is odd"    
  8. replace even = "odd" in `i'
  9. }
 10. }
Number 1 is odd
variable even was str1 now str3
(1 real change made)
Number 2 is even
variable even was str3 now str4
(1 real change made)
Number 3 is odd
(1 real change made)
Number 4 is even
(1 real change made)
Number 5 is odd
(1 real change made)
Number 6 is even
(1 real change made)
Number 7 is odd
(1 real change made)
Number 8 is even
(1 real change made)
Number 9 is odd
(1 real change made)
Number 10 is even
(1 real change made)

. list number even in 1/10

     ┌───────────────┐
     │ number   even │
     ├───────────────┤
  1. │      1    odd │
  2. │      2   even │
  3. │      3    odd │
  4. │      4   even │
  5. │      5    odd │
     ├───────────────┤
  6. │      6   even │
  7. │      7    odd │
  8. │      8   even │
  9. │      9    odd │
 10. │     10   even │
     └───────────────┘

6.1. Repeat exercise 4.1., where you tabulated each of the variables, using a foreach loop.

. foreach var in P425r A172 SexHRP Gorx A049r {
  2. tab `var'
  3. }

  Main source │
 of household │
       income │
    (recoded) │      Freq.     Percent        Cum.
──────────────┼───────────────────────────────────
Earned income │      2,841       55.23       55.23
 Other income │      2,303       44.77      100.00
──────────────┼───────────────────────────────────
        Total │      5,144      100.00

    Internet │
  connection │
in household │      Freq.     Percent        Cum.
─────────────┼───────────────────────────────────
Has internet │      4,232       82.27       82.27
 No internet │        912       17.73      100.00
─────────────┼───────────────────────────────────
       Total │      5,144      100.00

     Sex of │
  Household │
  Reference │
     Person │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
       Male │      3,156       61.35       61.35
     Female │      1,988       38.65      100.00
────────────┼───────────────────────────────────
      Total │      5,144      100.00

 Government Office Region │
                 modified │      Freq.     Percent        Cum.
──────────────────────────┼───────────────────────────────────
               North East │        251        4.88        4.88
North West and Merseyside │        585       11.37       16.25
 Yorkshire and the Humber │        462        8.98       25.23
            East Midlands │        424        8.24       33.48
            West Midlands │        526       10.23       43.70
                  Eastern │        497        9.66       53.36
                   London │        480        9.33       62.69
               South East │        681       13.24       75.93
               South West │        429        8.34       84.27
                    Wales │        246        4.78       89.06
                 Scotland │        412        8.01       97.06
         Northern Ireland │        151        2.94      100.00
──────────────────────────┼───────────────────────────────────
                    Total │      5,144      100.00

 Household size; │
       number of │
   persons in HH │
       (recoded) │      Freq.     Percent        Cum.
─────────────────┼───────────────────────────────────
        1 person │      1,434       27.88       27.88
        2 person │      1,926       37.44       65.32
        3 person │        779       15.14       80.46
        4 person │        670       13.02       93.49
5 person or more │        335        6.51      100.00
─────────────────┼───────────────────────────────────
           Total │      5,144      100.00

6.2. Using a forvalues loop, create 5 dummy variables (hhs1,…,hhs5), each indicating one of the household size values (hhsize). Check that each new variable takes on the right values by including the following in your loop: tab hhs1 hhsize, miss.

. forvalues i = 1/5 {
  2. gen hhs`i' = hhsize==`i'
  3. tab hhs`i' hhsize, miss
  4. }    

           │     Household size, number of people in household
           │                (recoded)formerly A049r
      hhs1 │         1          2          3          4          5 │     Total
───────────┼───────────────────────────────────────────────────────┼──────────
         0 │         0      1,926        779        670        335 │     3,710 
         1 │     1,434          0          0          0          0 │     1,434 
───────────┼───────────────────────────────────────────────────────┼──────────
     Total │     1,434      1,926        779        670        335 │     5,144 

           │     Household size, number of people in household
           │                (recoded)formerly A049r
      hhs2 │         1          2          3          4          5 │     Total
───────────┼───────────────────────────────────────────────────────┼──────────
         0 │     1,434          0        779        670        335 │     3,218 
         1 │         0      1,926          0          0          0 │     1,926 
───────────┼───────────────────────────────────────────────────────┼──────────
     Total │     1,434      1,926        779        670        335 │     5,144 

           │     Household size, number of people in household
           │                (recoded)formerly A049r
      hhs3 │         1          2          3          4          5 │     Total
───────────┼───────────────────────────────────────────────────────┼──────────
         0 │     1,434      1,926          0        670        335 │     4,365 
         1 │         0          0        779          0          0 │       779 
───────────┼───────────────────────────────────────────────────────┼──────────
     Total │     1,434      1,926        779        670        335 │     5,144 

           │     Household size, number of people in household
           │                (recoded)formerly A049r
      hhs4 │         1          2          3          4          5 │     Total
───────────┼───────────────────────────────────────────────────────┼──────────
         0 │     1,434      1,926        779          0        335 │     4,474 
         1 │         0          0          0        670          0 │       670 
───────────┼───────────────────────────────────────────────────────┼──────────
     Total │     1,434      1,926        779        670        335 │     5,144 

           │     Household size, number of people in household
           │                (recoded)formerly A049r
      hhs5 │         1          2          3          4          5 │     Total
───────────┼───────────────────────────────────────────────────────┼──────────
         0 │     1,434      1,926        779        670          0 │     4,809 
         1 │         0          0          0          0        335 │       335 
───────────┼───────────────────────────────────────────────────────┼──────────
     Total │     1,434      1,926        779        670        335 │     5,144 

Part 7: Partitioned regression and dummy variable projections (Metrics B only)

7.1 Demonstrate partitioned regression result

. _pctile P344pr, p(99)

. gen inc = P344pr if P344pr < r(r1)
(785 missing values generated)

. 
. est clear

. eststo: reg exp inc 

      Source │       SS           df       MS      Number of obs   =     4,279
─────────────┼──────────────────────────────────   F(1, 4277)      =   2854.45
       Model │  87293780.3         1  87293780.3   Prob > F        =    0.0000
    Residual │   130797888     4,277  30581.6899   R-squared       =    0.4003
─────────────┼──────────────────────────────────   Adj R-squared   =    0.4001
       Total │   218091668     4,278  50979.8196   Root MSE        =    174.88

─────────────┬────────────────────────────────────────────────────────────────
         exp │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
         inc │   .4941574   .0092492    53.43   0.000     .4760242    .5122907
       _cons │   147.1124   5.450921    26.99   0.000     136.4258    157.7991
─────────────┴────────────────────────────────────────────────────────────────
(est1 stored)

. eststo: reg exp inc hhsize

      Source │       SS           df       MS      Number of obs   =     4,279
─────────────┼──────────────────────────────────   F(2, 4276)      =   1606.17
       Model │  93556576.4         2  46778288.2   Prob > F        =    0.0000
    Residual │   124535092     4,276  29124.2029   R-squared       =    0.4290
─────────────┼──────────────────────────────────   Adj R-squared   =    0.4287
       Total │   218091668     4,278  50979.8196   Root MSE        =    170.66

─────────────┬────────────────────────────────────────────────────────────────
         exp │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
         inc │   .4322739   .0099639    43.38   0.000     .4127395    .4518084
      hhsize │   36.41656   2.483372    14.66   0.000     31.54786    41.28526
       _cons │   99.33041   6.238092    15.92   0.000     87.10051    111.5603
─────────────┴────────────────────────────────────────────────────────────────
(est2 stored)

Next, create residuals from regressing inc and exp against hhsize

. reg inc hhsize

      Source │       SS           df       MS      Number of obs   =     4,359
─────────────┼──────────────────────────────────   F(1, 4357)      =    942.99
       Model │  65502874.6         1  65502874.6   Prob > F        =    0.0000
    Residual │   302651300     4,357  69463.2316   R-squared       =    0.1779
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1777
       Total │   368154175     4,358  84477.7821   Root MSE        =    263.56

─────────────┬────────────────────────────────────────────────────────────────
         inc │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
      hhsize │   105.2338   3.426909    30.71   0.000     98.51532    111.9523
       _cons │   287.5106   8.523331    33.73   0.000     270.8005    304.2206
─────────────┴────────────────────────────────────────────────────────────────

. predict double M2inc, resid
(785 missing values generated)

. 
. reg exp hhsize

      Source │       SS           df       MS      Number of obs   =     4,888
─────────────┼──────────────────────────────────   F(1, 4886)      =   1269.03
       Model │  63789382.1         1  63789382.1   Prob > F        =    0.0000
    Residual │   245599961     4,886  50266.0583   R-squared       =    0.2062
─────────────┼──────────────────────────────────   Adj R-squared   =    0.2060
       Total │   309389343     4,887   63308.644   Root MSE        =     224.2

─────────────┬────────────────────────────────────────────────────────────────
         exp │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
      hhsize │   96.62228   2.712317    35.62   0.000     91.30492    101.9396
       _cons │   222.3487   6.983517    31.84   0.000     208.6578    236.0395
─────────────┴────────────────────────────────────────────────────────────────

. predict double M2exp, resid
(256 missing values generated)

Estimate model with residualized variables

. eststo: reg exp M2inc

      Source │       SS           df       MS      Number of obs   =     4,279
─────────────┼──────────────────────────────────   F(1, 4277)      =   1440.59
       Model │  54949821.1         1  54949821.1   Prob > F        =    0.0000
    Residual │   163141847     4,277  38143.9904   R-squared       =    0.2520
─────────────┼──────────────────────────────────   Adj R-squared   =    0.2518
       Total │   218091668     4,278  50979.8196   Root MSE        =     195.3

─────────────┬────────────────────────────────────────────────────────────────
         exp │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
       M2inc │   .4327982   .0114029    37.96   0.000     .4104426    .4551538
       _cons │   402.5694   2.985992   134.82   0.000     396.7153    408.4235
─────────────┴────────────────────────────────────────────────────────────────
(est3 stored)

. eststo: reg M2exp M2inc

      Source │       SS           df       MS      Number of obs   =     4,279
─────────────┼──────────────────────────────────   F(1, 4277)      =   1863.15
       Model │  54792875.5         1  54792875.5   Prob > F        =    0.0000
    Residual │   125781329     4,277  29408.7745   R-squared       =    0.3034
─────────────┼──────────────────────────────────   Adj R-squared   =    0.3033
       Total │   180574204     4,278  42209.9589   Root MSE        =    171.49

─────────────┬────────────────────────────────────────────────────────────────
       M2exp │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
       M2inc │   .4321797   .0100125    43.16   0.000     .4125501    .4518093
       _cons │  -30.88738   2.621887   -11.78   0.000    -36.02764   -25.74712
─────────────┴────────────────────────────────────────────────────────────────
(est4 stored)

. eststo: reg M2exp inc

      Source │       SS           df       MS      Number of obs   =     4,279
─────────────┼──────────────────────────────────   F(1, 4277)      =   1175.17
       Model │  38921396.6         1  38921396.6   Prob > F        =    0.0000
    Residual │   141652808     4,277  33119.6651   R-squared       =    0.2155
─────────────┼──────────────────────────────────   Adj R-squared   =    0.2154
       Total │   180574204     4,278  42209.9589   Root MSE        =    181.99

─────────────┬────────────────────────────────────────────────────────────────
       M2exp │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
         inc │    .329965   .0096254    34.28   0.000     .3110943    .3488357
       _cons │  -202.0139   5.672599   -35.61   0.000    -213.1352   -190.8927
─────────────┴────────────────────────────────────────────────────────────────
(est5 stored)

. 
. esttab, se stat(r2 rss mss F df_m N) rename(M2inc inc)

────────────────────────────────────────────────────────────────────────────────────────────
                      (1)             (2)             (3)             (4)             (5)   
                      exp             exp             exp           M2exp           M2exp   
────────────────────────────────────────────────────────────────────────────────────────────
inc                 0.494***        0.432***        0.433***        0.432***        0.330***
                (0.00925)       (0.00996)        (0.0114)        (0.0100)       (0.00963)   

hhsize                              36.42***                                                
                                  (2.483)                                                   

_cons               147.1***        99.33***        402.6***       -30.89***       -202.0***
                  (5.451)         (6.238)         (2.986)         (2.622)         (5.673)   
────────────────────────────────────────────────────────────────────────────────────────────
r2                  0.400           0.429           0.252           0.303           0.216   
rss           130797887.8     124535091.7     163141847.0     125781328.7     141652807.6   
mss            87293780.3      93556576.4      54949821.1      54792875.5      38921396.6   
F                  2854.4          1606.2          1440.6          1863.1          1175.2   
df_m                    1               2               1               1               1   
N                    4279            4279            4279            4279            4279   
────────────────────────────────────────────────────────────────────────────────────────────
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Models 2,3,4 yield the same coefficient (up to some rounding error). What about the SE’s? Between 3 and 4, which gives you the correct projection and SE’s? In model 3, we have artificially inflated the amount of variation in the outcome by failing to residualize it. The estimator has a higher SE, as the residual variation is higher compared to 2 and 4.

Question: should the other model stats be the same? For example, R2?

7.2 Demonstrate equality of projections with base category.

. tab P425r, gen(inc_src)

  Main source │
 of household │
       income │
    (recoded) │      Freq.     Percent        Cum.
──────────────┼───────────────────────────────────
Earned income │      2,841       55.23       55.23
 Other income │      2,303       44.77      100.00
──────────────┼───────────────────────────────────
        Total │      5,144      100.00

. 
. est clear

. eststo: reg exp inc inc_src2

      Source │       SS           df       MS      Number of obs   =     4,279
─────────────┼──────────────────────────────────   F(2, 4276)      =   1438.20
       Model │  87707427.2         2  43853713.6   Prob > F        =    0.0000
    Residual │   130384241     4,276   30492.105   R-squared       =    0.4022
─────────────┼──────────────────────────────────   Adj R-squared   =    0.4019
       Total │   218091668     4,278  50979.8196   Root MSE        =    174.62

─────────────┬────────────────────────────────────────────────────────────────
         exp │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
         inc │   .4729693   .0108807    43.47   0.000     .4516374    .4943012
    inc_src2 │  -23.17541   6.292255    -3.68   0.000     -35.5115   -10.83933
       _cons │   169.8991   8.240186    20.62   0.000      153.744    186.0541
─────────────┴────────────────────────────────────────────────────────────────
(est1 stored)

. eststo: reg exp inc inc_src1

      Source │       SS           df       MS      Number of obs   =     4,279
─────────────┼──────────────────────────────────   F(2, 4276)      =   1438.20
       Model │  87707427.2         2  43853713.6   Prob > F        =    0.0000
    Residual │   130384241     4,276   30492.105   R-squared       =    0.4022
─────────────┼──────────────────────────────────   Adj R-squared   =    0.4019
       Total │   218091668     4,278  50979.8196   Root MSE        =    174.62

─────────────┬────────────────────────────────────────────────────────────────
         exp │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
         inc │   .4729693   .0108807    43.47   0.000     .4516374    .4943012
    inc_src1 │   23.17541   6.292255     3.68   0.000     10.83933     35.5115
       _cons │   146.7236   5.443954    26.95   0.000     136.0507    157.3966
─────────────┴────────────────────────────────────────────────────────────────
(est2 stored)

. eststo: reg exp inc inc_src1 inc_src2, nocons

      Source │       SS           df       MS      Number of obs   =     4,279
─────────────┼──────────────────────────────────   F(3, 4276)      =   8477.17
       Model │   775460099         3   258486700   Prob > F        =    0.0000
    Residual │   130384241     4,276   30492.105   R-squared       =    0.8561
─────────────┼──────────────────────────────────   Adj R-squared   =    0.8560
       Total │   905844340     4,279  211695.335   Root MSE        =    174.62

─────────────┬────────────────────────────────────────────────────────────────
         exp │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
         inc │   .4729693   .0108807    43.47   0.000     .4516374    .4943012
    inc_src1 │   169.8991   8.240186    20.62   0.000      153.744    186.0541
    inc_src2 │   146.7236   5.443954    26.95   0.000     136.0507    157.3966
─────────────┴────────────────────────────────────────────────────────────────
(est3 stored)

. 
. esttab, se stat(r2 rss mss F df_m N)

────────────────────────────────────────────────────────────
                      (1)             (2)             (3)   
                      exp             exp             exp   
────────────────────────────────────────────────────────────
inc                 0.473***        0.473***        0.473***
                 (0.0109)        (0.0109)        (0.0109)   

inc_src2           -23.18***                        146.7***
                  (6.292)                         (5.444)   

inc_src1                            23.18***        169.9***
                                  (6.292)         (8.240)   

_cons               169.9***        146.7***                
                  (8.240)         (5.444)                   
────────────────────────────────────────────────────────────
r2                  0.402           0.402           0.856   
rss           130384240.9     130384240.9     130384240.9   
mss            87707427.2      87707427.2     775460098.6   
F                  1438.2          1438.2          8477.2   
df_m                    2               2               3   
N                    4279            4279            4279   
────────────────────────────────────────────────────────────
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001

These are essentially the same models and regardless of choice on base category the coefficient (and SE) on inc does not change. Thus, M2 projection is the same in all 3 models. They have the same RSS. You will notice that Stata’s R2 and F-stat are different for model 3. This is because Stata seems to compute a different MSS (or ESS) for model 3. It’s not clear why, by likely has to do with the absence of a constant. If you examine the predicted values, they all have the same variance; suggesting the ESS should be the same. A mystery to be resolve.

. qui reg exp inc inc_src2

. predict exp_hat1
(option xb assumed; fitted values)
(785 missing values generated)

. qui reg exp inc inc_src1

. predict exp_hat2
(option xb assumed; fitted values)
(785 missing values generated)

. qui reg exp inc inc_src1 inc_src2, nocons

. predict exp_hat3
(option xb assumed; fitted values)
(785 missing values generated)

. 
. sum exp_hat*

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    exp_hat1 │      4,359    403.4237    143.8882   146.7236   730.3263
    exp_hat2 │      4,359    403.4237    143.8882   146.7236   730.3263
    exp_hat3 │      4,359    403.4237    143.8882   146.7236   730.3263

Postamble

Before you close finish up ensure that your do-file includes the follow at the end,

log close

And check that the do-file runs without error, replicating all of your results. You can do this using the “do” icon at the top right of the do-file editing window.

Finally, don’t forget to save changes to your do-file and do NOT save changes to the data when you close Stata.


  1. Stata formatted datasets use the extension “.dta”. You can open a range of other file types in Stata using the ‘import’ command. Go to “File>Import” in the main Stata window.↩︎

  2. Avoid the editor-icon (image of table and pencil) or “Data>Data Editor>Data Editor (Edit)”. It opens the dataset like browse, but allows you to manually edit cell values, just like in an Excel spreadsheet. You want to avoid doing this as the process is not replicable. More accurately, these actions are recorded based on the row-column value of the cell being editted. They are replicable, but dependent on the sorting of observations (i.e., rows) and order of the variables (i.e., columns), which may change.↩︎