16 Oct 2024
This exercise sheet is designed as an introduction to Stata for those studying Econometrics A & B.
Stata is a powerful and reliable statistical analysis software. It is not a language, like R or Python, but does have command-based functionalities. Those who are familiar with Stata’s commands, will effectively be able to programme in scripts called do-files (named for their file extension “.do”). Stata does have its own mathematical (compiled) language, called Mata, which is used in the programming of many commands. It also includes integration for C, C++, R, and Python.
Stata remains the dominant software/language in applied Economics research (see R-bloggers post). This is a product of history (i.e. luck) and some of Stata’s key strengths. It is a software designed primarily for the analysis of cross-sectional and panel datasets, and has grown its market share with the rise in availability of digitized household survey data (beginning in the 1980s). Stata is very efficient at analyzing small-to-medium sized datasets. The efficiency derives from the fact that it loads the data into memory. With modern computers, that typically have >8GB of RAM, you will have no problem analyzing a household survey.
However, this efficiency is also it’s “Achilles’ heel”. Stata is losing market share in the age of “big data”. While there are more efficient ways to analyse big-ger datasets in Stata, it remains bound by the amount of memory available.
Stata’s dominance in empirical research also relates to its peer
reviewed nature. Commands (packages) that come pre-installed in Stata,
along with those installed from Statistical Software Components (using
the command ssc install), are heavily vetted. For this
reason, Stata is used by consultancies that conduct research for legal
cases.
As the dominant research software, it is important that you have a working knowledge of Stata, even if you are fluent in other programming languages. There might be a research assistant opportunity in your future that requires you to work with Stata. For those less confident with ‘programming’, Stata is an easier introduction to data analysis.
In the market today, a wider range of software and programming-language experience is extremely valuable. We encourage you to develop your knowledge of Stata while investing in languages like R and Python.
The small dataset (filename “data.dta”)1 is based on the Living Cost and Food Survey, 2013 (LCF). The LCF, conducted by the Office for National Statistics, collects information on spending patterns and the cost of living that reflects household budgets across the UK. This teaching dataset is a subset which has been subject to certain simplifications and additions, for learning and teaching by CMIST, Manchester.
This problem set will review a sample of helpful Stata commands. In addition, a do-file with some comments will be provided along with the exercises. Please work through this exercise in conjunction with the do-file. Some of the commands listed below may seem irrelevant for the analysis of this particular dataset. They have been included because of their usefulness in other settings.
a. Create a folder on your computer where you intend to save this project. For example, “…/EC910/Seminars/Seminar-1”.
b. Download the dataset from Moodle - “data.dta” - and save/move it to the above folder.
c. Open Stata
d. Open a new do-file, either using the do-file icon (image of pen and paper) or “Window>Do-file Editor>New Do-file Editor”.
e. Save the new do-file to the project folder; e.g. “problem-set-1.do”.
f. Give the do file a title, author, and date. The
treat text as a comment, you need to use the symbols *,
//, or /* followed by */. For
example,
. * ──────────────────────────────────────────────────────────────────────────── * . * title: Problem Set 1 . * author: Neil Lloyd . * date: 14 October 2024 . * ──────────────────────────────────────────────────────────────────────────── *
g. Open the dataset “data.dta”. You can go to
“File>Open”, then locate the dataset on your computer and then select
“Open”. However, doing so is not replicable. Instead, you can use the
use command from within your do-file.
Add the following line to your do-file:
use "filepath/data.dta", clear
subsituting in the word “filepath” with the full filepath of your
data. Alternatively, you can set a default directory for the project
using cd. If set, this will become the default directory
where Stata searches for a file or saves a file.
cd "filepath"
use data.dta, clear
Note, double quotations over the file path/name are only necessary when the path/name contains blank spaces; e.g. “problem set 1 data.dta”. This is one reason to avoid spaces in names.
Second, the option clear tells Stata to remove any
existing dataset from memory before opening this one. You cannot open
two datasets simultaneously. Any changes to the existing dataset will be
lost unless saved. But this is not necessarily a bad thing. Read on.
h. Open a log file in which to record the output
from Stata. If you have set the default directory using cd,
you can do so as follows:
log using problem-set-1-log, replace
This will save the file using Stata’s own log-file format (file extension “.scml”). It’s better to save them as a simple text-file:
log using problem-set-1-log, replace text
or
log using problem-set-1-log.txt, replace
In this way, they open in a any text editor, like Notepad.
When you open a log-file, it will record all output (in the output
window), until you tell it to stop: log close. This should
appear at the end of your do-file.
As a suggestion, including the following before you open a log-file:
cap log close
log using problem-set-1-log.txt, replace
followed by,
log close
at the end. We won’t explain why, but it will help you avoid an annoying feature of how log-files and do-files interact.
When you open a dataset in Stata, the software creates a copy in memory. Any changes you make to this copy will not affect the version you have saved on your harddrive, UNLESS you save those changes. It is a good principle to not save changes to the dataset, but rather to keep a copy of all changes made as a list of commands in a saved do-file. This ensures replicability.
If you want to create a new dataset, save it under a new name. The command below will do just this.
save new-data.dta, replace
When you close Stata it will ask you if you want to save changes to the opened dataset. Unless you have a clear reason to so, SAY NO. Ensure that your do-file is saved, well organized, and well commented. As a new feature of Stata 18, the software keeps a back-up of unsaved changes to the do-file (with file extension “.stswp”) in case the software or your computer crashes.
1.1. Review the data. You can either use the browse-icon (image of table and magnifying glass) or “Data>Data Editor>Data Editor (Browse)”.2 Else, in your do-file (or the main window’s command line) type,
. browse
or
. br
Stata does not mind if you shorten the command, so long as there is no ambiguity.
browse is not a great command to keep in a do-file. You
want your do-file to run like a programme and browse means
it will open up a visualization of the data each time you run the file,
thereby slowing down future operations.
1.2. Use the describe (or simply
de) command to learn about the variables in the dataset. Do
you know what any of this information means?
. des
Contains data from C:\Users\neil_\OneDrive - University of Warwick\Documents\EC910\seminars\ps1\data.dta
Observations: 5,144
Variables: 14 20 Sep 2020 13:54
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Variable Storage Display Value
name type format label Variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
casenew int %8.0g New random ID number
P550tpr double %12.0g Total expenditure (top coded)
P344pr double %12.0g Gross normal weekly household income (top coded)
P425r byte %8.0g P425r Main source of household income (recoded)
A172 byte %8.0g A172 Internet connection in household
A093r byte %8.0g A093r Economic position of Household Reference Person (recoded)
A094r byte %8.0g A094r NS-SEC 3 Class of Household Reference Person (recoded from NS-SEC 12)
A121r byte %8.0g A121r Tenure - type (recoded)
SexHRP byte %8.0g SexHRP Sex of Household Reference Person
A049r byte %8.0g A049r Household size; number of persons in HH (recoded)
G018r byte %8.0g G018r Number of adults (recoded)
G019r byte %8.0g G019r Number of children (recoded)
Gorx byte %8.0g Gorx Government Office Region modified
hhsize byte %8.0g Household size, number of people in household (recoded)formerly A049r
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by:
1.3. Use the summarize (or simply
su) to recover some basic summary statistics of the
variable P550tpr.
. sum P550tpr
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
P550tpr │ 5,144 479.7584 292.3652 30.525 1175
1.4. Use the detail option to learn
more about the variable P344pr. Are there any abnormal
values or outliers?
. sum P344pr, det
Gross normal weekly household income (top coded)
─────────────────────────────────────────────────────────────
Percentiles Smallest
1% 56 0
5% 138.1 0
10% 180.03 0 Obs 5,144
25% 314.0238 0 Sum of wgt. 5,144
50% 563.15 Mean 620.4336
Largest Std. dev. 359.1557
75% 928.087 1184.99
90% 1184.99 1184.99 Variance 128992.8
95% 1184.99 1184.99 Skewness .2958553
99% 1184.99 1184.99 Kurtosis 1.802326
1.5. The summarize command doesn’t tell
you anything about missing data. Check to see if the number of
observations above matches the total observations in the dataset using
count. Was there any missing data?
. count 5,144
1.6. List the first 10 values of variables
P550tpr and P344pr in the dataset using the
command `list’.
. list P550tpr P344pr in 1/10
┌───────────────────────┐
│ P550tpr P344pr │
├───────────────────────┤
1. │ 255.635 534.39 │
2. │ 305.9577537 781.14 │
3. │ 240.2902021 445.62 │
4. │ 501.4998985 778.4 │
5. │ 465.2381126 643.84 │
├───────────────────────┤
6. │ 524.1410067 1094.18 │
7. │ 255.2775307 525.73 │
8. │ 269.2049452 200 │
9. │ 413.7603846 556.6 │
10. │ 591.3247581 1105.6 │
└───────────────────────┘
1.7. Make a simple frequency table of the values in
A121r using the command tabulate.
. tab A121r
Tenure - type │
(recoded) │ Freq. Percent Cum.
───────────────┼───────────────────────────────────
Public rented │ 880 17.11 17.11
Private rented │ 798 15.51 32.62
Owned │ 3,466 67.38 100.00
───────────────┼───────────────────────────────────
Total │ 5,144 100.00
Here are a list of other commands to help you navigate and learn about the dataset:
lookforcodebook2.1. Make a histogram depicting the frequency of
values in the variable P550tpr. Try make it so that the
y-axis is in percentage-points and not density (Stata’s default). To
check the options of a command, type: help histogram.
. hist P550tpr, percent (bin=37, start=30.525, width=30.931757)
2.2. Modify the above graph so that the width of each bin is £10. Which graph is more informative?
. hist P550tpr, percent width(10) (bin=115, start=30.525, width=10)
2.3. Use Stata’s flexible twoway graph
function to create a scatter plot of of the relationship bewteen
P550tpr and P344pr.
. scatter P550tpr P344pr
2.4. Overlay the above scatter plot with a line
graph of the linear fit between the same two variables. Hint: use the
graphing command lfit. If you can make the colour of the
fitted line red.
. twoway (scatter P550tpr P344pr) (lfit P550tpr P344pr, lc(red)), legend(pos(6) row(1))
2.5. Replicate 2.3., weighting each observation by
household size. That is, give more weight to larger households. Add a
note to the graph that explains this weighting. Play around with the
marker fill and outline colours so that the graph is more readable. For
example, you can adjust the opacity of the fill colour of each marker by
including the option: mfc(blue%20).
. scatter P550tpr P344pr [w = hhsize], mfc(teal%10) mlc(teal) note("Weighted by household size")
(analytic weights assumed)
(analytic weights assumed)
(analytic weights assumed)
In Stata you the common command for creating a new variables is
generate. There are other operators, like egen
and recode, that have more specific functions.
3.1. Use the gen command to create a
new variable equal to the consumption expenditure per pound of income
variable apc=P550tpr/P344pr.
Next, label the variable “Average Propensity to Consume”. Hint: type
help label. To check that the new label has been applied,
type describe apc to check that it has worked.
. gen apc = P550tpr/P344pr
(12 missing values generated)
. lab var apc "Average Propensity to Consume"
. des apc
Variable Storage Display Value
name type format label Variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
apc float %9.0g Average Propensity to Consume
3.2. The variable p550tpr contains some
very strange values at the upper end of the distribution: type
sum P550tpr, det. Use the command
_pctile p550tpr, p(99) in order to find the value above
which 1% of points of P550tpr lie.
. sum P550tpr, det
Total expenditure (top coded)
─────────────────────────────────────────────────────────────
Percentiles Smallest
1% 64.05 30.525
5% 111.0905 31.952
10% 153.5125 32.13598 Obs 5,144
25% 254.0006 32.355 Sum of wgt. 5,144
50% 419.9035 Mean 479.7584
Largest Std. dev. 292.3652
75% 645.0512 1175
90% 934.4652 1175 Variance 85477.43
95% 1171.163 1175 Skewness .8262061
99% 1175 1175 Kurtosis 2.937132
. _pctile P550tpr, p(99)
3.3. Use the above information to create a new
variable (called exp) that is a duplicate of
P55tpr, but replaces the top 1% of values to missing. In
Stata missing values for numerical variables take the value “.”. You can
also use {“.a”, “.b”,…} if you want to assign different categories of
missing. For string variables, a missing value is just an empty string:
““.
Try to avoid using explicit numerical values in your code. For
example, instead of copying the 99th percentile from question 14, use
the Stata’s stored value. The command _pctile is an r-class
command (many estimation commands are e-class). You can see the stored
values with: return list.
. return list
scalars:
r(r1) = 1175
. gen exp = P550tpr if P550tpr < r(r1)
(256 missing values generated)
.
. * alternatively,
. gen exp2 = P550tpr
. replace exp2 = . if P550tpr >= r(r1)
(256 real changes made, 256 to missing)
. compare exp exp2
────────── Difference ──────────
Count Minimum Average Maximum
────────────────────────────────────────────────────────────────────────
exp=exp2 4888
──────────
Jointly defined 4888 0 0 0
Jointly missing 256
──────────
Total 5144
.
. * check:
. sum exp, det
exp
─────────────────────────────────────────────────────────────
Percentiles Smallest
1% 63.52271 30.525
5% 109.0927 31.952
10% 149.5541 32.13598 Obs 4,888
25% 245.8819 32.355 Sum of wgt. 4,888
50% 402.132 Mean 443.3464
Largest Std. dev. 251.6121
75% 594.9666 1168.479
90% 810.3289 1169.66 Variance 63308.64
95% 942.6381 1171.163 Skewness .6960889
99% 1108.645 1172.356 Kurtosis 2.857423
3.4. Produce a suitably labelled histogram of
exp.
. hist exp, xtitle(Expenditure) percent fc(maroon%50) lc(maroon) width(10) xlabel(0(100)1200) xmtick(0(10)1200) (bin=115, start=30.525, width=10)
3.5. Using the approach from 3.3, replace the top 1%
of values in apc to missing.
. _pctile apc, p(99)
. return list
scalars:
r(r1) = 3.661458730697632
. replace apc = . if apc>=r(r1)
(52 real changes made, 52 to missing)
3.6. Produce suitably labelled
horizontal histogram of apc and export this graph as a
.pdf file. Hint: help graph export.
. hist apc, xtitle(Propensity to consume) percent horizontal width(0.1) ymtick(0(0.1)4) fc(gold%50) lc(gold) (bin=36, start=.1207175, width=.1) . graph export hist-apc.pdf, replace file hist-apc.pdf saved as PDF format
4.1. Use the tabulate and/or
table commands to learn about the frequency distributions
of P425r, A172, SexHRP,
Gorx, and A049r. Note, if you you can also
cross-tabulate two variables; for example,
tab P435r A172.
. tab P425r
Main source │
of household │
income │
(recoded) │ Freq. Percent Cum.
──────────────┼───────────────────────────────────
Earned income │ 2,841 55.23 55.23
Other income │ 2,303 44.77 100.00
──────────────┼───────────────────────────────────
Total │ 5,144 100.00
. tab A172
Internet │
connection │
in │
household │ Freq. Percent Cum.
────────────┼───────────────────────────────────
Yes │ 4,232 82.27 82.27
No │ 912 17.73 100.00
────────────┼───────────────────────────────────
Total │ 5,144 100.00
. tab SexHRP
Sex of │
Household │
Reference │
Person │ Freq. Percent Cum.
────────────┼───────────────────────────────────
Male │ 3,156 61.35 61.35
Female │ 1,988 38.65 100.00
────────────┼───────────────────────────────────
Total │ 5,144 100.00
. tab Gorx
Government Office Region │
modified │ Freq. Percent Cum.
──────────────────────────┼───────────────────────────────────
North East │ 251 4.88 4.88
North West and Merseyside │ 585 11.37 16.25
Yorkshire and the Humber │ 462 8.98 25.23
East Midlands │ 424 8.24 33.48
West Midlands │ 526 10.23 43.70
Eastern │ 497 9.66 53.36
London │ 480 9.33 62.69
South East │ 681 13.24 75.93
South West │ 429 8.34 84.27
Wales │ 246 4.78 89.06
Scotland │ 412 8.01 97.06
Northern Ireland │ 151 2.94 100.00
──────────────────────────┼───────────────────────────────────
Total │ 5,144 100.00
. tab A049r
Household size; │
number of │
persons in HH │
(recoded) │ Freq. Percent Cum.
─────────────────┼───────────────────────────────────
1 person │ 1,434 27.88 27.88
2 person │ 1,926 37.44 65.32
3 person │ 779 15.14 80.46
4 person │ 670 13.02 93.49
5 person or more │ 335 6.51 100.00
─────────────────┼───────────────────────────────────
Total │ 5,144 100.00
4.2. Report summary statistics for the variable
exp separately by main source of income
(P425r). Since, exp is a continuous variable
you do not want to use tabulate. Instead, use the
summarize command. This can be combined with a categorical
in a few ways: (1) tab catvar, sum(continvar);
bysort catvar: sum continvar, det; (3)
table catvar, stat(mean continvar sd continvar count continvar).
The table option is the most flexible, as there is a wider
range of statistics available; including percentiles.
. tab P425r, sum(exp)
Main source │
of │
household │
income │ Summary of exp
(recoded) │ Mean Std. dev. Freq.
────────────┼────────────────────────────────────
Earned in │ 540.39831 237.40254 2,637
Other inc │ 329.65214 217.95931 2,251
────────────┼────────────────────────────────────
Total │ 443.34642 251.61209 4,888
. bysort P425r: sum exp, det
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Earned i
exp
─────────────────────────────────────────────────────────────
Percentiles Smallest
1% 125.4465 38.945
5% 211.6911 49.90686
10% 252.5987 62.945 Obs 2,637
25% 362.783 64.01518 Sum of wgt. 2,637
50% 503.3182 Mean 540.3983
Largest Std. dev. 237.4025
75% 688.3661 1168.479
90% 891.5864 1169.66 Variance 56359.97
95% 997.3179 1171.163 Skewness .5331345
99% 1126.041 1172.356 Kurtosis 2.665763
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Other in
exp
─────────────────────────────────────────────────────────────
Percentiles Smallest
1% 52.56 30.525
5% 84.535 31.952
10% 107.92 32.13598 Obs 2,251
25% 169.7276 32.355 Sum of wgt. 2,251
50% 271.4606 Mean 329.6521
Largest Std. dev. 217.9593
75% 432.3063 1147.506
90% 646.2236 1154.613 Variance 47506.26
95% 780.106 1159.285 Skewness 1.265761
99% 1018.757 1161.778 Kurtosis 4.424679
. table P425r, stat(mean exp) stat(sd exp) stat(count exp)
──────────────────────────────────────────┬─────────────────────────────────────────────────────────────
│ Mean Standard deviation Number of nonmissing values
──────────────────────────────────────────┼─────────────────────────────────────────────────────────────
Main source of household income (recoded) │
Earned income │ 540.3983 237.4025 2,637
Other income │ 329.6521 217.9593 2,251
Total │ 443.3464 251.6121 4,888
──────────────────────────────────────────┴─────────────────────────────────────────────────────────────
4.3. Produce a suitably labelled histogram of
exp separately for households according to their main
source of income (P425r).
. hist exp, by(P425r, note("") title("Distribution of expenditure by main source of income")) percent xtitle(Expenditure) lc(eltgreen) fc(eltgreen%50)
4.4. Construct summary statistics for
exp for households according to both the main source of
income (P425r), and internet connection
(A172). Here, it is best to use the table
command.
. table P425r A172, stat(mean exp)
──────────────────────────────────────────┬──────────────────────────────────────
│ Internet connection in household
│ Yes No Total
──────────────────────────────────────────┼──────────────────────────────────────
Main source of household income (recoded) │
Earned income │ 554.3448 321.5789 540.3983
Other income │ 389.7849 209.0654 329.6521
Total │ 492.2577 228.6654 443.3464
──────────────────────────────────────────┴──────────────────────────────────────
.
. * alternatively,
. bys P425r A172 : sum exp
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Earned i, A172 = Yes
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
exp │ 2,479 554.3448 234.2653 38.945 1172.356
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Earned i, A172 = No
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
exp │ 158 321.5789 169.7532 64.01518 1034.258
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Other in, A172 = Yes
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
exp │ 1,502 389.7849 226.8636 45.58 1161.778
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> P425r = Other in, A172 = No
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
exp │ 749 209.0654 133.4172 30.525 982.8839
4.5. Export an the table from 4.2. to an Excel
spreadsheet using dtable.
. dtable exp, export(sumstats, as(xlsx) replace) by(P425r)
─────────────────────────────────────────────────────────
Main source of household income (recoded)
Earned income Other income Total
─────────────────────────────────────────────────────────
N 2,841 (55.2%) 2,303 (44.8%) 5,144 (100.0%)
exp 540.398 (237.403) 329.652 (217.959) 443.346 (251.612)
─────────────────────────────────────────────────────────
(collection DTable exported to file sumstats.xlsx)
5.1. At the 1% significance level, test the
hypothesis of no difference in exp between households with
earnings and other sources as their main source of income
(P425r). Hint: help ttest. Do consider the
parameters of the test: one-side or two-sided, equal or unequal
variance, significance level.
. ttest exp, by(P425r) unequal level(99)
Two-sample t test with unequal variances
─────────┬────────────────────────────────────────────────────────────────────
Group │ Obs Mean Std. err. Std. dev. [99% conf. interval]
─────────┼────────────────────────────────────────────────────────────────────
Earned i │ 2,637 540.3983 4.623068 237.4025 528.4814 552.3152
Other in │ 2,251 329.6521 4.593965 217.9593 317.8088 341.4955
─────────┼────────────────────────────────────────────────────────────────────
Combined │ 4,888 443.3464 3.598868 251.6121 434.0727 452.6201
─────────┼────────────────────────────────────────────────────────────────────
diff │ 210.7462 6.517459 193.9517 227.5406
─────────┴────────────────────────────────────────────────────────────────────
diff = mean(Earned i) - mean(Other in) t = 32.3356
H0: diff = 0 Satterthwaite's degrees of freedom = 4860.16
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000
5.2. At the 1% significance level, test the
hypothesis of no differences in exp between households with
earnings and other sources as their main source of income
(P425r) by internet connection (A172).
. bys A172 : ttest exp, by(P425r) unequal level(99)
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> A172 = Yes
Two-sample t test with unequal variances
─────────┬────────────────────────────────────────────────────────────────────
Group │ Obs Mean Std. err. Std. dev. [99% conf. interval]
─────────┼────────────────────────────────────────────────────────────────────
Earned i │ 2,479 554.3448 4.705109 234.2653 542.2159 566.4737
Other in │ 1,502 389.7849 5.853691 226.8636 374.6876 404.8822
─────────┼────────────────────────────────────────────────────────────────────
Combined │ 3,981 492.2577 3.880372 244.8327 482.2577 502.2577
─────────┼────────────────────────────────────────────────────────────────────
diff │ 164.5599 7.510243 145.2034 183.9164
─────────┴────────────────────────────────────────────────────────────────────
diff = mean(Earned i) - mean(Other in) t = 21.9114
H0: diff = 0 Satterthwaite's degrees of freedom = 3246.26
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> A172 = No
Two-sample t test with unequal variances
─────────┬────────────────────────────────────────────────────────────────────
Group │ Obs Mean Std. err. Std. dev. [99% conf. interval]
─────────┼────────────────────────────────────────────────────────────────────
Earned i │ 158 321.5789 13.50484 169.7532 286.3649 356.7928
Other in │ 749 209.0654 4.87496 133.4172 196.4763 221.6546
─────────┼────────────────────────────────────────────────────────────────────
Combined │ 907 228.6654 4.870152 146.6717 216.0942 241.2365
─────────┼────────────────────────────────────────────────────────────────────
diff │ 112.5134 14.35779 75.17385 149.853
─────────┴────────────────────────────────────────────────────────────────────
diff = mean(Earned i) - mean(Other in) t = 7.8364
H0: diff = 0 Satterthwaite's degrees of freedom = 199.869
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000
5.3. From the variable P344tp create a
suitably labelled binary variable inc_m which is =1 if
above median income, and =0 otherwise. Label the values 0 and 1 within
this variable.
. _pctile P344pr, p(50) /*50 says it is the median that is being saved in r(r1) */
. gen inc_m=(P344pr>`r(r1)') /* creates a dummy for those observations above the median */
. lab var inc_m "Above median income"
. tab inc_m
Above │
median │
income │ Freq. Percent Cum.
────────────┼───────────────────────────────────
0 │ 2,572 50.00 50.00
1 │ 2,572 50.00 100.00
────────────┼───────────────────────────────────
Total │ 5,144 100.00
. lab def inc_m 0 "Below median" 1 "above median"
. lab val inc_m inc_m
. tab inc_m
Above median │
income │ Freq. Percent Cum.
─────────────┼───────────────────────────────────
Below median │ 2,572 50.00 50.00
above median │ 2,572 50.00 100.00
─────────────┼───────────────────────────────────
Total │ 5,144 100.00
5.4. At the 1% significance level, test for a
difference in the mean expenditure (exp) of those above and
below median income.
. ttest exp, by(inc_m) unequal level(99)
Two-sample t test with unequal variances
─────────┬────────────────────────────────────────────────────────────────────
Group │ Obs Mean Std. err. Std. dev. [99% conf. interval]
─────────┼────────────────────────────────────────────────────────────────────
Below me │ 2,560 299.5183 3.491319 176.6483 290.5186 308.5181
above me │ 2,328 601.5079 4.673948 225.5149 589.4587 613.5571
─────────┼────────────────────────────────────────────────────────────────────
Combined │ 4,888 443.3464 3.598868 251.6121 434.0727 452.6201
─────────┼────────────────────────────────────────────────────────────────────
diff │ -301.9895 5.83396 -317.0233 -286.9557
─────────┴────────────────────────────────────────────────────────────────────
diff = mean(Below me) - mean(above me) t = -51.7641
H0: diff = 0 Satterthwaite's degrees of freedom = 4402.02
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000
5.5. From the variable P344tp create a
categorical variable inc_cat based on the quintiles of
P344tp. Label the variableinc_cat and the 5 values within
this variable. You can either do this manually or explore
help _pctile.
. * manual approach
. _pctile P344pr, p(20, 40, 60, 80)
. gen inc_cat=1 if( P344pr<`r(r1)')
(4,116 missing values generated)
. replace inc_cat=2 if(P344pr>=`r(r1)' & P344pr<`r(r2)')
(1,029 real changes made)
. replace inc_cat=3 if(P344pr>=`r(r2)' & P344pr<`r(r3)')
(1,029 real changes made)
. replace inc_cat=4 if(P344pr>=`r(r3)' & P344pr<`r(r4)')
(1,029 real changes made)
. replace inc_cat=5 if(P344pr>=`r(r4)')
(1,029 real changes made)
. tab inc_cat
inc_cat │ Freq. Percent Cum.
────────────┼───────────────────────────────────
1 │ 1,028 19.98 19.98
2 │ 1,029 20.00 39.99
3 │ 1,029 20.00 59.99
4 │ 1,029 20.00 80.00
5 │ 1,029 20.00 100.00
────────────┼───────────────────────────────────
Total │ 5,144 100.00
. lab var inc_cat "Income quintile"
. lab def inc_cat 1 "1st" 2 "2nd" 3 "3rd" 4 "4th" 5 "5th"
. lab val inc_cat inc_cat
. tab inc_cat
Income │
quintile │ Freq. Percent Cum.
────────────┼───────────────────────────────────
1st │ 1,028 19.98 19.98
2nd │ 1,029 20.00 39.99
3rd │ 1,029 20.00 59.99
4th │ 1,029 20.00 80.00
5th │ 1,029 20.00 100.00
────────────┼───────────────────────────────────
Total │ 5,144 100.00
.
. * xtile option
. xtile inc_cat2 = P344pr, n(5)
.
. * check they are the same
. tab inc_cat inc_cat2, m
Income │ 5 quantiles of P344pr
quintile │ 1 2 3 4 5 │ Total
───────────┼───────────────────────────────────────────────────────┼──────────
1st │ 1,028 0 0 0 0 │ 1,028
2nd │ 1 1,028 0 0 0 │ 1,029
3rd │ 0 1 1,028 0 0 │ 1,029
4th │ 0 0 1 1,028 0 │ 1,029
5th │ 0 0 0 1 1,028 │ 1,029
───────────┼───────────────────────────────────────────────────────┼──────────
Total │ 1,029 1,029 1,029 1,029 1,028 │ 5,144
5.6. Using inc_cat, produce a suitably
labelled plot of the mean of exp for each of the 5 income categories.
Try to use the graph bar function.
. graph bar (mean) exp, over(inc_cat) blabel(bar)
5.7. Produce a plot the mean of exp against
inc_cat, separately by internet connection
(A172).
. label def internet 1 "Has internet" 2 "No internet"
. lab val A172 internet
. graph bar (mean) exp, over(inc_cat) blabel(bar, format(%3.0f)) by(A172, title("Average expenditure by income quintile" "and internet connection") note("")) ytitle(Expenditure) inte
> nsity(*.5) bar(1,color(erose))
. * alternatively,
. graph bar (mean) exp, over(inc_cat) over(A172) blabel(bar, format(%3.0f)) ytitle(Expenditure) title("Average expenditure by income quintile" "and internet connection") intensity(*.
> 5) bar(1,color(emidblue))
A loop is an operator that repeats the same operation through a given
sequence. The three main loops are foreach,
forvalues, and while. Loops can be used to
make code more efficient and also in the design of your own
programme/function (see help program). Within a loop, you
must refer to the running argument using local notation. For example,
the following loop creates a variable that is equal to it’s row value
for rows 1 through 10.
. gen number = .
(5,144 missing values generated)
. forvalues i = 1/10 {
2. dis "Number `i'"
3. replace number = `i' in `i'
4. }
Number 1
(1 real change made)
Number 2
(1 real change made)
Number 3
(1 real change made)
Number 4
(1 real change made)
Number 5
(1 real change made)
Number 6
(1 real change made)
Number 7
(1 real change made)
Number 8
(1 real change made)
Number 9
(1 real change made)
Number 10
(1 real change made)
. list number in 1/10
┌────────┐
│ number │
├────────┤
1. │ 1 │
2. │ 2 │
3. │ 3 │
4. │ 4 │
5. │ 5 │
├────────┤
6. │ 6 │
7. │ 7 │
8. │ 8 │
9. │ 9 │
10. │ 10 │
└────────┘
The curly brackets denote the beginning an end of the actions
included in each loop. You can use the display command like
print() from other programming languages. If you want to
add conditions to a loop, you can do so using if,
else if, and else statements. For example,
. gen str even = ""
(5,144 missing values generated)
. forvalues i = 1/10 {
2. if mod(`i',2) == 0 {
3. dis "Number `i' is even"
4. replace even = "even" in `i'
5. }
6. else {
7. dis "Number `i' is odd"
8. replace even = "odd" in `i'
9. }
10. }
Number 1 is odd
variable even was str1 now str3
(1 real change made)
Number 2 is even
variable even was str3 now str4
(1 real change made)
Number 3 is odd
(1 real change made)
Number 4 is even
(1 real change made)
Number 5 is odd
(1 real change made)
Number 6 is even
(1 real change made)
Number 7 is odd
(1 real change made)
Number 8 is even
(1 real change made)
Number 9 is odd
(1 real change made)
Number 10 is even
(1 real change made)
. list number even in 1/10
┌───────────────┐
│ number even │
├───────────────┤
1. │ 1 odd │
2. │ 2 even │
3. │ 3 odd │
4. │ 4 even │
5. │ 5 odd │
├───────────────┤
6. │ 6 even │
7. │ 7 odd │
8. │ 8 even │
9. │ 9 odd │
10. │ 10 even │
└───────────────┘
6.1. Repeat exercise 4.1., where you tabulated each
of the variables, using a foreach loop.
. foreach var in P425r A172 SexHRP Gorx A049r {
2. tab `var'
3. }
Main source │
of household │
income │
(recoded) │ Freq. Percent Cum.
──────────────┼───────────────────────────────────
Earned income │ 2,841 55.23 55.23
Other income │ 2,303 44.77 100.00
──────────────┼───────────────────────────────────
Total │ 5,144 100.00
Internet │
connection │
in household │ Freq. Percent Cum.
─────────────┼───────────────────────────────────
Has internet │ 4,232 82.27 82.27
No internet │ 912 17.73 100.00
─────────────┼───────────────────────────────────
Total │ 5,144 100.00
Sex of │
Household │
Reference │
Person │ Freq. Percent Cum.
────────────┼───────────────────────────────────
Male │ 3,156 61.35 61.35
Female │ 1,988 38.65 100.00
────────────┼───────────────────────────────────
Total │ 5,144 100.00
Government Office Region │
modified │ Freq. Percent Cum.
──────────────────────────┼───────────────────────────────────
North East │ 251 4.88 4.88
North West and Merseyside │ 585 11.37 16.25
Yorkshire and the Humber │ 462 8.98 25.23
East Midlands │ 424 8.24 33.48
West Midlands │ 526 10.23 43.70
Eastern │ 497 9.66 53.36
London │ 480 9.33 62.69
South East │ 681 13.24 75.93
South West │ 429 8.34 84.27
Wales │ 246 4.78 89.06
Scotland │ 412 8.01 97.06
Northern Ireland │ 151 2.94 100.00
──────────────────────────┼───────────────────────────────────
Total │ 5,144 100.00
Household size; │
number of │
persons in HH │
(recoded) │ Freq. Percent Cum.
─────────────────┼───────────────────────────────────
1 person │ 1,434 27.88 27.88
2 person │ 1,926 37.44 65.32
3 person │ 779 15.14 80.46
4 person │ 670 13.02 93.49
5 person or more │ 335 6.51 100.00
─────────────────┼───────────────────────────────────
Total │ 5,144 100.00
6.2. Using a forvalues loop, create 5
dummy variables (hhs1,…,hhs5), each indicating
one of the household size values (hhsize). Check that each
new variable takes on the right values by including the following in
your loop: tab hhs1 hhsize, miss.
. forvalues i = 1/5 {
2. gen hhs`i' = hhsize==`i'
3. tab hhs`i' hhsize, miss
4. }
│ Household size, number of people in household
│ (recoded)formerly A049r
hhs1 │ 1 2 3 4 5 │ Total
───────────┼───────────────────────────────────────────────────────┼──────────
0 │ 0 1,926 779 670 335 │ 3,710
1 │ 1,434 0 0 0 0 │ 1,434
───────────┼───────────────────────────────────────────────────────┼──────────
Total │ 1,434 1,926 779 670 335 │ 5,144
│ Household size, number of people in household
│ (recoded)formerly A049r
hhs2 │ 1 2 3 4 5 │ Total
───────────┼───────────────────────────────────────────────────────┼──────────
0 │ 1,434 0 779 670 335 │ 3,218
1 │ 0 1,926 0 0 0 │ 1,926
───────────┼───────────────────────────────────────────────────────┼──────────
Total │ 1,434 1,926 779 670 335 │ 5,144
│ Household size, number of people in household
│ (recoded)formerly A049r
hhs3 │ 1 2 3 4 5 │ Total
───────────┼───────────────────────────────────────────────────────┼──────────
0 │ 1,434 1,926 0 670 335 │ 4,365
1 │ 0 0 779 0 0 │ 779
───────────┼───────────────────────────────────────────────────────┼──────────
Total │ 1,434 1,926 779 670 335 │ 5,144
│ Household size, number of people in household
│ (recoded)formerly A049r
hhs4 │ 1 2 3 4 5 │ Total
───────────┼───────────────────────────────────────────────────────┼──────────
0 │ 1,434 1,926 779 0 335 │ 4,474
1 │ 0 0 0 670 0 │ 670
───────────┼───────────────────────────────────────────────────────┼──────────
Total │ 1,434 1,926 779 670 335 │ 5,144
│ Household size, number of people in household
│ (recoded)formerly A049r
hhs5 │ 1 2 3 4 5 │ Total
───────────┼───────────────────────────────────────────────────────┼──────────
0 │ 1,434 1,926 779 670 0 │ 4,809
1 │ 0 0 0 0 335 │ 335
───────────┼───────────────────────────────────────────────────────┼──────────
Total │ 1,434 1,926 779 670 335 │ 5,144
7.1 Demonstrate partitioned regression result
. _pctile P344pr, p(99)
. gen inc = P344pr if P344pr < r(r1)
(785 missing values generated)
.
. est clear
. eststo: reg exp inc
Source │ SS df MS Number of obs = 4,279
─────────────┼────────────────────────────────── F(1, 4277) = 2854.45
Model │ 87293780.3 1 87293780.3 Prob > F = 0.0000
Residual │ 130797888 4,277 30581.6899 R-squared = 0.4003
─────────────┼────────────────────────────────── Adj R-squared = 0.4001
Total │ 218091668 4,278 50979.8196 Root MSE = 174.88
─────────────┬────────────────────────────────────────────────────────────────
exp │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
inc │ .4941574 .0092492 53.43 0.000 .4760242 .5122907
_cons │ 147.1124 5.450921 26.99 0.000 136.4258 157.7991
─────────────┴────────────────────────────────────────────────────────────────
(est1 stored)
. eststo: reg exp inc hhsize
Source │ SS df MS Number of obs = 4,279
─────────────┼────────────────────────────────── F(2, 4276) = 1606.17
Model │ 93556576.4 2 46778288.2 Prob > F = 0.0000
Residual │ 124535092 4,276 29124.2029 R-squared = 0.4290
─────────────┼────────────────────────────────── Adj R-squared = 0.4287
Total │ 218091668 4,278 50979.8196 Root MSE = 170.66
─────────────┬────────────────────────────────────────────────────────────────
exp │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
inc │ .4322739 .0099639 43.38 0.000 .4127395 .4518084
hhsize │ 36.41656 2.483372 14.66 0.000 31.54786 41.28526
_cons │ 99.33041 6.238092 15.92 0.000 87.10051 111.5603
─────────────┴────────────────────────────────────────────────────────────────
(est2 stored)
Next, create residuals from regressing inc and exp against hhsize
. reg inc hhsize
Source │ SS df MS Number of obs = 4,359
─────────────┼────────────────────────────────── F(1, 4357) = 942.99
Model │ 65502874.6 1 65502874.6 Prob > F = 0.0000
Residual │ 302651300 4,357 69463.2316 R-squared = 0.1779
─────────────┼────────────────────────────────── Adj R-squared = 0.1777
Total │ 368154175 4,358 84477.7821 Root MSE = 263.56
─────────────┬────────────────────────────────────────────────────────────────
inc │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
hhsize │ 105.2338 3.426909 30.71 0.000 98.51532 111.9523
_cons │ 287.5106 8.523331 33.73 0.000 270.8005 304.2206
─────────────┴────────────────────────────────────────────────────────────────
. predict double M2inc, resid
(785 missing values generated)
.
. reg exp hhsize
Source │ SS df MS Number of obs = 4,888
─────────────┼────────────────────────────────── F(1, 4886) = 1269.03
Model │ 63789382.1 1 63789382.1 Prob > F = 0.0000
Residual │ 245599961 4,886 50266.0583 R-squared = 0.2062
─────────────┼────────────────────────────────── Adj R-squared = 0.2060
Total │ 309389343 4,887 63308.644 Root MSE = 224.2
─────────────┬────────────────────────────────────────────────────────────────
exp │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
hhsize │ 96.62228 2.712317 35.62 0.000 91.30492 101.9396
_cons │ 222.3487 6.983517 31.84 0.000 208.6578 236.0395
─────────────┴────────────────────────────────────────────────────────────────
. predict double M2exp, resid
(256 missing values generated)
Estimate model with residualized variables
. eststo: reg exp M2inc
Source │ SS df MS Number of obs = 4,279
─────────────┼────────────────────────────────── F(1, 4277) = 1440.59
Model │ 54949821.1 1 54949821.1 Prob > F = 0.0000
Residual │ 163141847 4,277 38143.9904 R-squared = 0.2520
─────────────┼────────────────────────────────── Adj R-squared = 0.2518
Total │ 218091668 4,278 50979.8196 Root MSE = 195.3
─────────────┬────────────────────────────────────────────────────────────────
exp │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
M2inc │ .4327982 .0114029 37.96 0.000 .4104426 .4551538
_cons │ 402.5694 2.985992 134.82 0.000 396.7153 408.4235
─────────────┴────────────────────────────────────────────────────────────────
(est3 stored)
. eststo: reg M2exp M2inc
Source │ SS df MS Number of obs = 4,279
─────────────┼────────────────────────────────── F(1, 4277) = 1863.15
Model │ 54792875.5 1 54792875.5 Prob > F = 0.0000
Residual │ 125781329 4,277 29408.7745 R-squared = 0.3034
─────────────┼────────────────────────────────── Adj R-squared = 0.3033
Total │ 180574204 4,278 42209.9589 Root MSE = 171.49
─────────────┬────────────────────────────────────────────────────────────────
M2exp │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
M2inc │ .4321797 .0100125 43.16 0.000 .4125501 .4518093
_cons │ -30.88738 2.621887 -11.78 0.000 -36.02764 -25.74712
─────────────┴────────────────────────────────────────────────────────────────
(est4 stored)
. eststo: reg M2exp inc
Source │ SS df MS Number of obs = 4,279
─────────────┼────────────────────────────────── F(1, 4277) = 1175.17
Model │ 38921396.6 1 38921396.6 Prob > F = 0.0000
Residual │ 141652808 4,277 33119.6651 R-squared = 0.2155
─────────────┼────────────────────────────────── Adj R-squared = 0.2154
Total │ 180574204 4,278 42209.9589 Root MSE = 181.99
─────────────┬────────────────────────────────────────────────────────────────
M2exp │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
inc │ .329965 .0096254 34.28 0.000 .3110943 .3488357
_cons │ -202.0139 5.672599 -35.61 0.000 -213.1352 -190.8927
─────────────┴────────────────────────────────────────────────────────────────
(est5 stored)
.
. esttab, se stat(r2 rss mss F df_m N) rename(M2inc inc)
────────────────────────────────────────────────────────────────────────────────────────────
(1) (2) (3) (4) (5)
exp exp exp M2exp M2exp
────────────────────────────────────────────────────────────────────────────────────────────
inc 0.494*** 0.432*** 0.433*** 0.432*** 0.330***
(0.00925) (0.00996) (0.0114) (0.0100) (0.00963)
hhsize 36.42***
(2.483)
_cons 147.1*** 99.33*** 402.6*** -30.89*** -202.0***
(5.451) (6.238) (2.986) (2.622) (5.673)
────────────────────────────────────────────────────────────────────────────────────────────
r2 0.400 0.429 0.252 0.303 0.216
rss 130797887.8 124535091.7 163141847.0 125781328.7 141652807.6
mss 87293780.3 93556576.4 54949821.1 54792875.5 38921396.6
F 2854.4 1606.2 1440.6 1863.1 1175.2
df_m 1 2 1 1 1
N 4279 4279 4279 4279 4279
────────────────────────────────────────────────────────────────────────────────────────────
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
Models 2,3,4 yield the same coefficient (up to some rounding error). What about the SE’s? Between 3 and 4, which gives you the correct projection and SE’s? In model 3, we have artificially inflated the amount of variation in the outcome by failing to residualize it. The estimator has a higher SE, as the residual variation is higher compared to 2 and 4.
Question: should the other model stats be the same? For example, R2?
7.2 Demonstrate equality of projections with base category.
. tab P425r, gen(inc_src)
Main source │
of household │
income │
(recoded) │ Freq. Percent Cum.
──────────────┼───────────────────────────────────
Earned income │ 2,841 55.23 55.23
Other income │ 2,303 44.77 100.00
──────────────┼───────────────────────────────────
Total │ 5,144 100.00
.
. est clear
. eststo: reg exp inc inc_src2
Source │ SS df MS Number of obs = 4,279
─────────────┼────────────────────────────────── F(2, 4276) = 1438.20
Model │ 87707427.2 2 43853713.6 Prob > F = 0.0000
Residual │ 130384241 4,276 30492.105 R-squared = 0.4022
─────────────┼────────────────────────────────── Adj R-squared = 0.4019
Total │ 218091668 4,278 50979.8196 Root MSE = 174.62
─────────────┬────────────────────────────────────────────────────────────────
exp │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
inc │ .4729693 .0108807 43.47 0.000 .4516374 .4943012
inc_src2 │ -23.17541 6.292255 -3.68 0.000 -35.5115 -10.83933
_cons │ 169.8991 8.240186 20.62 0.000 153.744 186.0541
─────────────┴────────────────────────────────────────────────────────────────
(est1 stored)
. eststo: reg exp inc inc_src1
Source │ SS df MS Number of obs = 4,279
─────────────┼────────────────────────────────── F(2, 4276) = 1438.20
Model │ 87707427.2 2 43853713.6 Prob > F = 0.0000
Residual │ 130384241 4,276 30492.105 R-squared = 0.4022
─────────────┼────────────────────────────────── Adj R-squared = 0.4019
Total │ 218091668 4,278 50979.8196 Root MSE = 174.62
─────────────┬────────────────────────────────────────────────────────────────
exp │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
inc │ .4729693 .0108807 43.47 0.000 .4516374 .4943012
inc_src1 │ 23.17541 6.292255 3.68 0.000 10.83933 35.5115
_cons │ 146.7236 5.443954 26.95 0.000 136.0507 157.3966
─────────────┴────────────────────────────────────────────────────────────────
(est2 stored)
. eststo: reg exp inc inc_src1 inc_src2, nocons
Source │ SS df MS Number of obs = 4,279
─────────────┼────────────────────────────────── F(3, 4276) = 8477.17
Model │ 775460099 3 258486700 Prob > F = 0.0000
Residual │ 130384241 4,276 30492.105 R-squared = 0.8561
─────────────┼────────────────────────────────── Adj R-squared = 0.8560
Total │ 905844340 4,279 211695.335 Root MSE = 174.62
─────────────┬────────────────────────────────────────────────────────────────
exp │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
inc │ .4729693 .0108807 43.47 0.000 .4516374 .4943012
inc_src1 │ 169.8991 8.240186 20.62 0.000 153.744 186.0541
inc_src2 │ 146.7236 5.443954 26.95 0.000 136.0507 157.3966
─────────────┴────────────────────────────────────────────────────────────────
(est3 stored)
.
. esttab, se stat(r2 rss mss F df_m N)
────────────────────────────────────────────────────────────
(1) (2) (3)
exp exp exp
────────────────────────────────────────────────────────────
inc 0.473*** 0.473*** 0.473***
(0.0109) (0.0109) (0.0109)
inc_src2 -23.18*** 146.7***
(6.292) (5.444)
inc_src1 23.18*** 169.9***
(6.292) (8.240)
_cons 169.9*** 146.7***
(8.240) (5.444)
────────────────────────────────────────────────────────────
r2 0.402 0.402 0.856
rss 130384240.9 130384240.9 130384240.9
mss 87707427.2 87707427.2 775460098.6
F 1438.2 1438.2 8477.2
df_m 2 2 3
N 4279 4279 4279
────────────────────────────────────────────────────────────
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
These are essentially the same models and regardless of choice on base category the coefficient (and SE) on inc does not change. Thus, M2 projection is the same in all 3 models. They have the same RSS. You will notice that Stata’s R2 and F-stat are different for model 3. This is because Stata seems to compute a different MSS (or ESS) for model 3. It’s not clear why, by likely has to do with the absence of a constant. If you examine the predicted values, they all have the same variance; suggesting the ESS should be the same. A mystery to be resolve.
. qui reg exp inc inc_src2
. predict exp_hat1
(option xb assumed; fitted values)
(785 missing values generated)
. qui reg exp inc inc_src1
. predict exp_hat2
(option xb assumed; fitted values)
(785 missing values generated)
. qui reg exp inc inc_src1 inc_src2, nocons
. predict exp_hat3
(option xb assumed; fitted values)
(785 missing values generated)
.
. sum exp_hat*
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
exp_hat1 │ 4,359 403.4237 143.8882 146.7236 730.3263
exp_hat2 │ 4,359 403.4237 143.8882 146.7236 730.3263
exp_hat3 │ 4,359 403.4237 143.8882 146.7236 730.3263
Before you close finish up ensure that your do-file includes the follow at the end,
log close
And check that the do-file runs without error, replicating all of your results. You can do this using the “do” icon at the top right of the do-file editing window.
Finally, don’t forget to save changes to your do-file and do NOT save changes to the data when you close Stata.
Stata formatted datasets use the extension “.dta”. You can open a range of other file types in Stata using the ‘import’ command. Go to “File>Import” in the main Stata window.↩︎
Avoid the editor-icon (image of table and
pencil) or “Data>Data Editor>Data Editor (Edit)”. It opens
the dataset like browse, but allows you to manually edit
cell values, just like in an Excel spreadsheet. You want to avoid doing
this as the process is not replicable. More accurately, these actions
are recorded based on the row-column value of the cell being editted.
They are replicable, but dependent on the sorting of observations (i.e.,
rows) and order of the variables (i.e., columns), which may change.↩︎