8 Oct 2024
This exercise sheet is designed as an introduction to Stata for those studying Econometrics A & B.
Stata is a powerful and reliable statistical analysis package. It is not a language like R or Python, but it does have command-based functionality. Those who are familiar with Stata’s commands will effectively be able to programme in scripts called do-files (named for their file extension “.do”). Stata does have its own mathematical (compiled) language, called Mata, which is used in the programming of many commands. It also includes integration for C, C++, R, and Python.
Stata remains the dominant software/language in applied Economics research (see R-bloggers post). This is a product of history (i.e. luck) and some of Stata’s key strengths. It is a software designed primarily for the analysis of cross-sectional and panel datasets, and has grown its market share with the rise in availability of digitized household survey data (beginning in the 1980s). Stata is very efficient at analyzing small-to-medium sized datasets. The efficiency derives from the fact that it loads the data into memory. Modern computers typically have more than 8GB of RAM, so you will have no problem analyzing a household survey.
However, this efficiency is also its “Achilles’ heel”. Stata is losing market share in the age of “big data”. While there are more efficient ways to analyse bigger datasets in Stata, it remains bound by the amount of memory available.
Stata’s dominance in empirical research also relates to its
peer-reviewed nature. Commands (packages) that come pre-installed in Stata,
along with those installed from Statistical Software Components (using
the command ssc install), are heavily vetted. For this
reason, Stata is used by consultancies that conduct research for legal
cases.
As the dominant research software, it is important that you have a working knowledge of Stata, even if you are fluent in other programming languages. There might be a research assistant opportunity in your future that requires you to work with Stata. For those less confident with ‘programming’, Stata is an easier introduction to data analysis.
In the market today, a wider range of software and programming-language experience is extremely valuable. We encourage you to develop your knowledge of Stata while investing in languages like R and Python.
The small dataset (filename “data.dta”)1 is based on the Living Costs and Food Survey, 2013 (LCF). The LCF, conducted by the Office for National Statistics, collects information on spending patterns and the cost of living that reflects household budgets across the UK. This teaching dataset is a subset which has been subject to certain simplifications and additions, for learning and teaching by CMIST, Manchester.
This problem set will review a sample of helpful Stata commands. In addition, a do-file with some comments will be provided along with the exercises. Please work through this exercise in conjunction with the do-file. Some of the commands listed below may seem irrelevant for the analysis of this particular dataset. They have been included because of their usefulness in other settings.
a. Create a folder on your computer where you intend to save this project. For example, “…/EC910/Seminars/Seminar-1”.
b. Download the dataset from Moodle - “data.dta” - and save/move it to the above folder.
c. Open Stata
d. Open a new do-file, either using the do-file icon (image of pen and paper) or “Window>Do-file Editor>New Do-file Editor”.
e. Save the new do-file to the project folder; e.g. “problem-set-1.do”.
f. Give the do-file a title, author, and date. To
treat text as a comment, you need to use the symbols *,
//, or /* followed by */. For
example,
. * ──────────────────────────────────────────────────────────────────────────── *
. * title: Problem Set 1
. * author: Neil Lloyd
. * date: 14 October 2024
. * ──────────────────────────────────────────────────────────────────────────── *
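As a minimal sketch of the three comment styles in a do-file (it assumes the example dataset auto.dta that ships with Stata, which is not part of this problem set):

```stata
* A line that starts with an asterisk is a comment
sysuse auto, clear    // text after a double slash is ignored

/* A block comment
   can span several lines */
summarize price /* it can also sit in the middle of a command */ mpg
```

The // style is intended for do-files rather than the interactive command line, so it is safest to test these lines from the do-file editor.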
g. Open the dataset “data.dta”. You can go to
“File>Open”, then locate the dataset on your computer and then select
“Open”. However, doing so is not replicable. Instead, you can use the
use command from within your do-file.
Add the following line to your do-file:
use "filepath/data.dta", clear
substituting the word “filepath” with the full file path of your
data. Alternatively, you can set a default directory for the project
using cd. If set, this will become the default directory
where Stata searches for a file or saves a file.
cd "filepath"
use data.dta, clear
Note that double quotation marks around the file path/name are only necessary when the path/name contains blank spaces; e.g. “problem set 1 data.dta”. This is one reason to avoid spaces in names.
Second, the option clear tells Stata to remove any
existing dataset from memory before opening this one. You cannot open
two datasets simultaneously. Any changes to the existing dataset will be
lost unless saved. But this is not necessarily a bad thing. Read on.
h. Open a log file in which to record the output
from Stata. If you have set the default directory using cd,
you can do so as follows:
log using problem-set-1-log, replace
This will save the file using Stata’s own log-file format (file extension “.smcl”). It’s better to save log-files as simple text files:
log using problem-set-1-log, replace text
or
log using problem-set-1-log.txt, replace text
This way, they open in any text editor, like Notepad.
When you open a log-file, it will record all output (in the output
window), until you tell it to stop: log close. This should
appear at the end of your do-file.
As a suggestion, include the following before you open a log-file:
cap log close
log using problem-set-1-log.txt, replace text
followed by,
log close
at the end. We won’t explain why, but it will help you avoid an annoying feature of how log-files and do-files interact.
When you open a dataset in Stata, the software creates a copy in memory. Any changes you make to this copy will not affect the version you have saved on your hard drive, UNLESS you save those changes. It is a good principle not to save changes to the dataset, but rather to keep a record of all changes made as a list of commands in a saved do-file. This ensures replicability.
If you want to create a new dataset, save it under a new name. The command below will do just this.
save new-data.dta, replace
When you close Stata it will ask you if you want to save changes to the opened dataset. Unless you have a clear reason to do so, SAY NO. Ensure that your do-file is saved, well organized, and well commented. As a new feature of Stata 18, the software keeps a back-up of unsaved changes to the do-file (with file extension “.stswp”) in case the software or your computer crashes.
1.1. Review the data. You can either use the browse-icon (image of table and magnifying glass) or “Data>Data Editor>Data Editor (Browse)”.2 Else, in your do-file (or the main window’s command line) type,
browse
1.2. Use the describe (or simply
de) command to learn about the variables in the dataset. Do
you know what any of this information means?
1.3. Use the summarize (or simply
su) command to recover some basic summary statistics of the
variable P550tpr.
1.4. Use the detail option to learn
more about the variable P344pr. Are there any abnormal
values or outliers?
1.5. The summarize command doesn’t tell
you anything about missing data. Check to see if the number of
observations above matches the total observations in the dataset using
count. Was there any missing data?
1.6. List the first 10 values of variables
P550tpr and P344pr in the dataset using the
command list.
1.7. Make a simple frequency table of the values in
A121r using the command tabulate.
Here is a list of other commands to help you navigate and learn about the dataset:
lookfor
codebook
2.1. Make a histogram depicting the frequency of
values in the variable P550tpr. Try to make it so that the
y-axis is in percent and not density (Stata’s default). To
check the options of a command, type: help histogram.
2.2. Modify the above graph so that the width of each bin is £10. Which graph is more informative?
2.3. Use Stata’s flexible twoway graph
function to create a scatter plot of the relationship between
P550tpr and P344pr.
2.4. Overlay the above scatter plot with a line
graph of the linear fit between the same two variables. Hint: use the
graphing command lfit. If you can, make the colour of the
fitted line red.
2.5. Replicate 2.3., weighting each observation by
household size. That is, give more weight to larger households. Add a
note to the graph that explains this weighting. Play around with the
marker fill and outline colours so that the graph is more readable. For
example, you can adjust the opacity of the fill colour of each marker by
including the option: mfc(blue%20).
In Stata, the common command for creating a new variable is
generate. There are other commands, like egen
and recode, that have more specific functions.
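To illustrate the difference between these three commands, here is a short sketch. It assumes the example dataset auto.dta that ships with Stata; the new variable names (price_k, mean_price, rep3) are hypothetical and chosen only for this illustration.

```stata
sysuse auto, clear

* generate: row-by-row arithmetic on existing variables
generate price_k = price / 1000

* egen: functions computed across observations, here a group mean
egen mean_price = mean(price), by(foreign)

* recode: collapse the values of a categorical variable into a new one
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

list price price_k mean_price rep78 rep3 in 1/5
```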
3.1. Use the gen command to create a
new variable equal to consumption expenditure per pound of
income: apc=P550tpr/P344pr.
Next, label the variable “Average Propensity to Consume”. Hint: type
help label. To check that the new label has been applied,
type describe apc.
3.2. The variable P550tpr contains some
very strange values at the upper end of the distribution: type
sum P550tpr, det. Use the command
_pctile P550tpr, p(99) in order to find the value above
which 1% of the values of P550tpr lie.
3.3. Use the above information to create a new
variable (called exp) that is a duplicate of
P550tpr, but with the top 1% of values replaced by missing. In
Stata, missing values for numerical variables take the value “.”. You can
also use {“.a”, “.b”,…} if you want to assign different categories of
missing. For string variables, a missing value is just an empty string:
“”.
Try to avoid using explicit numerical values in your code. For
example, instead of copying the 99th percentile from question 3.2, use
Stata’s stored value. The command _pctile is an r-class
command (many estimation commands are e-class). You can see the stored
values with: return list.
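The idea of reusing a stored result can be sketched as follows. This is only an illustration on the example dataset auto.dta that ships with Stata (an assumption, not the LCF data), and price_trim is a hypothetical variable name.

```stata
sysuse auto, clear

* _pctile stores the requested percentile in r(r1)
_pctile price, p(99)
return list

* Use the stored value directly instead of typing the number;
* observations above the 99th percentile are left as missing
generate price_trim = price if price <= r(r1)

* Check how many observations were set to missing
count if missing(price_trim)
```

Stored r() results are overwritten by the next r-class command, so use them (or copy them into a local macro or scalar) straight after the command that produced them.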
3.4. Produce a suitably labelled histogram of
exp.
3.5. Using the approach from 3.3, replace the top 1%
of values in apc to missing.
3.6. Produce a suitably labelled
horizontal histogram of apc and export this graph as a
.pdf file. Hint: help graph export.
4.1. Use the tabulate and/or
table commands to learn about the frequency distributions
of P425r, A172, SexHRP,
Gorx, and A049r. Note that you can also
cross-tabulate two variables; for example,
tab P425r A172.
4.2. Report summary statistics for the variable
exp separately by main source of income
(P425r). Since exp is a continuous variable,
you do not want to use tabulate. Instead, use the
summarize command. This can be combined with a categorical
variable in a few ways: (1) tab catvar, sum(continvar); (2)
bysort catvar: sum continvar, det; (3)
table catvar, stat(mean continvar) stat(sd continvar) stat(count continvar).
The table command is the most flexible, as there is a wider
range of statistics available, including percentiles.
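A short sketch of the three approaches, side by side. It assumes the example dataset auto.dta that ships with Stata, with foreign as the categorical variable and price as the continuous one. The table line assumes the syntax introduced in Stata 17, where each statistic gets its own statistic() option.

```stata
sysuse auto, clear

* (1) one-way table with mean, standard deviation, and frequency
tabulate foreign, summarize(price)

* (2) detailed summary statistics, repeated for each group
bysort foreign: summarize price, detail

* (3) customisable table of statistics (Stata 17 or later)
table foreign, statistic(mean price) statistic(sd price) statistic(count price)
```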
4.3. Produce a suitably labelled histogram of
exp separately for households according to their main
source of income (P425r).
4.4. Construct summary statistics for
exp for households according to both the main source of
income (P425r), and internet connection
(A172). Here, it is best to use the table
command.
4.5. Export the table from 4.2. to an Excel
spreadsheet using dtable.
5.1. At the 1% significance level, test the
hypothesis of no difference in exp between households with
earnings and other sources as their main source of income
(P425r). Hint: help ttest. Do consider the
parameters of the test: one-sided or two-sided, equal or unequal
variance, significance level.
5.2. At the 1% significance level, test the
hypothesis of no differences in exp between households with
earnings and other sources as their main source of income
(P425r) by internet connection (A172).
5.3. From the variable P344tp create a
suitably labelled binary variable inc_m which is =1 if
above median income, and =0 otherwise. Label the values 0 and 1 within
this variable.
5.4. At the 1% significance level, test for a
difference in the mean expenditure (exp) of those above and
below median income.
5.5. From the variable P344tp create a
categorical variable inc_cat based on the quintiles of
P344tp. Label the variable inc_cat and the 5 values within
this variable. You can either do this manually or explore
help _pctile.
5.6. Using inc_cat, produce a suitably
labelled plot of the mean of exp for each of the 5 income categories.
Try to use the graph bar function.
5.7. Produce a plot of the mean of exp against
inc_cat, separately by internet connection
(A172).
A loop is an operator that repeats the same operation through a given
sequence. The three main loops are foreach,
forvalues, and while. Loops can be used to
make code more efficient and also in the design of your own
programme/function (see help program). Within a loop, you
must refer to the running argument using local notation. For example,
the following loop creates a variable that is equal to its row number
for rows 1 through 10.
. gen number = .
(5,144 missing values generated)
. forvalues i = 1/10 {
2. dis "Number `i'"
3. replace number = `i' in `i'
4. }
Number 1
(1 real change made)
Number 2
(1 real change made)
Number 3
(1 real change made)
Number 4
(1 real change made)
Number 5
(1 real change made)
Number 6
(1 real change made)
Number 7
(1 real change made)
Number 8
(1 real change made)
Number 9
(1 real change made)
Number 10
(1 real change made)
. list number in 1/10
┌────────┐
│ number │
├────────┤
1. │ 1 │
2. │ 2 │
3. │ 3 │
4. │ 4 │
5. │ 5 │
├────────┤
6. │ 6 │
7. │ 7 │
8. │ 8 │
9. │ 9 │
10. │ 10 │
└────────┘
The curly brackets denote the beginning and end of the actions
included in each loop. You can use the display command like
print() from other programming languages. If you want to
add conditions to a loop, you can do so using if,
else if, and else statements. For example,
. gen str even = ""
(5,144 missing values generated)
. forvalues i = 1/10 {
2. if mod(`i',2) == 0 {
3. dis "Number `i' is even"
4. replace even = "even" in `i'
5. }
6. else {
7. dis "Number `i' is odd"
8. replace even = "odd" in `i'
9. }
10. }
Number 1 is odd
variable even was str1 now str3
(1 real change made)
Number 2 is even
variable even was str3 now str4
(1 real change made)
Number 3 is odd
(1 real change made)
Number 4 is even
(1 real change made)
Number 5 is odd
(1 real change made)
Number 6 is even
(1 real change made)
Number 7 is odd
(1 real change made)
Number 8 is even
(1 real change made)
Number 9 is odd
(1 real change made)
Number 10 is even
(1 real change made)
. list number even in 1/10
┌───────────────┐
│ number even │
├───────────────┤
1. │ 1 odd │
2. │ 2 even │
3. │ 3 odd │
4. │ 4 even │
5. │ 5 odd │
├───────────────┤
6. │ 6 even │
7. │ 7 odd │
8. │ 8 even │
9. │ 9 odd │
10. │ 10 even │
└───────────────┘
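The other two loop types follow the same pattern. The sketch below is illustrative only: it assumes the example dataset auto.dta that ships with Stata, and the macro names v and i are arbitrary.

```stata
sysuse auto, clear

* foreach: loop over a list of variables; `v' holds the current name
foreach v of varlist price mpg weight {
    display "Variable: `v'"
    summarize `v'
}

* while: repeat until the condition fails; the counter is updated by hand
local i = 1
while `i' <= 3 {
    display "Iteration `i'"
    local i = `i' + 1
}
```

In a while loop, forgetting to update the counter leaves the condition true forever, which is one reason forvalues is usually preferred for simple numeric sequences.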
6.1. Repeat exercise 4.1., where you tabulated each
of the variables, using a foreach loop.
6.2. Using a forvalues loop, create 5
dummy variables (hhs1,…,hhs5), each indicating
one of the household size values (hhsize). Check that each
new variable takes on the right values by including the following in
your loop: tab hhs1 hhsize, miss.
Before you finish up, ensure that your do-file includes the following at the end:
log close
And check that the do-file runs without error, replicating all of your results. You can do this using the “do” icon at the top right of the do-file editing window.
Finally, don’t forget to save changes to your do-file and do NOT save changes to the data when you close Stata.