Introduction to Stata

8 Oct 2024

This exercise sheet is designed as an introduction to Stata for those studying Econometrics A & B.

Why Stata?

Stata is a powerful and reliable statistical analysis package. It is not a language like R or Python, but it does have command-based functionality. Those who are familiar with Stata’s commands will effectively be able to programme in scripts called do-files (named for their file extension “.do”). Stata does have its own mathematical (compiled) language, called Mata, which is used in the programming of many commands. It also includes integration with C, C++, R, and Python.

Stata remains the dominant software/language in applied Economics research (see R-bloggers post). This is a product of history (i.e. luck) and of some of Stata’s key strengths. It is software designed primarily for the analysis of cross-sectional and panel datasets, and it has grown its market share with the rise in availability of digitized household survey data (beginning in the 1980s). Stata is very efficient at analyzing small-to-medium sized datasets. This efficiency derives from the fact that it loads the data into memory. With modern computers, which typically have more than 8GB of RAM, you will have no problem analyzing a household survey.

However, this efficiency is also its “Achilles’ heel”. Stata is losing market share in the age of “big data”. While there are more efficient ways to analyse bigger datasets in Stata, it remains bound by the amount of memory available.

Stata’s dominance in empirical research also relates to its peer-reviewed nature. Commands (packages) that come pre-installed in Stata, along with those installed from the Statistical Software Components archive (using the command ssc install), are heavily vetted. For this reason, Stata is used by consultancies that conduct research for legal cases.

As the dominant research software, it is important that you have a working knowledge of Stata, even if you are fluent in other programming languages. There might be a research assistant opportunity in your future that requires you to work with Stata. For those less confident with ‘programming’, Stata is an easier introduction to data analysis.

In today’s job market, experience with a wide range of software and programming languages is extremely valuable. We encourage you to develop your knowledge of Stata while investing in languages like R and Python.

Problem Set 1

The small dataset (filename “data.dta”) is based on the Living Costs and Food Survey, 2013 (LCF). The LCF, conducted by the Office for National Statistics, collects information on spending patterns and the cost of living that reflect household budgets across the UK. This teaching dataset is a subset which has been subject to certain simplifications and additions for learning and teaching purposes by CMIST, Manchester.

This problem set will review a sample of helpful Stata commands. In addition, a do-file with some comments will be provided along with the exercises. Please work through this exercise in conjunction with the do-file. Some of the commands listed below may seem irrelevant for the analysis of this particular dataset. They have been included because of their usefulness in other settings.

Preamble

a. Create a folder on your computer where you intend to save this project. For example, “…/EC910/Seminars/Seminar-1”.

b. Download the dataset from Moodle - “data.dta” - and save/move it to the above folder.

c. Open Stata.

d. Open a new do-file, either using the do-file icon (image of pen and paper) or “Window>Do-file Editor>New Do-file Editor”.

e. Save the new do-file to the project folder; e.g. “problem-set-1.do”.

f. Give the do-file a title, author, and date. To treat text as a comment, you need to use the symbols *, //, or /* followed by */. For example,

. * ──────────────────────────────────────────────────────────────────────────── *
. * title: Problem Set 1
. * author: Neil Lloyd
. * date: 14 October 2024
. * ──────────────────────────────────────────────────────────────────────────── *
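The example above uses the * style. As a further illustrative sketch, the other two comment styles can be used in a do-file like this:

display 2 + 2    // an end-of-line comment follows two forward slashes
/* a block comment can
   span multiple lines */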

g. Open the dataset “data.dta”. You can go to “File>Open”, then locate the dataset on your computer and then select “Open”. However, doing so is not replicable. Instead, you can use the use command from within your do-file.

Add the following line to your do-file:

use "filepath/data.dta", clear

substituting the word “filepath” with the full filepath of your data. Alternatively, you can set a default directory for the project using cd. If set, this will become the default directory in which Stata searches for a file or saves a file.

cd "filepath"

use data.dta, clear

First, note that double quotation marks around the file path/name are only necessary when the path/name contains blank spaces; e.g. “problem set 1 data.dta”. This is one reason to avoid spaces in names.

Second, the option clear tells Stata to remove any existing dataset from memory before opening this one. You cannot open two datasets simultaneously. Any changes to the existing dataset will be lost unless saved. But this is not necessarily a bad thing. Read on.

h. Open a log file in which to record the output from Stata. If you have set the default directory using cd, you can do so as follows:

log using problem-set-1-log, replace

This will save the file using Stata’s own log-file format (file extension “.smcl”). It’s better to save it as a simple text file:

log using problem-set-1-log, replace text

or

log using problem-set-1-log.txt, replace

In this way, it will open in any text editor, like Notepad.

When you open a log-file, it will record all output (in the output window), until you tell it to stop: log close. This should appear at the end of your do-file.

As a suggestion, include the following before you open a log-file:

cap log close

log using problem-set-1-log.txt, replace

followed by,

log close

at the end. We won’t explain why, but it will help you avoid an annoying feature of how log-files and do-files interact.

Important principle

When you open a dataset in Stata, the software creates a copy in memory. Any changes you make to this copy will not affect the version you have saved on your hard drive, UNLESS you save those changes. It is a good principle not to save changes to the dataset, but rather to keep a record of all changes made as a list of commands in a saved do-file. This ensures replicability.

If you want to create a new dataset, save it under a new name. The command below will do just this.

save new-data.dta, replace

When you close Stata it will ask you if you want to save changes to the opened dataset. Unless you have a clear reason to do so, SAY NO. Ensure that your do-file is saved, well organized, and well commented. As a new feature of Stata 18, the software keeps a back-up of unsaved changes to the do-file (with file extension “.stswp”) in case the software or your computer crashes.

Part 1: Review the data

1.1. Review the data. You can either use the browse-icon (image of table and magnifying glass) or “Data>Data Editor>Data Editor (Browse)”. Else, in your do-file (or the main window’s command line) type browse (or simply br).

1.2. Use the describe (or simply de) command to learn about the variables in the dataset. Do you know what any of this information means?

1.3. Use the summarize command (or simply su) to recover some basic summary statistics for the variable P550tpr.

1.4. Use the detail option of summarize to learn more about the variable P344pr. Are there any abnormal values or outliers?

1.5. The summarize command doesn’t tell you anything about missing data. Check to see if the number of observations above matches the total observations in the dataset using count. Was there any missing data?

1.6. List the first 10 values of variables P550tpr and P344pr in the dataset using the command list.

1.7. Make a simple frequency table of the values in A121r using the command tabulate.

There are several other commands that can help you navigate and learn about a dataset.
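For example (an illustrative selection, not an exhaustive list):

codebook varname             // detailed description of a variable: type, range, missing values
lookfor income               // search variable names and labels for a keyword
label list                   // display the value labels defined in the dataset
count if missing(P344pr)     // count observations with a missing value
duplicates report            // check for duplicate observations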

Part 2: Graphing Basics

2.1. Make a histogram depicting the frequency of values in the variable P550tpr. Try to make it so that the y-axis shows percentages rather than density (Stata’s default). To check the options of a command, type: help histogram.

2.2. Modify the above graph so that the width of each bin is £10. Which graph is more informative?

2.3. Use Stata’s flexible twoway graph function to create a scatter plot of the relationship between P550tpr and P344pr.

2.4. Overlay the above scatter plot with a line graph of the linear fit between the same two variables. Hint: use the graphing command lfit, and see the generic sketch at the end of this part. If you can, make the colour of the fitted line red.

2.5. Replicate 2.3., weighting each observation by household size. That is, give more weight to larger households. Add a note to the graph that explains this weighting. Play around with the marker fill and outline colours so that the graph is more readable. For example, you can adjust the opacity of the fill colour of each marker by including the option: mfc(blue%20).
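As a generic illustration of overlaying twoway plots and adjusting marker and line colours (using hypothetical variables y and x rather than the dataset’s variables):

twoway (scatter y x, mfcolor(blue%20) mlcolor(blue)) ///
       (lfit y x, lcolor(red)), ///
       note("Example note text")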

Part 3: Generating variables

In Stata, the most common command for creating a new variable is generate. There are other commands, like egen and recode, that have more specific functions.
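As a rough sketch of the differences (using hypothetical variables income and agegrp):

gen loginc = ln(income)                            // generate evaluates an expression row by row
egen mean_inc = mean(income)                       // egen applies summary functions across observations
recode agegrp (1/3 = 1) (4/6 = 2), gen(agegrp2)    // recode remaps categorical values into a new variable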

3.1. Use the gen command to create a new variable, apc = P550tpr/P344pr, equal to consumption expenditure per pound of income. Next, label the variable “Average Propensity to Consume”. Hint: type help label. To check that the new label has been applied, type describe apc.

3.2. The variable P550tpr contains some very strange values at the upper end of the distribution: type sum P550tpr, det. Use the command _pctile P550tpr, p(99) in order to find the value above which 1% of the values of P550tpr lie.

3.3. Use the above information to create a new variable (called exp) that is a duplicate of P550tpr, but with the top 1% of values set to missing. In Stata, missing values for numerical variables take the value “.”. You can also use {“.a”, “.b”,…} if you want to assign different categories of missing. For string variables, a missing value is just an empty string: "".

Try to avoid using explicit numerical values in your code. For example, instead of copying the 99th percentile from question 3.2, use Stata’s stored value. The command _pctile is an r-class command (many estimation commands are e-class). You can see the stored values with: return list.
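A minimal sketch of how the stored result could be used (assuming the variable names above):

_pctile P550tpr, p(99)
return list                                        // the 99th percentile is stored in r(r1)
gen exp = P550tpr
replace exp = . if exp > r(r1) & !missing(exp)     // set the top 1% of values to missing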

3.4. Produce a suitably labelled histogram of exp.

3.5. Using the approach from 3.3, set the top 1% of values of apc to missing.

3.6. Produce a suitably labelled horizontal histogram of apc and export this graph as a .pdf file. Hint: help graph export.

Part 4: Summary statistics

4.1. Use the tabulate and/or table commands to learn about the frequency distributions of P425r, A172, SexHRP, Gorx, and A049r. Note, you can also cross-tabulate two variables; for example, tab P425r A172.

4.2. Report summary statistics for the variable exp separately by main source of income (P425r). Since exp is a continuous variable, you do not want to use tabulate. Instead, use the summarize command. This can be combined with a categorical variable in a few ways: (1) tab catvar, sum(continvar); (2) bysort catvar: sum continvar, det; (3) table catvar, stat(mean continvar) stat(sd continvar) stat(count continvar). The table option is the most flexible, as there is a wider range of statistics available, including percentiles.

4.3. Produce a suitably labelled histogram of exp separately for households according to their main source of income (P425r).

4.4. Construct summary statistics for exp for households according to both the main source of income (P425r), and internet connection (A172). Here, it is best to use the table command.

4.5. Export the table from 4.2. to an Excel spreadsheet using dtable.

Part 5: Basic hypothesis tests

5.1. At the 1% significance level, test the hypothesis of no difference in exp between households with earnings and other sources as their main source of income (P425r). Hint: help ttest. Do consider the parameters of the test: one-sided or two-sided, equal or unequal variance, significance level.

5.2. At the 1% significance level, test the hypothesis of no differences in exp between households with earnings and other sources as their main source of income (P425r) by internet connection (A172).

5.3. From the variable P344pr create a suitably labelled binary variable inc_m which is =1 if above median income, and =0 otherwise. Label the values 0 and 1 within this variable.

5.4. At the 1% significance level, test for a difference in the mean expenditure (exp) of those above and below median income.

5.5. From the variable P344pr create a categorical variable inc_cat based on the quintiles of P344pr. Label the variable inc_cat and the 5 values within this variable. You can either do this manually or explore help _pctile.

5.6. Using inc_cat, produce a suitably labelled plot of the mean of exp for each of the 5 income categories. Try to use the graph bar function.

5.7. Produce a plot of the mean of exp against inc_cat, separately by internet connection (A172).

Part 6: Loops

A loop is an operator that repeats the same operation over a given sequence. The three main loops are foreach, forvalues, and while. Loops can be used to make code more efficient and also in the design of your own programme/function (see help program). Within a loop, you must refer to the running argument using local macro notation. For example, the following loop creates a variable that is equal to its row value for rows 1 through 10.

. gen number = .
(5,144 missing values generated)

. forvalues i = 1/10 {
  2. dis "Number `i'"
  3. replace number  = `i' in `i'
  4. }
Number 1
(1 real change made)
Number 2
(1 real change made)
Number 3
(1 real change made)
Number 4
(1 real change made)
Number 5
(1 real change made)
Number 6
(1 real change made)
Number 7
(1 real change made)
Number 8
(1 real change made)
Number 9
(1 real change made)
Number 10
(1 real change made)

. list number in 1/10

     ┌────────┐
     │ number │
     ├────────┤
  1. │      1 │
  2. │      2 │
  3. │      3 │
  4. │      4 │
  5. │      5 │
     ├────────┤
  6. │      6 │
  7. │      7 │
  8. │      8 │
  9. │      9 │
 10. │     10 │
     └────────┘

The curly brackets denote the beginning and end of the actions included in each loop. You can use the display command like print() from other programming languages. If you want to add conditions to a loop, you can do so using if, else if, and else statements. For example,

. gen str even = ""
(5,144 missing values generated)

. forvalues i = 1/10 {
  2. if mod(`i',2) == 0 { 
  3. dis "Number `i' is even"
  4. replace even  = "even" in `i'
  5. }
  6. else {
  7. dis "Number `i' is odd"    
  8. replace even = "odd" in `i'
  9. }
 10. }
Number 1 is odd
variable even was str1 now str3
(1 real change made)
Number 2 is even
variable even was str3 now str4
(1 real change made)
Number 3 is odd
(1 real change made)
Number 4 is even
(1 real change made)
Number 5 is odd
(1 real change made)
Number 6 is even
(1 real change made)
Number 7 is odd
(1 real change made)
Number 8 is even
(1 real change made)
Number 9 is odd
(1 real change made)
Number 10 is even
(1 real change made)

. list number even in 1/10

     ┌───────────────┐
     │ number   even │
     ├───────────────┤
  1. │      1    odd │
  2. │      2   even │
  3. │      3    odd │
  4. │      4   even │
  5. │      5    odd │
     ├───────────────┤
  6. │      6   even │
  7. │      7    odd │
  8. │      8   even │
  9. │      9    odd │
 10. │     10   even │
     └───────────────┘

6.1. Repeat exercise 4.1., where you tabulated each of the variables, using a foreach loop.

6.2. Using a forvalues loop, create 5 dummy variables (hhs1,…,hhs5), each indicating one of the household size values (hhsize). Check that each new variable takes on the right values by including the following in your loop: tab hhs`i' hhsize, miss.

Postamble

Before you finish up, ensure that your do-file includes the following at the end,

log close

And check that the do-file runs without error, replicating all of your results. You can do this using the “do” icon at the top right of the do-file editing window.
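Alternatively, assuming the default directory has been set to the project folder with cd, you can run the do-file from the command line:

do problem-set-1.do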

Finally, don’t forget to save changes to your do-file and do NOT save changes to the data when you close Stata.