2 Stata Review
The key to using Stata effectively is understanding its fundamental syntax, which applies to the vast majority of Stata commands.
2.1 Stata Commands
Stata is a command-based language. A Stata command is usually a verb, like use, generate, replace, summarize, or regress. Most commands can be abbreviated; the exceptions are those that can destroy data (thus generate can be abbreviated gen, but replace cannot be abbreviated). Some commands have subcommands, like label variable and label values. Stata normally has exactly one data set in memory, and commands act on that data set.
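As a small illustration of abbreviations and subcommands, here is a sketch that assumes the auto example data set that ships with Stata (loaded with sysuse, a command this section does not otherwise use). The variable label text is made up for the example.

```stata
* Load the example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* These three lines all run the same command
summarize mpg
sum mpg
su mpg

* label is a command with subcommands; label variable attaches a description
label variable mpg "Mileage in miles per gallon"
```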
2.2 Subsetting by Variables
If a command is followed by a list of variables, or varlist, the command will only act on those variables.
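For example, the following sketch assumes the auto example data set that ships with Stata (loaded with sysuse) and shows the same command with and without a varlist:

```stata
* Load the example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* No varlist: summarize every variable in the data set
summarize

* With a varlist: summarize only price and mpg
summarize price mpg
```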
2.3 Subsetting by Observations
If a command is followed by the word if and a logical condition, the command will only act on those observations where the condition is true.
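A few if conditions, sketched on the auto example data set that ships with Stata (an assumption of this example, as is the choice of variables). Note that equality is tested with two equals signs, and that Stata treats missing values as larger than any number, so conditions using > usually need to exclude them.

```stata
* Load the example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* Summarize price only for foreign cars (foreign is coded 0/1 in this data set)
summarize price if foreign == 1

* Conditions can be combined with & (and) and | (or)
summarize price if foreign == 1 & mpg > 25

* Missing values count as larger than any number, so exclude them explicitly
list make rep78 if rep78 > 4 & !missing(rep78)
```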
2.4 Options
If a command is followed by a comma, then anything that comes after the comma is interpreted as one or more options that change how the command runs. If an option needs additional information, like a number or the name of a variable, that information goes in parentheses after the option.
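The sketch below shows both kinds of options, again assuming the auto example data set that ships with Stata. The particular options are only illustrations: detail and missing need no further information, while level() takes a number in parentheses.

```stata
* Load the example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* detail is an option that takes no further information
summarize price, detail

* missing is also a simple option: include missing values in the table
tabulate rep78, missing

* level() needs a number, which goes in parentheses
regress price mpg, level(90)
```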
2.5 Combining Elements
These syntax elements always go in the same order:
command [varlist] [if condition] [, options]
The square brackets indicate that most of these elements are optional. The command:
sum x if y==1
tells Stata to summarize (i.e. produce summary statistics for) the variable x, but only for those observations where y is 1.
The command:
tab y, sum(x)
tells Stata to tabulate (i.e. produce a frequency table for) the variable y. The sum() option tells Stata to also calculate basic summary statistics for the variable x for the observations in each cell of the table. Because sum() needs to know what variable you want it to act on, the name of the variable goes in parentheses after the option name.
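The two commands above use placeholder variables x and y. A runnable variant that combines a varlist, an if condition, and options in one command might look like this, assuming the auto example data set that ships with Stata:

```stata
* Load the example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* command + varlist + if condition + option, in that order
summarize price mpg if foreign == 0, detail

* tabulate with an if condition and the summarize() option
tabulate rep78 if price > 5000, summarize(mpg)
```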
2.6 Creating Variables
The Stata command to create a variable is generate, usually abbreviated gen. The syntax is:
gen name = expression
where name is the name of the variable to be created and expression is some mathematical expression you write. The expression will typically include some combination of numbers, variables, and mathematical functions.
The command to change the value of an existing variable is replace, and it has the same syntax:
replace name = expression
These commands act on all the observations for a given variable (unless you limit them with an if condition), but one observation at a time: the new value for a given observation depends on the expression as evaluated for that same observation.
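Here is a short sketch of gen and replace working together, assuming the auto example data set that ships with Stata. The variable names price_k and expensive and the 10,000 cutoff are invented for the illustration.

```stata
* Load the example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* Create a new variable: price in thousands of dollars
gen price_k = price / 1000

* Create an indicator, then change it for some observations
gen expensive = 0
replace expensive = 1 if price > 10000

* Each observation's value depends only on that same observation
list make price price_k expensive if expensive == 1
```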
The egen command, short for extended generate, allows you to use a variety of useful functions like mean(), max(), and total(). Many egen functions, including all the examples listed, are aggregate functions: they take multiple values as input and give back a single value as output. Thus, unlike gen, many egen functions work across observations.
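To see the difference between gen and egen, consider this sketch on the auto example data set that ships with Stata (an assumption of the example, as are the new variable names). It also uses the bysort prefix, which this section has not covered, to compute the mean within groups.

```stata
* Load the example data set that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* gen works one observation at a time
gen price_k = price / 1000

* egen mean() looks across observations: every row gets the overall mean
egen mean_price = mean(price)

* With the bysort prefix, the mean is computed within each group
bysort foreign: egen mean_price_grp = mean(price)

* Each car's price relative to the average for its group
gen price_diff = price - mean_price_grp
```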
2.7 Do Files
A Stata do file is a text file containing Stata commands. You can write them using Stata’s built-in do file editor. While typing or clicking on commands interactively is a great way to explore your data and check your work, actual data wrangling should always be done using do files.
Working in Stata involves three separate things: commands, data, and results. A proper do file manages all three. It contains the commands, it loads and saves the data, and it records the results in a log file.
A good template to follow is:
capture log close
log using my_log_file.log, replace
clear all
use my_data
//do work
save my_new_data, replace
log close
Let’s discuss each line in turn:
capture log close tells Stata to close any open log files. If your do file crashes before it reaches the log close command at the end, its log will stay open, getting in the way of future attempts to run the do file. The capture prefix means Stata can ignore any errors the command generates. We use it here because running log close when no log file is open generates an error, and we want this to work whether a log file is open or not.
log using my_log_file.log, replace tells Stata to store all the results generated by this do file (i.e. all the text that shows up in the Results window) in my_log_file.log. Of course you’ll want to give real log files better names. My suggestion is that you give each log file the same name as the do file that creates it, so it’s clear which log file and which do file go together. Giving the file the extension .log also tells Stata that you want the log in plain text format rather than Stata Markup and Control Language (SMCL), so that other programs can read it. The replace option tells Stata it can replace old versions of the log. Without it, you can’t run the do file more than once.
clear all tells Stata to clear out any data sets in memory, along with stored results, matrices, scalars, programs, etc., so the do file starts with a blank slate every time.
use my_data loads your data set into memory (we’ll talk about alternative methods shortly). Having your do file clear whatever is in memory and then load your data fresh from disk every time it runs goes a long way toward giving you reproducibility.
Now your do file is ready to do work. When it’s done doing its work, you’ll need to do one or two steps to wrap up.
save my_new_data, replace saves the data set in memory as my_new_data.dta. The replace option again tells Stata it can overwrite old versions of the data set. You only need to do this if your do file makes changes to the data set that you want to save (data wrangling do files will; analysis do files generally will not).
Note that we’re saving the new data set with a new name, so the original data set is not changed. Never save your output data set over your input data set. If you do, you can never run your do file again, as the data set it was written to work with is gone.
log close tells Stata to stop recording results in your log file. Without this command, anything you type after running the do file will be recorded in the do file’s log.
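Putting the pieces together, a filled-in version of the template might look like the sketch below. The file names clean_auto.log and auto_clean are hypothetical, and the auto example data set that ships with Stata stands in for your own input data so that the do file can be run as written.

```stata
// clean_auto.do (hypothetical name): a filled-in version of the template
capture log close
log using clean_auto.log, replace

clear all
sysuse auto   // stands in for: use my_data

// Do work: create a variable, label it, and check it
gen price_k = price / 1000
label variable price_k "Price (thousands of dollars)"
summarize price price_k

// Save under a new name so the input data set is never overwritten
save auto_clean, replace

log close
```

Because the do file closes any open log, clears memory, and reloads its input every time, running it twice in a row should give the same log and the same output data set.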
2.8 Reproducibility
You’re probably familiar in the abstract with the importance of reproducibility to science: anyone with the proper training and equipment should be able to reproduce your experiment and get the same results. In practice we often use reproducibility as a rough proxy for truthfulness. If a result can be reproduced it’s true, and if it can’t be, it’s not.
If your experiment consists of analyzing a publicly available data set, then it may seem like allowing people to reproduce it is very simple: just tell people what data you analyzed and how, and they can do the same thing. But if you’ve ever tried to reproduce a published paper you probably discovered it’s not that easy. There are many decisions to make in the process of preparing data for analysis and then analyzing it, and those decisions can have a big impact on the results. It’s very hard to describe those decisions precisely in human languages like English, or we wouldn’t bother with programming languages like Stata.
The solution is to publish your do files and your data as well as your results.
To eliminate possible sources of confusion and error, your code should be self-sufficient, meaning it contains everything needed to start with the raw data and end with your results. The ideal—and this is easily achievable—is that someone can click a single button and reproduce your entire research project. (That doesn’t mean it has to all go in a single do file. See Project Management for suggestions about how to organize your work into do files.)
But this isn’t just about helping others or some lofty ideal of how science should work. Reproducible do files make it easy to make changes and corrections. They reduce the probability of error (you’ll make mistakes, but when you fix them they’ll stay fixed). They help you keep your work organized. They allow you to reuse code when you run into a similar problem in the future. Even if you never shared your work on a project with anyone, you’d still benefit greatly from doing everything in reproducible do files. The goal of reproducibility underlies everything we will do in Data Wrangling in Stata.