This is part six of Introduction to Stata. If you’re new to Stata we highly recommend starting from the beginning.
This article will teach you the basics of making new variables, modifying existing variables, and creating labels.
The primary commands for creating and changing variables are generate (usually abbreviated gen) and replace (which, like other commands that can destroy information, has no abbreviation). gen creates new variables; replace changes the values of existing variables. Their core syntax is identical:
gen variable = expression
or
replace variable = expression
where variable is the name of the variable you want to create or change, and expression is the mathematical expression whose result you want to put in it. Expressions can be as simple as a single number or involve all sorts of complicated functions. You can explore what functions are available by typing help functions. If the expression depends on a missing value at any point, the result is missing. Usually this is exactly what you’d expect and want.
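As a small illustration of how missing values propagate, the sketch below uses Stata's built-in auto data set, in which rep78 is missing for five cars. The variable name rep_doubled is made up for this example.

```stata
* Sketch: an expression that touches a missing value yields missing
sysuse auto, clear
gen rep_doubled = rep78*2          // rep_doubled is a hypothetical name
* Show the cars where the input, and therefore the result, is missing
list make rep78 rep_doubled if missing(rep78)
```

Stata should report that five missing values were generated, one for each car with no rep78.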
It’s especially important to use do files when you change your data, so start by creating a do file that loads the auto data set:
capture log close
log using vars.log, replace
clear all
sysuse auto
The prices in the auto data set are in 1978 dollars, so it might be useful to convert them to 2020 dollars. To do so you need to multiply the prices by a conversion factor which is the Consumer Price Index in 2020 divided by the Consumer Price Index in 1978, or about 4. The code will be:
gen price2020 = price*4
Add this line to your do file, run it, and examine the results with:
browse make price price2020
The prices are still generally lower than you’d see at a car dealership, but that’s probably because today’s cars are much nicer than 1978 cars. This is a good example of how to check your work: compare what you got to what you expected, and if they don’t match make sure you know why!
Internally, Stata executed a loop: it calculated price*4 for the first observation and stored the result in price2020 for the first observation, then calculated price*4 for the second observation and stored the result in price2020 for the second observation, and so forth for all the observations in the data set. You’ll learn how to stretch this one-observation-at-a-time paradigm in Data Wrangling in Stata, but tasks that break it (like calculating means) require a different approach that we’ll talk about soon.
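To make the implicit loop concrete, here is a sketch that spells it out by hand and checks that it matches the one-line version. It is for illustration only: the variable price_loop is hypothetical, and forvalues and the in qualifier are not covered in this article. You would never write it this way in practice.

```stata
* Sketch only: the loop that gen runs for you, written out explicitly
sysuse auto, clear
gen price2020 = price*4            // the usual one-line version

gen price_loop = .                 // hypothetical variable, starts as all missing
forvalues i = 1/`=_N' {
    * compute price*4 for observation i and store it in observation i
    quietly replace price_loop = price[`i']*4 in `i'
}

assert price_loop == price2020     // both approaches give the same values
```

The single gen command does the same work in one line, which is why you rarely need to write such loops yourself.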
Suppose we wanted to be a little more precise and use 4.14 as the conversion factor. You might be tempted to try to add code that “fixes” the price2020 variable (say, multiplying it by 4.14/4). But it’s simpler and cleaner to fix the code that created it in the first place. Change:
gen price2020 = price*4
to:
gen price2020 = price*4.14
and run the do file again. Because your do file loads the original data from disk every time it is run, it can simply create the price2020 variable the way it should be.
Having both price and price2020 allowed you to compare their values and check your work. But if you only want to work with 2020 dollars and are confident you’ve got the formula right, you can use the replace command to change the existing price variable instead of creating a new one:
replace price = price*4.14
Run this version and you’ll get the message (74 real changes made). Given that the data set has 74 observations this tells you that all of them were changed, as you’d expect. Once you start including if conditions, how many observations were actually changed can be very useful information.
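As a rough sketch of why that count matters, the example below converts prices for foreign cars only. The scratch variable price_demo is hypothetical, and restricting the conversion to foreign cars is purely illustrative, not something you would want in a real analysis.

```stata
* Sketch: the "real changes" count reflects the if condition
sysuse auto, clear
gen price_demo = price                              // hypothetical scratch copy of price
replace price_demo = price_demo*4.14 if foreign == 1
```

The auto data set contains 22 foreign cars, so the message should read (22 real changes made) rather than 74. A count that differs from what you expected is a signal to check your condition.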
Exercise: Outside the United States, fuel efficiency is frequently measured in liters per kilometer (note that because the fuel used is in the numerator, a low number is good). To convert miles per gallon to liters per kilometer, multiply the reciprocal of mpg (1/mpg) by 2.35. Create a variable that stores the fuel efficiency of each car in liters per kilometer.
The gen and replace commands work with string variables too. The expressions on the right side of the equals sign are not mathematical, but they follow similar rules. String values always go in quotes, so if you wanted to store the letter x in a variable called x you’d say gen x = "x". Stata would not find this confusing (though you might) because x in quotes ("x") means the letter x and x without quotes means the variable x.
Addition for strings is defined as putting one string after the other, so “abc” + “def” = “abcdef”. But most work with strings is done by special-purpose functions that take strings as input (either string values or variables containing strings) and return strings as output.
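A minimal sketch of string addition, first with literal values and then with a string variable. The variable make_year and the " (1978)" suffix are invented for this example.

```stata
* Sketch: adding strings puts one after the other
sysuse auto, clear
display "abc" + "def"                  // prints abcdef
gen make_year = make + " (1978)"       // hypothetical variable: append text to each make
list make make_year in 1/3
```

Note that a string value and a string variable can be combined in the same expression.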
The make variable really records two pieces of information: the name of the company that produced the car, and the name of the car model. You can easily extract the company name using the word() function:
gen company = word(make,1)
To see the results, run the do file and type browse make company. The first input, or argument, for the word() function is the string to act on (in this case a variable containing strings). The second is a number telling it which word you want. The function breaks the input string into words based on the spaces it contains, and returns the one you asked for, in this case the first.
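You can try word() on a literal string with display before applying it to a variable. This sketch uses "Buick Century" as an example value and shows what happens when you ask for a word that does not exist.

```stata
* Sketch: word() splits on spaces and returns the word you ask for
display word("Buick Century", 1)       // prints Buick
display word("Buick Century", 2)       // prints Century
display word("Buick Century", 3)       // prints an empty line: there is no third word
```

Asking for a word past the end of the string returns an empty string rather than an error.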
We’ll say much more about string functions in Text Data (forthcoming), but if you’re eager to get started you can do a great deal with just the following functions:
function | task |
---|---|
word() | Extracts a word from a string |
strpos() | Tells you if a string contains another string, and if so its position |
substr() | Extracts parts of a string |
subinstr() | Replaces part of a string with something else |
length() | Tells you how long a string is (how many characters it contains) |
Type help and then the name of a function in the main Stata window to learn how it works.
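As a quick, informal tour of the functions in the table, the sketch below applies each one to the literal string "Buick Century". The replacement text "GM" is just an example value.

```stata
* Sketch: try each string function on a literal value
display strpos("Buick Century", "Century")            // prints 7, where "Century" starts
display strpos("Buick Century", "Ford")               // prints 0, meaning not found
display substr("Buick Century", 1, 5)                 // prints Buick (5 characters from position 1)
display subinstr("Buick Century", "Buick", "GM", 1)   // prints GM Century (first occurrence replaced)
display length("Buick Century")                       // prints 13 (the space counts)
```

The same calls work with a string variable such as make in place of the quoted value.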
Exercise: Create a model variable containing the name of the car model (i.e. the rest of make). Your code must be able to handle model names that are either one or two words long.