Introduction to Stata: Creating and Changing Variables

This is part six of Introduction to Stata. If you’re new to Stata we highly recommend starting from the beginning.

This article will teach you the basics of making new variables, modifying existing variables, and creating labels.

Generate and Replace

The primary commands for creating and changing variables are generate (usually abbreviated gen) and replace (which, like other commands that can destroy information, has no abbreviation). gen creates new variables; replace changes the values of existing variables. Their core syntax is identical:

gen variable = expression

or

replace variable = expression

where variable is the name of the variable you want to create or change, and expression is the mathematical expression whose result you want to put in it. Expressions can be as simple as a single number or involve all sorts of complicated functions. You can explore what functions are available by typing help functions. If the expression depends on a missing value at any point, the result is missing. Usually this is exactly what you’d expect and want.
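To see how missing values propagate through an expression, here is a minimal sketch on a three-observation toy data set. The variable names x and y are made up for illustration, and set obs, _n, and in are used only to build the example quickly:

```stata
clear
set obs 3              // create an empty data set with three observations
gen x = _n             // x is 1, 2, 3 (_n is the observation number)
replace x = . in 2     // make the second value of x missing
gen y = x*2 + 1        // y is 3 in row 1, missing in row 2, 7 in row 3
list
```

The second observation of y is missing because the expression that produces it depends on a missing x.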

It’s especially important to use do files when you change your data, so start by creating a do file that loads the auto data set:

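capture log close            // close any log left open by a previous run
log using vars.log, replace  // start a new log, overwriting the old one

clear all                    // remove any data currently in memory
sysuse auto                  // load the auto data set that comes with Stata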

The prices in the auto data set are in 1978 dollars, so it might be useful to convert them to 2020 dollars. To do so you need to multiply the prices by a conversion factor which is the Consumer Price Index in 2020 divided by the Consumer Price Index in 1978, or about 4. The code will be:

gen price2020 = price*4

Add this line to your do file, run it, and examine the results with:

browse make price price2020

The prices are still generally lower than you’d see at a car dealership, but that’s probably because today’s cars are much nicer than 1978 cars. This is a good example of how to check your work: compare what you got to what you expected, and if they don’t match make sure you know why!
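One way to make this kind of check more systematic is sketched below. It assumes you are happy to use summarize and assert, neither of which this article has covered, alongside browsing the data:

```stata
sysuse auto, clear
gen price2020 = price*4
summarize price price2020        // the mean, min, and max of price2020 should be 4 times those of price
assert price2020 == price*4      // stops the do file with an error if any observation fails the check
```

The exact comparison in assert works here because multiplying by 4 gives whole numbers. With a non-integer factor the stored result is rounded, so an exact equality test like this one can fail even though the calculation is correct.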

Internally, Stata executed a loop: it calculated price*4 for the first observation and stored the result in price2020 for the first observation, then calculated price*4 for the second observation and stored the result in price2020 for the second observation, and so forth for all the observations in the data set. You’ll learn how to stretch this one-observation-at-a-time paradigm in Data Wrangling in Stata, but tasks that break it (like calculating means) require a different approach that we’ll talk about soon.
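The sketch below spells that implicit loop out by hand, purely as an illustration. It is not how you would normally write Stata code, and it relies on forvalues, macros, and the in qualifier, which this article does not cover. The variable name price_loop is hypothetical:

```stata
sysuse auto, clear
gen price2020 = price*4              // the usual way: Stata handles every observation for you

gen price_loop = .                   // start with a variable that is missing everywhere
forvalues i = 1/`=_N' {              // _N is the number of observations
    quietly replace price_loop = price*4 in `i'   // fill in observation i only
}

assert price_loop == price2020       // both approaches give the same values
```

The single gen command does the same work as the explicit loop, and does it faster and more readably.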

Suppose we wanted to be a little more precise and use 4.14 as the conversion factor. You might be tempted to try to add code that “fixes” the price2020 variable (say, multiplying it by 4.14/4). But it’s simpler and cleaner to fix the code that created it in the first place. Change:

gen price2020 = price*4

to:

gen price2020 = price*4.14

and run the do file again. Because your do file loads the original data from disk every time it is run, it can simply create the price2020 variable the way it should be.

Having both price and price2020 allowed you to compare their values and check your work. But if you only want to work with 2020 dollars and are confident you’ve got the formula right, you can use the replace command to change the existing price variable instead of creating a new one:

replace price = price*4.14

Run this version and you’ll get the message (74 real changes made). Given that the data set has 74 observations this tells you that all of them were changed, as you’d expect. Once you start including if conditions, how many observations were actually changed can be very useful information.
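If you want to confirm the number of changes yourself, one possible approach is to keep a copy of the original values and count the observations that differ. This is only a sketch: price1978 is a made-up name, and count with an if condition has not been introduced at this point in the article.

```stata
sysuse auto, clear
gen price1978 = price             // keep a copy of the original values
replace price = price*4.14        // change price in place
count if price != price1978       // number of observations whose price changed
```

The count should match the number of changes replace reported.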

Exercise: Outside the United States, fuel efficiency is frequently measured as liters of fuel per distance traveled (usually quoted per 100 kilometers; this exercise uses liters per kilometer to keep the arithmetic simple). Because the fuel used is in the numerator, a low number is good. To convert miles per gallon to liters per kilometer, multiply the reciprocal of mpg (1/mpg) by 2.35. Create a variable that stores the fuel efficiency of each car in liters per kilometer.

A LOT OF SKIPPED CONTENT HERE

String Variables

The gen and replace commands work with string variables too. The expressions on the right side of the equals sign are not mathematical, but they follow similar rules. String values always go in quotes, so if you wanted to store the letter x in a variable called x you’d say gen x = "x". Stata would not find this confusing (though you might) because x in quotes ("x") means the letter x and x without quotes means the variable x.

Addition for strings is defined as putting one string after the other, so "abc" + "def" = "abcdef". But most work with strings is done by special-purpose functions that take strings as input (either string values or variables containing strings) and return strings as output.
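A small sketch of these rules on a one-observation toy data set follows. The variable names are made up for illustration:

```stata
clear
set obs 1
gen x = "x"              // quoted: stores the letter x in a string variable called x
gen y = x                // unquoted: copies the contents of the variable x
gen first = "abc"
gen both = first + "def" // string addition joins the pieces: "abcdef"
list
```

Note that both operands of + must be strings. Adding a string to a number produces a type mismatch error.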

The make variable really records two pieces of information: the name of the company that produced the car, and the name of the car model. You can easily extract the company name using the word() function:

gen company = word(make,1)

To see the results, run the do file and type browse make company. The first input, or argument, for the word() function is the string to act on (in this case a variable containing strings). The second is a number telling it which word you want. The function breaks the input string into words based on the spaces it contains, and returns the one you asked for, in this case the first.
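To experiment with the second argument without creating variables, you can pass word() a literal string and print the result with display. The string below is just an example value:

```stata
display word("Chev. Monte Carlo", 1)    // Chev.
display word("Chev. Monte Carlo", 2)    // Monte
display word("Chev. Monte Carlo", -1)   // Carlo (negative numbers count from the end)
```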

We’ll say much more about string functions in Text Data (forthcoming), but if you’re eager to get started you can do a great deal with just the following functions:

Function      Task
word()        Extracts a word from a string
strpos()      Tells you if a string contains another string, and if so its position
substr()      Extracts parts of a string
subinstr()    Replaces part of a string with something else
length()      Tells you how long a string is (how many characters it contains)

Type help and then the name of a function in the main Stata window to learn how it works.
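As a quick, non-exhaustive illustration of the other functions listed above, the sketch below applies each one to a literal string using display. The string and the replacement text are example values only:

```stata
display strpos("Chev. Monte Carlo", "Monte")                  // 7: "Monte" starts at character 7
display strpos("Chev. Monte Carlo", "Ford")                   // 0: the string does not contain "Ford"
display substr("Chev. Monte Carlo", 1, 5)                     // Chev. (5 characters starting at position 1)
display subinstr("Chev. Monte Carlo", "Chev.", "Chevrolet", .) // Chevrolet Monte Carlo
display length("Chev. Monte Carlo")                           // 17
```

The final argument of subinstr() is the number of occurrences to replace. A period (missing) means replace all of them.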

Exercise: Create a model variable containing the name of the car model (i.e. the rest of make). Your code must be able to handle model names that are either one or two words long.