5 Variable Transformations With Numeric Data

In this chapter we’ll discuss variable transformations: creating and changing variables. You should already be familiar with the basics of variable transformations–if you’re not, Introduction to Stata: Creating and Changing Variables will get you up to speed. This chapter will add to your “bag of tricks” for working with numeric data.

5.1 Setting Up

Start a do file called numeric_transforms.do that loads acs.dta.

capture log close
log using transforsms.log, replace

clear all
use acs

-------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /home/r/rdimond/kb/dws/transforsms.log
  log type:  text
 opened on:   6 Jul 2023, 22:26:44

5.2 Transformations and Missing Data

If an observation has a missing value for a variable, it should usually, but not always, have a missing value for any transformations of that variable.

Arithmatic operators (addition, subtraction, etc.) do this automatically. For example, to correct the incomes in this year 2000 ACS data set for inflation, we can conver them to (January) 2023 dollars by multipying them by 1.77. But look at the result for household 484:

gen income_2023 = income * 1.77
list household person age income income_2023 if household==484, ab(30)

(6,144 missing values generated)

       +-------------------------------------------------+
       | household   person   age   income   income_2023 |
       |-------------------------------------------------|
   10. |       484        1    33    16800         29736 |
   11. |       484        2    26    18000         31860 |
   12. |       484        3     4        .             . |
   13. |       484        4     2        .             . |
       +-------------------------------------------------+

The two children in the household have missing values for income, so they automatically got a missing value for income_2023. This is almost always what you want.

However, egen functions behave differently. In general, they act on the observed data and ignore the missing data. For example, consider calcualting household incomes using the egen total function (and by):

by household: egen household_income = total(income)
list household person age income household_income if household==484, ab(30)


       +------------------------------------------------------+
       | household   person   age   income   household_income |
       |------------------------------------------------------|
   10. |       484        1    33    16800              34800 |
   11. |       484        2    26    18000              34800 |
   12. |       484        3     4        .              34800 |
   13. |       484        4     2        .              34800 |
       +------------------------------------------------------+

(For much more on by and working with groups like this read the chapter Variable Transformation With Hierarchical (Grouped) Data.)

The total function added up all the observed values of income and ignored the missing values. This had the effect of treating the missing values as zeros, but that isn’t a general rule. The mean function will calculate the mean of the observed values, which has the effect of treating missing values as if they were at the mean. Be sure to think through whether the behavior of egen makes sense for your particular task.

Conditions never return a missing value, even if they involve missing values. For example, consider creating an indicator for high_income with the threshold for high being $50,000:

gen high_income_oops = (income > 50000)
list household person age income high_income_oops if household==484, ab(30)


       +------------------------------------------------------+
       | household   person   age   income   high_income_oops |
       |------------------------------------------------------|
   10. |       484        1    33    16800                  0 |
   11. |       484        2    26    18000                  0 |
   12. |       484        3     4        .                  1 |
   13. |       484        4     2        .                  1 |
       +------------------------------------------------------+

Oops.

Recall that the way Stata handles missing values is to designate the 27 largest possible values of each variable type as the various flavors of missing (., .a, .b, .c, up though .z). But when it comes to conditions (unlike arithmatic operators or egen functions) missing values are just treated like the big numbers they really are. Thus the following inequalities hold:

Missing Values and Inequalities

Any observed value or normal number < . < .a < .b < .c … < .x < .y < .z

That’s why the children, with their missing values for income got a 1 for high_income: to Stata . is biggger than 50000 or any other number.

The solution is to exclude observations with missing values from the process of creating the new variable entirely. Thus they’ll be left with a missing value, which is what you want.

gen high_income= (income > 50000) if income < .
list household person age income high_income if household==484, ab(30)

(6,144 missing values generated)

       +-------------------------------------------------+
       | household   person   age   income   high_income |
       |-------------------------------------------------|
   10. |       484        1    33    16800             0 |
   11. |       484        2    26    18000             0 |
   12. |       484        3     4        .             . |
   13. |       484        4     2        .             . |
       +-------------------------------------------------+

Why income < . rather than income != .? Because income < . excludes all the missing values (., .a, .b, etc.), not just ..

However, subtle differences in what you want your variable to mean can change how you handle missing values. Suppose you wanted an indicator for “this person is known to have a high income.” Now anyone with a missing value for income should have a 0 for the new indicator because they are not known to have a high income. But you can’t just ignore the missing values, or people with a missing value will get a 1. Instead, a person should get a 1 for known_high_income if their income is both high and known:

gen known_high_income = (income > 50000) & (income < .)
list household person age income known_high_income if household==484, ab(30)


       +-------------------------------------------------------+
       | household   person   age   income   known_high_income |
       |-------------------------------------------------------|
   10. |       484        1    33    16800                   0 |
   11. |       484        2    26    18000                   0 |
   12. |       484        3     4        .                   0 |
   13. |       484        4     2        .                   0 |
       +-------------------------------------------------------+

5.3 Conditional Transformations (Transformations that Vary for Subsets)

Often a variable transformation must be different for different subsets of the data. For example, suppose you wanted to define a counterfactual income for each person which is:

17000 + 100 * age if they do not have a bachelor’s degree
30000 + 500 * age if they do have a bachelor’s degree or higher

(Note that if those numbers came from running a regression model you could use the predict command to automatically calculate its predicted values. Also note that it would be a terrible model.)

We’ll explore several ways of creating this counterfactual income (cfi) variable.

First off, let’s remind ourselves how edu is defined so we can figure out how to identify people with bachelor’s degrees or higher. Run describe edu to get the name of the value labels associated with it:

describe edu


Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
edu             byte    %24.0g     edu_label
                                              Educational Attainment

Next use label list edu_label to see what the labels mean:

label list edu_label

edu_label:
           0 Not in universe
           1 None
           2 Nursery school-4th grade
           3 5th-6th grade
           4 7th-8th grade
           5 9th grade
           6 10th grade
           7 11th grade
           8 12th grade, no diploma
           9 High School graduate
          10 Some college, <1 year
          11 Some college, >=1 year
          12 Associate degree
          13 Bachelor's degree
          14 Master's degree
          15 Professional degree
          16 Doctorate degree

The presense of the “Not in universe” label is a good reminder that the Census Bureau did not collect education data for people under the age of three, so we set those values to missing. But let’s assume that no one under the age of three can possibly have a bachelor’s degree.

Thus you can identify people with a bachelor’s degree or higher with (edu >= 13) & (edu < .). The second condition excludes people with a missing value for edu. Without it they’d be included because, to Stata, . is greater than 13.

The direct way to calculate this is using the cond() function. The cond() function takes three arguments:

A condition
The result if the condition is true
The result if the condition is false

In this case, the first argument will be edu>=13 & edu<.. If that is true, then the person has a bachelor’s degree or higher and the formula for cfi is 30000 + 500*age, so that’s the second argument. If that is false, the person does not have a bachelor’s degree and the formula is 17000 + 100*age, so that’s the third argument.

gen cfi1 = cond( ///
    edu>=13 & edu<., ///
    30000 + 500*age, ///
    17000 + 100*age ///
)

list household person income age edu cfi1 in 1/5


     +-------------------------------------------------------------------+
     | househ~d   person   income   age                      edu    cfi1 |
     |-------------------------------------------------------------------|
  1. |       37        1    10000    20   Some college, >=1 year   19000 |
  2. |       37        2     5300    19   Some college, >=1 year   18900 |
  3. |       37        3     4700    19   Some college, >=1 year   18900 |
  4. |      241        1    32500    50          Master's degree   55000 |
  5. |      242        1    30000    29        Bachelor's degree   44500 |
     +-------------------------------------------------------------------+

This is a great way to carry out this transformation as long as you remembe the cond() function, and eventually you will.

But it’s easy to do this transformation using basic math if you break it into multiple steps. Note that we can recast:

17000 + 100*age if they do not have a bachelor’s degree
30000 + 500*age if they do have a bachelor’s degree or higher

as:

17000 + 100*age for everyone
Add 13000 + 400*age if they have a bachelor’s degree or higher

In other words, a base amount for everyone and then an additional amount for peope with bachelor’s degrees. That can be implemented with:

gen cfi2 = 17000 + 100*age

replace cfi2 = cfi2 + 13000 + 400*age if edu>=13 & edu<.

list household person income age edu cfi2 in 1/5

(4,485 real changes made)

     +-------------------------------------------------------------------+
     | househ~d   person   income   age                      edu    cfi2 |
     |-------------------------------------------------------------------|
  1. |       37        1    10000    20   Some college, >=1 year   19000 |
  2. |       37        2     5300    19   Some college, >=1 year   18900 |
  3. |       37        3     4700    19   Some college, >=1 year   18900 |
  4. |      241        1    32500    50          Master's degree   55000 |
  5. |      242        1    30000    29        Bachelor's degree   44500 |
     +-------------------------------------------------------------------+

Setting a variable equal to itself plus something is a very common and useful pattern.

Now consider this alternative:

gen bachelors = (edu>=13 & edu<.)

gen cfi3 = 17000 + 100*age + bachelors*(13000 + 400*age)

list household person income age edu cfi3 in 1/5


     +-------------------------------------------------------------------+
     | househ~d   person   income   age                      edu    cfi3 |
     |-------------------------------------------------------------------|
  1. |       37        1    10000    20   Some college, >=1 year   19000 |
  2. |       37        2     5300    19   Some college, >=1 year   18900 |
  3. |       37        3     4700    19   Some college, >=1 year   18900 |
  4. |      241        1    32500    50          Master's degree   55000 |
  5. |      242        1    30000    29        Bachelor's degree   44500 |
     +-------------------------------------------------------------------+

This relies on the fact that the indicator variable bachelors will be 1 if the condition edu>=13 & edu<. is true and 0 if it is false. Thus if the person has a bachelors degree, the additional income (13000 + 400*age) is multiplied by 1 and added to cfi2, but if they do not it is multiplied by zero and disappears.

Getting used to this kind of logic is very helpful for interpreting regression models with indicator or categorical variables.

You don’t have to create the indicator variable bachelors if you don’t want to:

gen cfi4 = 17000 + 100*age + (edu>=13 & edu<.)*(13000 + 400*age)

list household person income age edu cfi3 in 1/5


     +-------------------------------------------------------------------+
     | househ~d   person   income   age                      edu    cfi3 |
     |-------------------------------------------------------------------|
  1. |       37        1    10000    20   Some college, >=1 year   19000 |
  2. |       37        2     5300    19   Some college, >=1 year   18900 |
  3. |       37        3     4700    19   Some college, >=1 year   18900 |
  4. |      241        1    32500    50          Master's degree   55000 |
  5. |      242        1    30000    29        Bachelor's degree   44500 |
     +-------------------------------------------------------------------+

Stata recognizes that there’s a condition in your mathematical expression and evaluates it as either 1 or 0 automatically. This is very concise, but it’s hard to read because you have to recognize that part of it is a condition. If you use this code like this, definitely add a comment that explains how it works and points out the condition. But I’d suggest using one of the alternatives–it’s more important that your code be clear than it be concise.

5.4 Conditions with Many Values

The values of edu that mean “bachelor’s degree or higher” are 13, 14, 15, and 16. It’s highly convenient to specify them using the inequalities (edu >= 13) & (edu < .) and by all means you should do that where possible. But sometimes the values you need can’t be easily described that way. You could specify them with a series of conditions combined with logical or:

(edu==13) | (edu==14) | (edu==15) | (edu==16)

You’re probably thinking “Surely there’s an easier way” and there is: the inlist() function. The first argument of inlist() is a variable, and the rest are possible values of that variable. The result is 1 (true) if the actual value of the variable is in the list, and 0 (false) if it is not. Use it wherever you would use a condition, including inside mathematical expressions:

gen cfi5 = 17000 + 100*age

replace cfi5 = cfi2 + 13000 + 400*age if inlist(edu, 13, 14, 15, 16)

gen cfi6 = 17000 + 100*age + inlist(edu, 13, 14, 15, 16)*(13000 + 400*age)

list household person income age edu cfi6 in 1/5, ab(30)

(4,485 real changes made)

     +--------------------------------------------------------------------+
     | household   person   income   age                      edu    cfi6 |
     |--------------------------------------------------------------------|
  1. |        37        1    10000    20   Some college, >=1 year   19000 |
  2. |        37        2     5300    19   Some college, >=1 year   18900 |
  3. |        37        3     4700    19   Some college, >=1 year   18900 |
  4. |       241        1    32500    50          Master's degree   55000 |
  5. |       242        1    30000    29        Bachelor's degree   44500 |
     +--------------------------------------------------------------------+

Note that you can use inlist() with a string variable, in which case the values must be strings.

The related inrange() function takes three arguments: a variable, and the beginning and end of a range. It returns 1 if the variable is in the range specified and 0 otherwise. Thus inrange(edu, 13, 16) is equivalent to (edu >= 13) & (edu <= 16).

5.5 Creating Quantiles

Dividing a data set into quantiles is a very common task, and is easily carried out using the xtile command. For example, you can assign the adults in the ACS to their income quintile with:

xtile inc_quint = income if age>=18, n(5)

The n() option specifies the number of quantiles.

The resulting will be an ordered categorical variable, frequently used for grouping with the by prefix or with over() in graphs. For example, you can create a box plot of income over the five quintiles with:

graph box income, over(inc_quint)

5.6 Creating Categorical Variables From Cutpoints

An alternative way to use xtile is to define a categorical variable based on a continuous varialble with cutpoints that define the categories. As an example, we’ll use it to categorize age by decade.

The cutpoints must be put in a variable in the data set, normally one you’ll create just for this purpose, but the values won’t actually belong to the observations they’re attached to. Rather the value of the cutpoint for the first observation will be the maximum value for the first category, the value for the second observation will be the maximum for the second category, etc. The maximum value is included in the category (i.e. the rule is value <= max). You should have more observations than categories, so the rest will get missing values. Normally you’ll drop this variable as soon as you’re done with it.

For decades we want the cutpoints to be 9, 19, 29 up through 99 (recall that in the ACS 93 really means “93 or older”). You can do that with:

gen cutpoints = _n*10 - 1 if _n<=10
list cutpoint in 1/12, ab(30)

(27,400 missing values generated)

     +-----------+
     | cutpoints |
     |-----------|
  1. |         9 |
  2. |        19 |
  3. |        29 |
  4. |        39 |
  5. |        49 |
     |-----------|
  6. |        59 |
  7. |        69 |
  8. |        79 |
  9. |        89 |
 10. |        99 |
     |-----------|
 11. |         . |
 12. |         . |
     +-----------+

Recall that _n is the observation number.

You can set competely arbitrary cutpoints if you need to. This is one of the few places where you might use in for actual work. For example, if you wanted your first category to be everyone below 18 and your second to be everyone betweeen 18 and 30, you could do:

gen cutpoints = 18 in 1
replace cutpoints = 30 in 2

You can now run xtile with the cut() option, which tells it to use cutpoints instead of quantiles and the name of the variable that contains the cutpoints.

xtile decade = age, cut(cutpoints)

list age decade in 1/5


     +--------------+
     | age   decade |
     |--------------|
  1. |  20        3 |
  2. |  19        2 |
  3. |  19        2 |
  4. |  50        6 |
  5. |  29        3 |
     +--------------+

The categories seem off because they are numbered starting with 1, so right now a 1 for decade means ages 0-9, 2 means ages 10-19, etc. While the values of a categorical variable are completely arbitrary in theory, if you subtract 1 and then multiply by 10, 0-9 will be 0, 10-19 will be 10, 20-29 will be 20, etc. and that’s a good bit more intuitive.

replace decade = (decade - 1)*10

list age decade in 1/5

(27,410 real changes made)

     +--------------+
     | age   decade |
     |--------------|
  1. |  20       20 |
  2. |  19       10 |
  3. |  19       10 |
  4. |  50       50 |
  5. |  29       20 |
     +--------------+

Now it’s ready for use, for example plotting the mean income by decade:

graph hbar income, over(decade)

5.7 Extracting Digits From Numbers

You may be thinking “That’s nice, but couldn’t we have gotten decade by just dividing age by 10?” Almost: you want to divide by ten and get rid of the fractional part. You can do that by telling Stata that the new variable you’re creating should be an integer:

gen int decade2 = age/10

list age decade2 in 1/5


     +---------------+
     | age   decade2 |
     |---------------|
  1. |  20         2 |
  2. |  19         1 |
  3. |  19         1 |
  4. |  50         5 |
  5. |  29         2 |
     +---------------+

Stata will do the division normally but, because the decade2 variable is an integer, when it comes time to store the result any fractional part is discaded. Thus if age/10 is 19, only the 1 is stored. This integer division is surprisingly useful.

Note that this time ages 0-9 are decade 0, but you probably still want to multiply by 10 so 20-29 is 20 instead of 2.

replace decade2 = decade * 10

list age decade in 1/5

(23,448 real changes made)

     +--------------+
     | age   decade |
     |--------------|
  1. |  20       20 |
  2. |  19       10 |
  3. |  19       10 |
  4. |  50       50 |
  5. |  29       20 |
     +--------------+

If you wanted just the last digit of age, you could get that with the mod() (modulus) function. It takes two arguments and returns the remainder when the first is divided by the second:

gen year = mod(age, 10)
list age year in 1/5


     +------------+
     | age   year |
     |------------|
  1. |  20      0 |
  2. |  19      9 |
  3. |  19      9 |
  4. |  50      0 |
  5. |  29      9 |
     +------------+

The combination of integer division and modulus allows you to split numbers up at will. For example, the FIPS code for Dane County, Wisconsin is 55025, where 55 is the code for Wisconsin. But if you want to do analysis at the state and county level, you may need separate state and county variables. You can do that with:

gen fips = 55025

gen int state = fips/1000

gen county = mod(fips, 1000)

list fips state county in 1


     +------------------------+
     |  fips   state   county |
     |------------------------|
  1. | 55025      55       25 |
     +------------------------+

(Note that for many purposes you can use state and the original FIPS code, as the original code unqiquely identifies counties.)

The county variable does not include a leading zero–numeric variables never do. Most of the time that’s not a problem: FIPS codes do not use 25 and 025 to mean different counties. You can tell Stata to print a numeric variable with leading zeros. But if you really need 25 and 025 to mean different things store the variable as a string.

::{.callout-note} ### Exercise

Given the telephone number 6082629917, extract the area code. :::

5.8 Rounding

You might be tempted to think of the integer division gen int state = fips/1000 as “rounding to the nearest 1000”, but it’s actually quite different from rounding. You’ll see why if you try to generate decade by rounding to the nearest 10. The round() function takes two arguments: the number to be rounded and the second the number to round to.

gen decade3 = round(age, 10)

list age decade3 in 1/5


     +---------------+
     | age   decade3 |
     |---------------|
  1. |  20        20 |
  2. |  19        20 |
  3. |  19        20 |
  4. |  50        50 |
  5. |  29        30 |
     +---------------+

Integer division set decade2 to 1 for observation 2: dividing 19 by 10 gives 1.9, but the .9 is discarded. Rounding (decade3) rounded observation 2’s age up to 20. Which one is right depends on what you’re trying to do.