7  Variable Transformation With Hierarchical (Grouped) Data

Many data sets involve some sort of hierarchical structure. The American Community Survey is an example of one of the most common hierarchical data structures: individuals grouped into households. Another common hierarchical data structure is things are observed multiple times (panel/longitudinal data or repeated measures). These structures may seem very different, but the same concepts apply to both and often even the same code.

7.1 Hierarchical Data Concepts

In this section we’ll introduce some of the core concepts and vocabulary for thinking carefully about hierarchical data.

7.1.1 Levels

Hierarchical data can be described in terms of levels (note that these are not the same as the levels of a categorical variable). A level one unit is the smallest unit in the data set. A level two unit is then a group of level one units. Some examples:

EXAMPLE DATA STRUCTURE LEVEL ONE UNIT LEVEL TWO UNIT
Individuals living in households Individuals Households
States grouped into regions States Regions
Students in classes Students Classes
Tests taken by students Tests Students
Monthly observations of individuals Monthly observations Individuals
Visits to the doctor by patients Doctor visits Patients
Social media posts by individuals Posts Individuals

If the hierarchy has more than two levels, simply keep counting up: if you have students grouped into classes grouped into schools grouped into districts, students are level one, classes are level two, schools are level three, and districts are level four.

Each variable is associated with a specific level. A variable is a level two variable if all the level one units within a given level two unit have the same value for the variable. For data structures where a level two unit is observed over time, level two variables are variables that do not change over time.

EXAMPLE DATA STRUCTURE LEVEL ONE VARIABLES LEVEL TWO VARIABLES
Individuals living in households Age, Sex Household income, Address
States grouped into regions State population Regional income per capita
Students in classes Student’s race Teacher’s race, Class size
Tests taken by students Test Score Free lunch eligibility
Monthly observations of individuals Employment Status Highest degree earned
Visits to the doctor by patients BMI, Diagnosis Race
Social media posts by individuals Length, Topic Sex

Of course all of these depend on the details of the actual data set. In a study that observes individuals for a few months, it’s unlikely that their highest degree earned will change. But it might! And if it does, highest degree earned becomes a level one variable. It does not meet the definition of a level two variable because different level one units (monthly observations) have different values for it.

7.1.2 Data Forms

With a hierarchical data set, an observation (row) can represent either a level one unit or a level two unit. Consider observing two people for three months:

person_id month years_edu employed
1 1 16 True
1 2 16 True
1 3 16 True
2 1 12 False
2 2 12 True
2 3 12 True

Exercise 1

Identify the level one units, level two units, level one variable(s), and level two variable(s) in the above data set.

In this form, each observation represents a month (or more precisely, a person-month combination). Now consider the exact same data in a different form:

person_id years_edu employed1 employed2 employed3
1 16 True True True
2 12 False True True

In this form, each observation represents a person. Because years_edu is a level two variable, there’s just one value per person and thus one variable. However, employed is a level one variable with three (potentially) different values per person. Thus it needs three variables.

We call the first form the long form (or occasionally the tall form) and the second the wide form. The long form is longer because it has more rows; the wide form is wider because it has more variables. In most cases the long form is easier to work with, so we’ll do most of our examples in this form.

Now consider the identifiers in this data set. In the long form, person_id and month were a compound identifier for the rows in the data set, while the variable names were a simple column identifier. In the wide form, person_id is a simple row identifier, but now the variable names for the level one variables are a compound identifier with two parts: the variable name (employed) and the month in which the variable was observed. To identify a specific value in the data set we still need to know the person_id, month, and a variable name, but month has been converted from a row identifier to part of a compound column identifier.

7.2 Creating Level Two Variables

A very common data wrangling task is creating level two variables based on the level one variables in the data set. If the data set is in long form, one critical tool for doing so is the by prefix, so that commands are executed separately for each level two unit. Another is egen, with its aggregate functions that act across observations.

Start a do file called level2.do that loads acs.dta. This is one of the provided example data sets, but it’s the result of doing everything in the previous chapter so if you did everything in the previous chapter you can the data set you created instead if you want to.

capture log close
log using level2.log, replace

clear all
use acs

. capture log close

. log using level2.log, replace
-------------------------------------------------------------------------------
      name:  <unnamed>
       log:  C:\Users\rdimond\dws\level2.log
  log type:  text
 opened on:  27 Dec 2022, 16:15:27

. 
. clear all

. use acs

. 

We suggest having the data browser open while you work on the examples so you can see the result of what you’re doing. The book will list the key variables for a few observations, but you can skip the list commands if you look at the browser instead.

7.2.1 Continuous Variables

We’ll start with examples of calculating continuous (or quantitative) level two variables based on continuous level one variables.

For many purposes, the total income of a household is more relevant than individual incomes. You can calculate it with a combination of by and the egen total() function:

by household: egen household_income = total(income)
list household person income household_income in 1/5, ab(30)

. by household: egen household_income = total(income)

. list household person income household_income in 1/5, ab(30)

     +------------------------------------------------+
     | household   person   income   household_income |
     |------------------------------------------------|
  1. |        37        1    10000              20000 |
  2. |        37        2     5300              20000 |
  3. |        37        3     4700              20000 |
  4. |       241        1    32500              32500 |
  5. |       242        1    30000              30000 |
     +------------------------------------------------+

. 

Now calculate the income per person. Since this is the same as the mean income for the household, you might think you can use:

by household: egen income_per_person_wrong = mean(income)

In fact this does not give the right answer, as you can see by examining household 484:

list household person income household_income ///
    income_per_person_wrong ///
    if household==484, ab(30)

. list household person income household_income ///
>     income_per_person_wrong ///
>     if household==484, ab(30)

       +------------------------------------------------+
   10. | household | person | income | household_income |
       |       484 |      1 |  16800 |            34800 |
       |------------------------------------------------|
       |            income_per_person_wrong             |
       |                              17400             |
       +------------------------------------------------+

       +------------------------------------------------+
   11. | household | person | income | household_income |
       |       484 |      2 |  18000 |            34800 |
       |------------------------------------------------|
       |            income_per_person_wrong             |
       |                              17400             |
       +------------------------------------------------+

       +------------------------------------------------+
   12. | household | person | income | household_income |
       |       484 |      3 |      . |            34800 |
       |------------------------------------------------|
       |            income_per_person_wrong             |
       |                              17400             |
       +------------------------------------------------+

       +------------------------------------------------+
   13. | household | person | income | household_income |
       |       484 |      4 |      . |            34800 |
       |------------------------------------------------|
       |            income_per_person_wrong             |
       |                              17400             |
       +------------------------------------------------+

. 

The reason is that young children have missing values for income,which prompts egen to completely ignore them in the calculation. Thus for household 484, income_per_person is the total income of the household divided by the number of people who have a non-missing value for income, two, when it should be total income divided by the total number of people in the household, four.

Fortunately, these egen functions can act on mathematical expressions, not just single variables. Since mean(x) = total(x)/N = total(x/N), we can instead use:

by household: egen income_per_person = total(income/_N)
list household person income household_income ///
    income_per_person_wrong income_per_person ///
    if household==484, ab(30)

. by household: egen income_per_person = total(income/_N)

. list household person income household_income ///
>     income_per_person_wrong income_per_person ///
>     if household==484, ab(30)

       +------------------------------------------------+
   10. | household | person | income | household_income |
       |       484 |      1 |  16800 |            34800 |
       |------------------------------------------------|
       | income_per_person_wrong  |  income_per_person  |
       |                   17400  |               8700  |
       +------------------------------------------------+

       +------------------------------------------------+
   11. | household | person | income | household_income |
       |       484 |      2 |  18000 |            34800 |
       |------------------------------------------------|
       | income_per_person_wrong  |  income_per_person  |
       |                   17400  |               8700  |
       +------------------------------------------------+

       +------------------------------------------------+
   12. | household | person | income | household_income |
       |       484 |      3 |      . |            34800 |
       |------------------------------------------------|
       | income_per_person_wrong  |  income_per_person  |
       |                   17400  |               8700  |
       +------------------------------------------------+

       +------------------------------------------------+
   13. | household | person | income | household_income |
       |       484 |      4 |      . |            34800 |
       |------------------------------------------------|
       | income_per_person_wrong  |  income_per_person  |
       |                   17400  |               8700  |
       +------------------------------------------------+

. 

_N is a system variable, or a variable Stata always keeps track of and you can use even though you never created it. It contains the number of observations in the data set, or, if you are using by, the number of observations in the by group. In this case that’s the number of people in the household. Sometimes that’s useful all by itself:

by household: gen household_size = _N
list household person household_size in 1/5, ab(30)

. by household: gen household_size = _N

. list household person household_size in 1/5, ab(30)

     +-------------------------------------+
     | household   person   household_size |
     |-------------------------------------|
  1. |        37        1                3 |
  2. |        37        2                3 |
  3. |        37        3                3 |
  4. |       241        1                1 |
  5. |       242        1                1 |
     +-------------------------------------+

. 

Because Stata numbers observations starting from 1, _N is also the observation number of the last observation. We’ll find that useful as well.

7.2.1.1 Exercise 2

Create a variable for the mean age of all the individuals in the household.

7.2.2 Subsetting with egen

Sometimes the way egen ignores missing values can be useful. Consider trying to calculate the total income earned by the children of a household. You might think you could run:

by household: egen child_income_wrong = total(income) if age<18
(20,118 missing values generated)

But examine household 8787:

list household person age income child_income_wrong if household==8787, ab(30)

       +--------------------------------------------------------+
       | household   person   age   income   child_income_wrong |
       |--------------------------------------------------------|
  179. |      8787        1    45    38000                    . |
  180. |      8787        2    16     3600                 3800 |
  181. |      8787        3    15      200                 3800 |
       +--------------------------------------------------------+

child_income_wrong has the right number, but it’s not a proper household-level variable in that the parent has a missing value instead. For many purposes that would make it unusable. The reason is that egen used the condition if age<18 for two different purposes, one that you want and one that you don’t. The first is that in calculating the total it only includes observations where age<18, which is exactly what you want. The second is that it only stores the result in observations where age<18, leaving the other observations with missing values. This is not what you want.

One solution is to first create a variable income_if_child containing only the values of income that you want to be included in the total (i.e. the children’s incomes). Observations that should not be included (i.e. the adults) get a missing value. Then you can add up all the values of that variable and store the result in all the observations. egen ignore the missing values, giving you exactly what you want:

gen income_if_child = income if age<18
by household: egen child_income = total(income_if_child)

. gen income_if_child = income if age<18
(26,262 missing values generated)

. by household: egen child_income = total(income_if_child)

. 
list household person age income income_if_child child_income if household==8787, ab(30)

       +--------------------------------------------------------------------+
       | household   person   age   income   income_if_child   child_income |
       |--------------------------------------------------------------------|
  179. |      8787        1    45    38000                 .           3800 |
  180. |      8787        2    16     3600              3600           3800 |
  181. |      8787        3    15      200               200           3800 |
       +--------------------------------------------------------------------+

Now child_income is a proper household-level variable, with the same value for all the members of the household. income_if_child is no longer needed can be dropped at your convenience.

Exercise 3

Create a level two variable called mean_adult_age containing the mean age of the adults in the household. Make sure it has the same value for all the members of the household, including any children.

7.2.3 Indicator Variables

When an indicator variable is coded one/zero, the egen functions used above take on new and useful meanings. For example, the total of an indicator variable is the number of observations for which the indicator is one. Create a variable containing the number of children in each household with:

gen child=(age < 18)
by household: egen num_children = total(child)
list household person age num_children if household==484, ab(30)

. gen child=(age < 18)

. by household: egen num_children = total(child)

. list household person age num_children if household==484, ab(30)

       +-----------------------------------------+
       | household   person   age   num_children |
       |-----------------------------------------|
   10. |       484        1    33              2 |
   11. |       484        2    26              2 |
   12. |       484        3     4              2 |
   13. |       484        4     2              2 |
       +-----------------------------------------+

. 

The mean of an indicator variable is the proportion of observations for which the indicator is one. Create a variable containing the proportion of each household that is below 18 with:

by household: egen prop_children = mean(child)
list household person age prop_children if household==484, ab(30)

. by household: egen prop_children = mean(child)

. list household person age prop_children if household==484, ab(30)

       +------------------------------------------+
       | household   person   age   prop_children |
       |------------------------------------------|
   10. |       484        1    33              .5 |
   11. |       484        2    26              .5 |
   12. |       484        3     4              .5 |
   13. |       484        4     2              .5 |
       +------------------------------------------+

. 

Next consider the maximum value of an indicator variable:

by household: egen has_children = max(child)
list household person age has_children if household==37 | household==484, ab(30)

. by household: egen has_children = max(child)

. list household person age has_children if household==37 | household==484, ab(
> 30)

       +-----------------------------------------+
       | household   person   age   has_children |
       |-----------------------------------------|
    1. |        37        1    20              0 |
    2. |        37        2    19              0 |
    3. |        37        3    19              0 |
   10. |       484        1    33              1 |
   11. |       484        2    26              1 |
       |-----------------------------------------|
   12. |       484        3     4              1 |
   13. |       484        4     2              1 |
       +-----------------------------------------+

. 

If a household has no children in it, then child is always zero and the maximum value is zero. If a household has any children in it, then those children have a one for child and the maximum value of child is one. Thus has_children is a household-level indicator variable for “this household has children in it.”

More generally, applying the max() function to an indicator variable creates a new indicator variable which is one for all observations (or all observations within a by group) if the original indicator variable is one for any observation.

You can use min() in the same way: the result will be one if the indicator you apply it to is one for all observations, but the result will be zero if any observation has a zero. Use that to create an indicator variable for “all the people in this household are children”:

by household: egen all_children = min(child)
list household person age all_children if household==484, ab(30)

. by household: egen all_children = min(child)

. list household person age all_children if household==484, ab(30)

       +-----------------------------------------+
       | household   person   age   all_children |
       |-----------------------------------------|
   10. |       484        1    33              0 |
   11. |       484        2    26              0 |
   12. |       484        3     4              0 |
   13. |       484        4     2              0 |
       +-----------------------------------------+

. 

Surprisingly enough, there are some households apparently consisting entirely of children:

list household person age if all_children

       +-------------------------+
       | househ~d   person   age |
       |-------------------------|
 1490. |    73731        1    15 |
 1491. |    73731        2    15 |
 1623. |    80220        1    17 |
       +-------------------------+

All of these functions can act on an expression rather than a variable, including conditions. Thus we could have used total(age<18), mean(age<18), max(age<18), etc. rather than creating the indicator variable child and gotten the same results.

This data set has no missing values for age, but it’s worth thinking about what this code would do with them. The condition age<18 returns a zero for observations where age is missing (recall that you can think of missing as infinity), as if the person were known to be an adult. Thus in the presence of missing values num_children isn’t really the number of children in the household, it’s the number of people known to be children. But that may be exactly what you want.

Exercise 4

Create variables containing the number of college graduates in each household (Associate’s degree or above), the proportion of the household members which are college graduates, and an indicator variable for “this household contains at least one college graduate.” Then create an indicator variable for “all the adults in this household are college graduates.” Remember that edu has missing values, but since the people with missing values of edu are all less than three years old you can safely assume they’re not college graduates.

7.3 Creating Level Two Variables in Wide Form

Start a new do file, level2_wide.do that loads acs_wide.dta:

capture log close
log using level2_wide.log, replace
clear all
use acs_wide

. capture log close

. log using level2_wide.log, replace
-------------------------------------------------------------------------------
      name:  <unnamed>
       log:  C:\Users\rdimond\dws\level2_wide.log
  log type:  text
 opened on:  27 Dec 2022, 16:15:28

. clear all

. use acs_wide

. 

This is the exact same data set as before, but in wide form. Now there is just one row for each household, but it contains all the data about all the individuals in it. The income1 variable contains the income of the first person in the household, income2 the second, etc.

Now consider adding up the total income of the household. In wide form, instead of using the total() function we need the rowtotal() function. It adds things up just like total(), but while total() adds up the values of a single variable across multiple observations, rowtotal() adds up the values of multiple variables within a single observation.

However, the input rowtotal() needs is quite different. Rather than acting on a single mathematical expression, it acts on a list of variables, or varlist. When a Stata command or function takes a varlist this means both that it needs a list of variables and that it will understand certain shortcuts for specifying that list. In this case we want to act on all the income variables, but there are sixteen of them (one household has sixteen people in it) and typing them all out would be tiresome. So we’ll take a brief digression into shortcuts for specifying lists of variables.

7.3.1 Shortcuts for Variable Lists

The most common shortcut is to use the asterisk (*) as a wildcard character. Try:

describe income*

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
income1         long    %10.0g                1 income
income2         long    %10.0g                2 income
income3         long    %10.0g                3 income
income4         long    %10.0g                4 income
income5         long    %10.0g                5 income
income6         long    %10.0g                6 income
income7         long    %10.0g                7 income
income8         long    %10.0g                8 income
income9         long    %10.0g                9 income
income10        long    %10.0g                10 income
income11        long    %10.0g                11 income
income12        long    %10.0g                12 income
income13        long    %10.0g                13 income
income14        long    %10.0g                14 income
income15        long    %10.0g                15 income
income16        long    %10.0g                16 income

This tells the describe command to act on all variables that match the pattern “income followed by anything.” Note that “anything” can include nothing, so a variable just called “income” would be included as well. The wildcard can go anywhere:

describe *1

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
age1            byte    %10.0g                1 age
race1           byte    %25.0g     race_label
                                              1 race
marital_status1 byte    %13.0g     marital_status_label
                                              1 marital_status
edu1            byte    %24.0g     edu_label
                                              1 edu
income1         long    %10.0g                1 income
female1         float   %9.0g                 1 female
hispanic1       float   %9.0g                 1 hispanic
age11           byte    %10.0g                11 age
race11          byte    %25.0g     race_label
                                              11 race
marital_stat~11 byte    %13.0g     marital_status_label
                                              11 marital_status
edu11           byte    %24.0g     edu_label
                                              11 edu
income11        long    %10.0g                11 income
female11        float   %9.0g                 11 female
hispanic11      float   %9.0g                 11 hispanic

This matches all the variables with information about the first individual in the household, but also the variables with information about the eleventh individual. Be careful your wildcards don’t match more than what you want!

A question mark (?) is also a wildcard, but it matches exactly one character:

describe income?

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
income1         long    %10.0g                1 income
income2         long    %10.0g                2 income
income3         long    %10.0g                3 income
income4         long    %10.0g                4 income
income5         long    %10.0g                5 income
income6         long    %10.0g                6 income
income7         long    %10.0g                7 income
income8         long    %10.0g                8 income
income9         long    %10.0g                9 income

This matches income1 through income9, but not income10 because it is income followed by two characters.

Another shortcut is to put a dash (-) between two variables. This will give you all the variables in between them:

describe age1-hispanic1

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
age1            byte    %10.0g                1 age
race1           byte    %25.0g     race_label
                                              1 race
marital_status1 byte    %13.0g     marital_status_label
                                              1 marital_status
edu1            byte    %24.0g     edu_label
                                              1 edu
income1         long    %10.0g                1 income
female1         float   %9.0g                 1 female
hispanic1       float   %9.0g                 1 hispanic

This gives you just the variables with information about the first individual. The order used in resolving this shortcut is the order the variables are listed in the variables window or a describe command. You can use the order command to change the order of the variables to something convenient.

A varlist can mix multiple kinds of shortcuts as well as individual variable names:

describe household age1-hispanic1 income*

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
household       double  %8.0g                 Household serial number
age1            byte    %10.0g                1 age
race1           byte    %25.0g     race_label
                                              1 race
marital_status1 byte    %13.0g     marital_status_label
                                              1 marital_status
edu1            byte    %24.0g     edu_label
                                              1 edu
income1         long    %10.0g                1 income
female1         float   %9.0g                 1 female
hispanic1       float   %9.0g                 1 hispanic
income1         long    %10.0g                1 income
income2         long    %10.0g                2 income
income3         long    %10.0g                3 income
income4         long    %10.0g                4 income
income5         long    %10.0g                5 income
income6         long    %10.0g                6 income
income7         long    %10.0g                7 income
income8         long    %10.0g                8 income
income9         long    %10.0g                9 income
income10        long    %10.0g                10 income
income11        long    %10.0g                11 income
income12        long    %10.0g                12 income
income13        long    %10.0g                13 income
income14        long    %10.0g                14 income
income15        long    %10.0g                15 income
income16        long    %10.0g                16 income

7.3.2 Using Row Functions

Many of the tasks we carried out in long form can easily be done in wide form, with three changes:

  • Remove the by prefix. In wide form a level 2 group is a single observation, not a group of them.
  • Replace the egen function with its row equivalent.
  • Replace the variable to act on with a variable list matching all the corresponding level one variables.

Thus the long form command:

by household: egen household_income = total(income)

becomes in wide form:

egen householdIncome = rowtotal(income*)

However, row functions cannot take expressions as arguments, like (age < 18). You can take these expressions and turn them into variables, but this turns out to be complicated in wide form.

7.3.3 Creating Level 1 Variables in Wide Form

In this section we’ll demonstrate very briefly how to write loops in Stata. For a broader and deeper introduction to looping, see Stata Programming Essentials. Definitely read it if you find this brief introduction confusing.

In long form, we ran gen child = (age < 18) to create an indicator for “this person is a child.” But in wide form we need child1 for “person 1 in this household is a child”, child2 for “person 2 in this household is a child” up through child16. Writing out 16 gen commands would be straightforward, but tedious. It’s much easier to use a loop:

forvalues i = 1/16 {
    gen child`i' = (age`i' < 18) if age`i' < .
}

. forvalues i = 1/16 {
  2.     gen child`i' = (age`i' < 18) if age`i' < .
  3. }
(2,779 missing values generated)
(6,191 missing values generated)
(7,876 missing values generated)
(9,355 missing values generated)
(10,115 missing values generated)
(10,388 missing values generated)
(10,484 missing values generated)
(10,531 missing values generated)
(10,544 missing values generated)
(10,554 missing values generated)
(10,557 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)

. 

Consider each line in turn:

forvalues i = 1/16 {

forvalues tells Stata we’re going to carry out the following code for a set of values (i.e. execute the code once for each value). 1/16 defines the set of values as the integers 1 through 16 (Stata knows from context that we’re not doing division here). Each value is stored in the macro i. Finally, the { tells Stata this is where the code to be executed begins.

gen child`i' = (age`i' < 18) if age`i' < .

Most of this is familiar, except for `i'. This is how you use the values of a macro. When Stata sees a name that starts with a left single quote (backtick) and ends with a right single quote (apostrophe) it will replace the name with the corresponding value of the macro before it executes the code. In this case, the forvalues loop defines the macro `i' to contain first the number 1, then the number 2, and so forth all the way up to 16. So what Stata actually runs is:

gen child1 = (age1 < 18) if age1 < .
gen child2 = (age2 < 18) if age2 < .
...
gen child16 = (age16 < 18) if age16 < .

Note that in long form age had no missing values, but in wide form the age variables will have missing values for all the people who don’t exist (e.g. if a household has three people in it age4 and higher will be missing). It’s very important that the child variable also be missing for these people.

Finally, } tells Stata this is the end of the code to be executed. It must go on its own line.

This that we have a set of indicator variables for “this person is a child”, we can use row functions to carry out the same tasks we carried out in long form. For example, create variables for the number of children in the household and an indicator for “this household has children” with:

egen num_children = rowtotal(child*)
egen has_children = rowmax(child*)

. egen num_children = rowtotal(child*)

. egen has_children = rowmax(child*)

. 

Note how num_children and has_children are single variables, not sets of variables like child1, child2 etc. That’s because child is a person-level variable with one value per person in the household, while num_children and has_children are household-level variables with one value each for the entire household.

Exercise 5

Use a loop to create a set of person-level indicator variables for “this person is black.” Then create household-level indicator variables for “this household has at least one black member” and “all the members of this household are black.”

7.4 Panel Data

Panel data, or longitudinal data, are data where subjects are observed repeatedly over time and the timing is important. If timing isn’t important then we call it repeated measures data. The National Longitudinal Survey of Youth is an example of panel data, and we’ll use a small extract from it as an example. (Note that this extract combines income variables from different years with slightly different definitions into a single income variable, so you really wouldn’t want to use this extract for actual research.) Create a do file called panel.do that loads it:

capture log close
log using panel.txt, replace
clear all
use nlsy_extract

. capture log close

. log using panel.txt, replace
-------------------------------------------------------------------------------
      name:  <unnamed>
       log:  C:\Users\rdimond\dws\panel.txt
  log type:  smcl
 opened on:  27 Dec 2022, 16:15:28

. clear all

. use nlsy_extract

. 

Exercise 6

As a review, spend some time getting familiar with this data set. In particular:

  • Identify the identifier variables. What is a level one unit in this data set? What is a level two unit?
  • Which variables are level one variables? Which are level two variables?
  • What type of variable is edu?
  • What does it suggest about the data collection process that income and edu are frequently missing for the same observation? What does it tell you about age that it is never missing?

Most of the techniques we learned for working with individuals in households carry over directly to panel data. For example, to find the total income earned during the study period, run:

by id: egen total_income = total(income)

To find the age of the subject the first time they appear in the study, run:

by id: egen starting_age = min(age)

But this relies on the fact that age always increases over time. What if you wanted to know their income the first time they appear in the study? Recall that income[1] means “the value of income for the first observation.” When combined with by it means “the value of income for the first observation in the by group.” This is highly convenient, but you need to make sure that the observations for each subject are in chronological order so their first observation really is the first time they appear in the study:

sort id year
by id: gen startingIncome = income[1]

. sort id year

. by id: gen startingIncome = income[1]
(109,953 missing values generated)

. 

You need to be careful because Stata’s default sorting algorithm is not stable. This means it will put ties in whatever order will make it run fastest. So if you run sort id, or bysort id:, the observations for each person could be in any order. In practice, if the data are already sorted or mostly sorted the order that will make the sort run fastest is usually to leave things alone. But you can’t count on that. So if you’re going to run code that depends on the sort order, like assuming the first observation is the first chronologically, be sure the data are actually in the right order.

7.4.1 Exercise 7

Create ending_income, the subject’s income in the last year of the study. Recall that _N is the observation number of the last observation. (Yes, for many people ending_income will be missing.)

7.4.2 Events and Event History

Often with panel data you’ll need to identify particular events or sequences of events. For example, suppose you need to identify the year in which each subject graduated from high school. A subject graduated from high school in a given year if they have 12 years of education in that year and less than 12 years of education the year before.

Normally Stata works row by row. If you say:

gen y = x

you’re saying “set the value of y for each observation to the value of x for the same observation. As it does so, Stata keeps track of which observation it is working on in a system variable called _n. So x[_n] is just the same as x: the value of x for the current observation. But x[_n-1] means”the value of x for the observation before the current observation.” And that’s extremely useful. In particular, you can detect the year a person graduates from high school using the rule “the person has 12 years of education this year and less than 12 years of education the year before with:

by id: gen grad = (edu==12 & edu[_n-1]<12)

To check your work, calculate the number of times this rule identified each person as having graduated from high school:

by id: egen times_grad = total(grad)
tab times_grad

. by id: egen times_grad = total(grad)

. tab times_grad

 times_grad |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    135,033       56.02       56.02
          1 |    106,001       43.98      100.00
------------+-----------------------------------
      Total |    241,034      100.00

. 

Note that the frequencies above do not indicate how many people graduated one or zero times, but how many person-year combinations belong to people who graduate one or zero times. Be careful about summary statistics with hierarchical data!

The zeroes are not surprising: some subjects graduated before the study began, or didn’t graduate at all, or missing data prevented you from identifying the year they graduated. Values of two or more would indicate a problem, either with your code or with the data, so it’s good you don’t have any.

Speaking of missing data, what does the code for creating grad do with it? Missing values are neither 12 nor less than 12, so the condition (edu==12 & edu[_n-1]<12) will be false if edu is missing for either the current observation or the observation before it. So grad==1 can be interpreted as “we know the person graduated this year” but grad==0 could mean either “we know the person didn’t graduate this year” or “we don’t know if the person graduated this year or not because of missing data.”

In the command:

by id: egen times_grad = total(grad)

the by id: prefix tells the total() function which observations it should aggregate when calculating the total: all the ones with the same value of id. But the command:

by id: gen grad = (edu==12 & edu[_n-1]<12)

doesn’t do any aggregating. So why do we need by id: there?

Consider the first observation for the second person. edu[_n-1] is the value of edu for the observation before it. Without by id:, that would be the last value of edu for person 1. That’s not what we want! by id: tells Stata to treat each group defined by id as if it were its own data set. Thus the first observation for each person has _n equal to 1 and there’s no observation before it–_n-1 is zero. Some languages would give you an error like “index out of bounds” if you asked for the value of edu for the zeroeth observation, but Stata just returns a missing value. That means grad is always 0 for the first observation for each person, which makes sense. If someone starts the study with 12 years of education we can’t tell if they just graduated from high school or have had 12 years of education for some time.

Exercise 8

Identify the year in which the subject started college (if they did). For simplicity, assume everyone starts college in the fall and only starts claiming to have completed a year of college in the following year. (Hint: if someone’s values of education for three years were 11, 12, and 13, they both graduated from high school and started college in the middle year.) Some people in the NLSY report that their education level increased by more than one year in a single year, so make sure your code allows for that.

Check your work by seeing how many times people started college. If you find people who started more than once, determine whether this is due to a problem with your code or with the data.

7.4.3 Level 2 Variables Based on a Special Level 1 Observation

Sometimes you need to create a level two variable which is just the value of a level one variable for a particular level one unit, such as the subject’s age at the time they graduated from high school. Here’s one way to do that:

sort id grad
by id: gen age_at_grad = age[_N] if grad[_N]
sort id year

. sort id grad

. by id: gen age_at_grad = age[_N] if grad[_N]
(135,033 missing values generated)

. sort id year

. 

Sorting by id and grad puts the observation in which the subject graduated last, so you can get the subject’s age in that year with age[_N]. However, recall that for many subjects we could not identify the year they graduated. For them, grad is always zero, which observation is last after the sort is completely arbitrary (remember Stata’s default sort is not stable), and age_at_grad should be missing. The condition if grad[_N] ensures that age_at_grad is not set if the last observation is not a a year in which they graduated.

Don’t forget to put the data back in chronological order with sort id year if that’s how you’re used to using it.

Alternatively you can treat this as a subsetting problem and use the techniques described above for calculations on subsets:

gen age_if_grad = age if grad
by id: egen age_at_grad2 = mean(age_if_grad)

. gen age_if_grad = age if grad
(235,455 missing values generated)

. by id: egen age_at_grad2 = mean(age_if_grad)
(135,033 missing values generated)

. 

Since people graduate at most once, the subset of observations where grad is 1 (true) has just one observation per person. The mean of one number is just that number. Thus we’re really using the mean function to “broadcast” that number to all the observations for the same person.

These methods give exactly the same result, so you can use whichever you prefer. The sorting method will be slower with large data sets, but there are many things you can do with a clever sort and we wanted you to have that in your bag of tricks.

Exercise 9

Create a person-level variable for “income at age 25.”