9  Restructuring Data Sets

In this chapter you’ll learn how to restructure data sets. You’ll learn how to reshape data so you can switch at will between long form (one row per level one unit) and wide form (one row per level two unit). You’ll also learn how to turn a data set containing both level one and level two units into a data set containing just level two units.

9.1 Set Up

Create a new do file called restructure.do that loads acs_hh_inc.dta:

capture log close
log using restructure.log, replace
clear all
use acs_hh_inc
set linesize 120 // only needed for Jupyter Notebook

. capture log close

. log using restructure.log, replace
------------------------------------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  C:\Users\rdimond\dws\restructure.log
  log type:  text
 opened on:   5 Jan 2023, 13:28:54

. clear all

. use acs_hh_inc

. set linesize 120 // only needed for Jupyter Notebook

. 

This is the ACS, as cleaned up in the First Steps chapter, with one additional variable.

Recall that this is hierarchical data consisting of people living in households, so a person is a level one unit and a household is a level two unit. The level one and level two identifier variables are person and household, and you can confirm that they uniquely identify observations with:

duplicates report household person

Duplicates in terms of household person

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |        27410             0
--------------------------------------

The familiar variables age, race, marital_status, edu, income, female, and hispanic all describe individual persons, making them level one variables. The added variable household_income describes the household, making it a level two variable. Persons who live in the same household always have the same value of household_income.
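If you want to verify that household_income really is a level two variable, one quick check (just a sketch; assert stops your do file with an error if the condition fails for any household) is:

// Within each household, sort by household_income; first and last match only if all values match
bysort household (household_income): ///
    assert household_income[1]==household_income[_N]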

9.2 Reshape

The reshape command converts data sets between long form and wide form. It is easy to use if you understand the structure of your data set. In particular, you need to know the level one and level two identifiers, and which variables are level one variables and which are level two.

The ACS data set has one observation per person, or level one unit, so the data set is currently in long form. (Take a look at the data set in the data browser if you’re not familiar with it.) In wide form it would have one observation per household, and reshape can do that for you:

reshape wide age race marital_status edu income female hispanic, ///
i(household) j(person)

. reshape wide age race marital_status edu income female hispanic, ///
> i(household) j(person)
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations           27,410   ->   10,565      
Number of variables                  11   ->   115         
j variable (16 values)           person   ->   (dropped)
xij variables:
                                    age   ->   age1 age2 ... age16
                                   race   ->   race1 race2 ... race16
                         marital_status   ->   marital_status1 marital_status2 ... marital_status16
                                    edu   ->   edu1 edu2 ... edu16
                                 income   ->   income1 income2 ... income16
                                 female   ->   female1 female2 ... female16
                               hispanic   ->   hispanic1 hispanic2 ... hispanic16
-----------------------------------------------------------------------------

. 

The syntax of the reshape command begins by specifying the form you want the data to be in, in this case wide.

Then you give a list of all the level one variables in the data set, not including the level one identifier. This is a bit different from the usual syntax where a list of variables tells the command which variables to act on. The reshape command always reshapes the entire data set, but to do so it needs to understand which variables are level one and which are level two. The variables you list are level one variables; any variables you do not list are assumed to be level two variables. (Make sure this is true!)

Finally, the command needs to know the identifiers. The i() option specifies the level two identifier, which could be a compound identifier with multiple variables; the j() option specifies the level one identifier. It calls them i and j rather than level one and level two because reshape can be used on data with more than two levels of hierarchy.
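As an aside, if households were only numbered uniquely within each state, you would list both variables in i(). This is purely hypothetical (there is no state variable in this data set), but the command would look like:

* Hypothetical: state and household together identify a level two unit
reshape wide age race marital_status edu income female hispanic, ///
    i(state household) j(person)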

Take a moment to view the result of the reshape wide command in the data browser (it won’t fit here). As promised, there is now just one observation per household, and the household variable is now a unique identifier all by itself. Most of the variable names now have two parts: the name of the quantity described (e.g. income) followed by the number of the person being described. The person variable has been changed from part of a compound row identifier to part of a compound column identifier.
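If you’d rather see a sample in the Results window than scroll through the data browser, a command along these lines (just a sketch listing a few of the new variables) shows the naming pattern:

list household age1 age2 age3 household_income in 1/3, ab(30)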

Note that there are now 16 columns for each of the level one variables. The largest household had 16 people in it, so storing the information for all of them required 16 versions of each level one variable. Since the data set has to be rectangular, that means all the households have 16 of each level one variable, with most of them containing missing values. The household_income variable (on the far right in the data browser) remains a single variable because it is a household level variable.
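If you’re curious how many households actually have a sixteenth person, a quick check (a sketch that assumes age is never missing for a real person) is:

count if !missing(age16)   // households with a 16th person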

Reshaping from wide form to long form requires the exact same command, just replacing wide with long:

reshape long age race marital_status edu income female hispanic, ///
i(household) j(person)

. reshape long age race marital_status edu income female hispanic, ///
> i(household) j(person)
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations           10,565   ->   169,040     
Number of variables                 115   ->   11          
j variable (16 values)                    ->   person
xij variables:
                    age1 age2 ... age16   ->   age
                 race1 race2 ... race16   ->   race
marital_status1 marital_status2 ... marital_status16   ->   marital_status
                    edu1 edu2 ... edu16   ->   edu
           income1 income2 ... income16   ->   income
           female1 female2 ... female16   ->   female
     hispanic1 hispanic2 ... hispanic16   ->   hispanic
-----------------------------------------------------------------------------

. 

However, the meaning is quite different: age, race, etc. do not refer to individual variables, but to groups of variables, and j(person) does not refer to an existing variable at all. The reshape long command will identify all the variables that start with age, race, etc., then take the number that follows and store it in a new variable called person.

With reshape wide, the list of level one variables is a list of actual variables, so you can use shortcuts like age-hispanic. With reshape long you are giving a list of “stubs” of variable names, so you must list them all individually (individual stubs, that is, not individual variables). We recommend dropping any variables you won’t actually use early on in the data wrangling process, but you definitely want to get rid of them before using reshape long.
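For example, if you decided you did not need hispanic, the drop would look like one of the following, depending on which form the data are in (sketches only, not steps to run in this do file):

* In long form, before reshaping, one drop removes the variable:
drop hispanic
* In wide form, a wildcard removes all sixteen numbered copies of the stub:
drop hispanic*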

If you look in the data browser, you’ll notice a lot more missing data than there used to be. You can see the extent of the problem by running:

duplicates report household

Duplicates in terms of household

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
       16 |       169040        158475
--------------------------------------

Every household now has exactly 16 observations in it. That’s because in wide form every household had 16 of each level one variable. This will happen any time you reshape a data set from wide to long: every level two unit will end up with the same number of level one units as the largest level two unit. So now you have a bunch of observations that do not actually represent people. Fortunately you can detect and drop the extraneous level one units because they have missing values for all the level one variables:

drop if age==. & race==. & marital_status==. & ///
edu==. & income==. & female==. & hispanic==.

. drop if age==. & race==. & marital_status==. & ///
> edu==. & income==. & female==. & hispanic==.
(141,630 observations deleted)

. 

Keep in mind that real people can have missing data, so you would not want to drop observations that have a missing value for one level one variable or even a few of them. Include all the level one variables in your if condition.
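An equivalent approach, shown here as a sketch, uses egen’s rowmiss() function to count how many of the seven level one variables are missing in each observation (n_missing is just an illustrative name):

egen n_missing = rowmiss(age race marital_status edu income female hispanic)
drop if n_missing==7   // missing on all seven level one variables
drop n_missing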

Exercise 1

Load the data set nlsy_extract.dta. Identify the level one and level two units, and the level one and level two variables. Run duplicates report id. Reshape the data set to wide form, and then reshape it again to long. Run duplicates report id again. Why don’t you need to worry about extraneous observations in this case?

9.3 Creating Data Sets of Level Two Units

Sometimes you want to get rid of the level one units in your data set entirely so you’re left with a data set of level two units. For the ACS that would be a data set of households, with no individual-level variables. If all the level two variables you want already exist, that’s very easy to do.

If the data set is already in wide form this is just a matter of dropping the level one variables.
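For example, with the wide form created by reshape wide above, a sketch of that drop using wildcards would be:

drop age* race* marital_status* edu* income* female* hispanic*   // household_income is unaffected: no wildcard matches it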

If the data set is in long form, like our ACS sample, begin by dropping the level one identifier and all of the level one variables:

clear
use acs_hh_inc

drop person age race marital_status edu income female hispanic
list if household==37, ab(30)

. clear

. use acs_hh_inc

. 
. drop person age race marital_status edu income female hispanic

. list if household==37, ab(30)

       +------------------------------+
       | household   household_income |
       |------------------------------|
    1. |        37              20000 |
    2. |        37              20000 |
    3. |        37              20000 |
       +------------------------------+

. 

Now you just need to keep one observation for each household. All the observations for a given household are now identical, so it doesn’t matter which one you keep; keeping the first observation is easy. (The by prefix requires the data to be sorted by household. This data set already is, which is why by household: works below; with unsorted data you would use bysort household: instead.)

by household: keep if _n==1
list in 1/5, ab(30)

. by household: keep if _n==1
(16,845 observations deleted)

. list in 1/5, ab(30)

     +------------------------------+
     | household   household_income |
     |------------------------------|
  1. |        37              20000 |
  2. |       241              32500 |
  3. |       242              30000 |
  4. |       377              51900 |
  5. |       418              12200 |
     +------------------------------+

. 

Now you have a data set that describes the households.
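If you want to confirm that household is now a unique identifier, a quick check is isid, which produces an error if the variable does not uniquely identify the observations:

isid household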

Often, however, you need to create new level two variables based on the level one units before you can eliminate the level one units entirely. In this example we’ll create variables for “number of people in the household” and “proportion of the household that is female” as well as keeping the household income variable we created previously. You can always do that by creating the level two variables using the methods described in the previous chapter and then dropping the level one units using the method just described. But if all the variables you need to create are summary statistics, the collapse command can do the entire process for you quickly and easily.
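For comparison, a sketch of that two-step approach, using by-group tools like those described in the previous chapter, might look like this (it builds the same three household-level variables as the collapse example below):

bysort household: egen proportion_female = mean(female)
by household: gen household_size = _N
by household: keep if _n==1
keep household household_income proportion_female household_size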

The collapse command takes all the observations (level one units) for a given level two unit and aggregates them into a single observation. Thus it needs to know the level two identifier, which variables you want aggregated, and how you want to aggregate them. Most of the aggregation rules are based on summary statistics.

Reload the data, and then run the following collapse command:

clear
use acs_hh_inc

collapse ///
    (first) household_income ///
    (mean) proportion_female=female ///
    (count) household_size=person, ///
    by(household)
    
list in 1/6, ab(30)

. clear

. use acs_hh_inc

. 
. collapse ///
>     (first) household_income ///
>     (mean) proportion_female=female ///
>     (count) household_size=person, ///
>     by(household)

.     
. list in 1/6, ab(30)

     +-------------------------------------------------------------------+
     | household   household_income   proportion_female   household_size |
     |-------------------------------------------------------------------|
  1. |        37              20000                   1                3 |
  2. |       241              32500                   1                1 |
  3. |       242              30000                   1                1 |
  4. |       377              51900                   1                1 |
  5. |       418              12200                   1                1 |
     |-------------------------------------------------------------------|
  6. |       465               2600                  .5                2 |
     +-------------------------------------------------------------------+

. 

This gives you a data set of households and household-level variables as advertised. Note how the only variables remaining in the data set are the variables mentioned in the collapse command. Now let’s consider the elements of the collapse command in turn:

(first) means the variables that follow should be aggregated using the rule “keep the first value.” The household_income variable is already a level two variable, so all the values for a given level two unit are the same and we just need to keep the first one.

(mean) means the variables that follow should be aggregated using the rule “take the mean.” The female variable is binary, so taking its mean tells us the proportion of household members that are female. However, we don’t want to call the result female, so we rename it to proportion_female. It’s a bit backwards, but (mean) proportion_female=female can be read “take the mean of the female variable and call the result proportion_female.”

(count) means the variables that follow should be aggregated using the rule “count the number of non-missing values.” What we really want is the number of observations, but for any variable with no missing values that will be the same thing. If we had to, we could create such a variable (gen temp = 1), but we know that person has no missing values (identifiers rarely do), so we can use it. Again, we want to give the result a new name, and the syntax (count) household_size=person can be read “count the number of non-missing values of person and call the result household_size.”
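Just as a sketch, that fallback of creating a constant to count would look something like this (reloading the data first, since it has already been collapsed at this point):

use acs_hh_inc, clear
gen temp = 1   // a constant with no missing values, created just for counting
collapse (first) household_income (mean) proportion_female=female ///
    (count) household_size=temp, by(household)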

You can list many variables after each aggregation rule, not just one. You can also skip specifying an aggregation rule, in which case Stata will assume you want means. Type help collapse to see a list of available aggregation rules.
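For instance, a hypothetical variation (oldest is a made-up name for illustration) that lists two variables under the default rule and renames a maximum might be:

use acs_hh_inc, clear
collapse age income (max) oldest=age (first) household_income, by(household)
* age and income are aggregated with the default rule, the mean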

The by(household) option tells collapse the level two identifier so it knows which observations to aggregate.

Exercise 2

Load nlsy_extract.dta and use collapse to convert it into a data set with one row per person. The resulting data set should contain the person’s year of birth, their mean income over the study period, and their maximum educational attainment.