6  Working With Hierarchical Data in Wide Form

Wide form data almost always looks uglier than long form, and some tasks are always harder to do in wide form than in long form. Others can be done just as easily in wide form if you get comfortable working across columns rather than down rows. Some Python functions work exactly the same either way.

The Pythonic way to work with wide form data is with a column MultiIndex that reflects the structure of the data, just like we used a row MultiIndex in long form. However, column MultiIndexes are unique to Python and don’t have a great way to handle level 2 variables. So you’ll more commonly see wide form data where the variable name reflects both levels of the hierarchy, like age_1 for ‘age of person 1’. We’ll demonstrate both methods.

This chapter is meant to be read after Working with Hierarchical Data in Long Form. That chapter introduces the core concepts of hierarchical data (in any form), contains more example tasks, and explains them in more detail. This chapter mostly translates what you learned in the previous chapter to wide form.

6.1 Setting Up

Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Hierarchical_Wide_Practice.ipynb. Have it import Pandas and NumPy, then load acs_wide_mi.pickle as acs:

import pandas as pd
import numpy as np

acs = pd.read_pickle('acs_wide_mi.pickle')
acs
age race income female
person 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
household
37 20.0 19.0 19.0 NaN White White Black NaN 10000.0 5300.0 4700.0 NaN True True True NaN
241 50.0 NaN NaN NaN White NaN NaN NaN 32500.0 NaN NaN NaN True NaN NaN NaN
242 29.0 NaN NaN NaN White NaN NaN NaN 30000.0 NaN NaN NaN True NaN NaN NaN
377 69.0 NaN NaN NaN White NaN NaN NaN 51900.0 NaN NaN NaN True NaN NaN NaN
418 59.0 NaN NaN NaN White NaN NaN NaN 12200.0 NaN NaN NaN True NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1236119 51.0 51.0 NaN NaN White White NaN NaN 62200.0 31000.0 NaN NaN False True NaN NaN
1236287 41.0 42.0 23.0 4.0 American Indian White White White 15000.0 27000.0 11400.0 NaN False True True False
1236624 29.0 26.0 NaN NaN White White NaN NaN 50100.0 12000.0 NaN NaN False True NaN NaN
1236756 58.0 61.0 NaN NaN White White NaN NaN 69800.0 40800.0 NaN NaN True False NaN NaN
1236779 30.0 NaN NaN NaN American Indian NaN NaN NaN 22110.0 NaN NaN NaN False NaN NaN NaN

10565 rows × 16 columns

This data set is a subset of the ACS data you’ve been working with. It contains fewer variables and only the first four people in each household. However, it is in wide form, meaning that there’s one row per level 2 unit (household) and one column per variable per level one unit (person). That makes it quite wide (which is why it’s a subset of the original).

The number of columns is determined by the number of people in the largest household. Household 1236287, for example, has four people in it and thus four columns are needed to store all their information. (The full data set has a household with 16 people in it, so it requires 16 columns per variable.) However, unlike long form where different households can have different numbers of rows, in wide form every household must have the same number of columns. Households with fewer people will have NaN in the columns for people that don’t exist in their household. For example, household 37 has three people, so the columns for person 4 are always NaN.

6.2 Wide Form With a Column MultiIndex

In order to pick out a particular number from this data set, I need a household number, a person number, and a variable. For example, 20 is the age of person 1 in household 37. In the long form, household and person identify a row, and a variable identifies a column. A row MultIndex allowed Python to understand the relationship between household and person.

In this version of wide format, household identifies a row and a combination of a variable and person identify a column. The column MultiIndex allows Python to understand the relationship between variable and person. Note that we have not officially named the first level of the MultiIndex variable (we could) so we’ll just refer to it as level 0. The second level is named person, or level 1.

6.2.1 Subsetting

Up to this point, all of our column indexes have been simple strings like age or income. Now, to identify a column you need to identify both the variable and the person of interest. This is done as a tuple, for example ('age', 1) for ‘age of person 1.’ Tuples like this can be used as column names in square brackets just like the strings we’ve used before:

acs[('age',1)]
household
37         20.0
241        50.0
242        29.0
377        69.0
418        59.0
           ... 
1236119    51.0
1236287    41.0
1236624    29.0
1236756    58.0
1236779    30.0
Name: (age, 1), Length: 10565, dtype: float64

They can also be used along with a row index in loc to pick out a single number:

acs.loc[37, ('age', 1)]
20.0

Note how the household number 37 now identifies a single row, since in wide form there’s just one observation per household.

The xs() function gives you more flexibility. To select all the age variables, use:

acs.xs('age', axis=1)
person 1 2 3 4
household
37 20.0 19.0 19.0 NaN
241 50.0 NaN NaN NaN
242 29.0 NaN NaN NaN
377 69.0 NaN NaN NaN
418 59.0 NaN NaN NaN
... ... ... ... ...
1236119 51.0 51.0 NaN NaN
1236287 41.0 42.0 23.0 4.0
1236624 29.0 26.0 NaN NaN
1236756 58.0 61.0 NaN NaN
1236779 30.0 NaN NaN NaN

10565 rows × 4 columns

The argument axis=1 clarifies that you’re talking about the column indexes. (The rows are axis 0.) You’ll be using axis=1 a lot in this chapter. xs() will assume you’re specifying level 0 (variable) unless you say otherwise.

Since all the columns are age, by default xs() drops that level from the results. You can tell it not to with drop_level=False:

acs.xs('age', axis=1, drop_level=False)
age
person 1 2 3 4
household
37 20.0 19.0 19.0 NaN
241 50.0 NaN NaN NaN
242 29.0 NaN NaN NaN
377 69.0 NaN NaN NaN
418 59.0 NaN NaN NaN
... ... ... ... ...
1236119 51.0 51.0 NaN NaN
1236287 41.0 42.0 23.0 4.0
1236624 29.0 26.0 NaN NaN
1236756 58.0 61.0 NaN NaN
1236779 30.0 NaN NaN NaN

10565 rows × 4 columns

You can also use xs() to select all the variables for a given person. In doing so, you need to specify that you’re selecting based on level 1 of the MultiIndex, which you can do by number or by name (person):

acs.xs(1, level=1, axis=1)
age race income female
household
37 20.0 White 10000.0 True
241 50.0 White 32500.0 True
242 29.0 White 30000.0 True
377 69.0 White 51900.0 True
418 59.0 White 12200.0 True
... ... ... ... ...
1236119 51.0 White 62200.0 False
1236287 41.0 American Indian 15000.0 False
1236624 29.0 White 50100.0 False
1236756 58.0 White 69800.0 True
1236779 30.0 American Indian 22110.0 False

10565 rows × 4 columns

This looks a lot like the long form data, but there are far fewer rows because only the first person in each household is included. If we had specified drop_level=False we’d be reminded of that.

6.2.2 Creating Level 1 Variables

In long form, creating level 1 variables is straightforward. For example, to create an indicator variable for ‘this person is Black’ you’d run something like:

acs['black'] = (acs['race']=='Black')

This acts on all the rows in the data set automatically. Since in long form the people in a household are rows, it thus takes care of all the people in a household automatically. The simple wide-form equivalent:

acs[('black', 1)] = (acs[('race', 1)]=='Black')

would only take care of person 1 in each household. Clearly this is a job for a for loop. But there’s another wrinkle: we now have lots of missing values for race, because that’s how wide form tells us a person doesn’t exist.

Let’s start by creating a single column properly, including the missing values. Creating column 4 will make it easy to see if we’ve succeeded. Recall that the np.where() function takes a condition, the result if the condition is true, and the result if the condition is false. That makes it easy to check if ('race', 4) is missing, set the new column to np.NaN if it is, and set it to the result of ('race', 4)=='Black' if it is not. Just remember that we must specify that the result is boolean with astype('boolean') or NumPy will return 1/0 instead of True/False.

acs[('black', 4)] = np.where(
    acs[('race', 4)].isna(), 
    np.NaN, 
    (acs[('race', 4)]=='Black').astype('boolean'))
acs.xs(4, level='person', axis=1, drop_level=False)
age race income female black
person 4 4 4 4 4
household
37 NaN NaN NaN NaN NaN
241 NaN NaN NaN NaN NaN
242 NaN NaN NaN NaN NaN
377 NaN NaN NaN NaN NaN
418 NaN NaN NaN NaN NaN
... ... ... ... ... ...
1236119 NaN NaN NaN NaN NaN
1236287 4.0 White NaN False False
1236624 NaN NaN NaN NaN NaN
1236756 NaN NaN NaN NaN NaN
1236779 NaN NaN NaN NaN NaN

10565 rows × 5 columns

The result is NaN for all the rows we can see except 1236287, where it is correctly False, so now we’re ready to wrap that in a for loop. Before you do, drop ('black', 4) just because if you don’t it will be out of order.

acs = acs.drop(columns=('black', 4))

In order to make the code that created ('black', 4) do a different column, what needs to change is the person number in the tuples ('black', 4) and ('race', 4). That tells us we need to loop over person numbers. We can get the numbers 1 through 4 with range(1, 5) since range() leaves off the endpoint.

for person in range(1, 5):
    acs[('black', person)] = np.where(
        acs[('race', person)].isna(),
        np.NaN, 
        (acs[('race', person)]=='Black').astype('boolean')
    )

acs.xs('black', axis=1, drop_level=False)
black
person 1 2 3 4
household
37 False False True NaN
241 False NaN NaN NaN
242 False NaN NaN NaN
377 False NaN NaN NaN
418 False NaN NaN NaN
... ... ... ... ...
1236119 False False NaN NaN
1236287 False False False False
1236624 False False NaN NaN
1236756 False False NaN NaN
1236779 False NaN NaN NaN

10565 rows × 4 columns

Exercise

Create an indicator variable called child for each person. Be sure it’s NaN for people who don’t exist.

for person in range(1,5):
    acs[('child', person)] = np.where(
        acs[('age', person)].isna(),
        np.NaN,
        (acs[('age', person)]<18).astype('boolean')
    )

acs.xs('child', axis=1, drop_level=False)
child
person 1 2 3 4
household
37 False False False NaN
241 False NaN NaN NaN
242 False NaN NaN NaN
377 False NaN NaN NaN
418 False NaN NaN NaN
... ... ... ... ...
1236119 False False NaN NaN
1236287 False False False True
1236624 False False NaN NaN
1236756 False False NaN NaN
1236779 False NaN NaN NaN

10565 rows × 4 columns

6.2.3 Creating Level 2 Variables

Next we’ll learn how to make level 2 variables like household_income. In long form, you had to use groupby('household') to specify which rows should be aggregated. That’s not an issue in wide form, since each household has its own row. Instead you need to specify which columns should be aggregated. You can do that with xs(). For example, to calculate household_income you need all the income columns:

acs.xs('income', axis=1)
person 1 2 3 4
household
37 10000.0 5300.0 4700.0 NaN
241 32500.0 NaN NaN NaN
242 30000.0 NaN NaN NaN
377 51900.0 NaN NaN NaN
418 12200.0 NaN NaN NaN
... ... ... ... ...
1236119 62200.0 31000.0 NaN NaN
1236287 15000.0 27000.0 11400.0 NaN
1236624 50100.0 12000.0 NaN NaN
1236756 69800.0 40800.0 NaN NaN
1236779 22110.0 NaN NaN NaN

10565 rows × 4 columns

So how do you add them up? With the sum() function as usual, but pass in axis=1 so it adds them up across columns instead of rows:

acs.xs('income', axis=1).sum(axis=1)
household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Length: 10565, dtype: float64

In long form, we needed transform() to broadcast the result of sum() to all the rows in the household. That’s not necessary in wide form, since each household only has one row.

The trick is the column name. The columns in this dataset are identified by a MultiIndex–no exceptions. So even though household income is not a property of a person, the tuple that identifies the column must have two values. We’ll use ’‘, an empty string, to denote ’this is a level 2 variable.’ Thus, to create a variable for household income, run:

acs[('household_income', '')] = acs.xs('income', axis=1).sum(axis=1)
acs[('household_income', '')]
household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: (household_income, ), Length: 10565, dtype: float64
Exercise

Create an indicator variable for ‘this household contains children.’

acs[('has_children', '')] = acs.xs('child', axis=1).max(axis=1)
acs[('has_children', '')]
household
37         False
241        False
242        False
377        False
418        False
           ...  
1236119    False
1236287     True
1236624    False
1236756    False
1236779    False
Name: (has_children, ), Length: 10565, dtype: object

6.2.4 Subsetting With Aggregate Functions

So how do you calculate a level 2 variable based on a subset of the level 1 variables, like child_income? The same way as in long form: first create a level 1 variable that contains actual income for children and NaN for adults, then aggregate it. The difference is that in wide form creating a level 1 variable requires a loop over columns. It can still be easily done using the Series where() function, which takes a condition and then an other argument that specifies what the value should be changed to for rows that don’t meet the condition.

for person in range(1, 5):
    acs[('income_if_child', person)] = acs[('income', person)].where(
        acs[('age', person)]<18,
        other=np.NaN
    )

acs.xs('income_if_child', axis=1, drop_level=False).loc[8787]

With an income_if_child variable in hand for each person, all you need to do is add them up:

acs[('child_income', '')] = acs.xs('income_if_child', axis=1).sum(axis=1)

acs.xs('child_income', axis=1, drop_level=False).loc[8787]
Exercise

Create a level 2 variable containing the mean age of the adults in each household.

for person in range(1, 5):
    acs[('age_if_adult', person)] = acs[('age', person)].where(
        acs[('age', person)]>=18,
        other=np.NaN
    )

acs[('mean_adult_age', '')] = acs.xs('age_if_adult', axis=1).mean(axis=1)
acs[('mean_adult_age', '')]

6.3 Wide Form Data With Compound Variable Names

While a column MultiIndex has a certain elegance, in practice it’s more common to stick with strings for column identifiers. The string contains both levels of the index, such as age_1 (or age1) for ‘age of person 1’. But that structure is not enforced, so there’s no problem creating a variable like household_income.

The disadvantage is that it’s not as easy to specify subsets like ‘all the income variables.’ The good news is the tools you’ll learn for doing that also apply to data that’s stored as strings (i.e. text data).

Load the example data set acs_wide_strings as the new acs to see what this looks like in practice:

acs = pd.read_pickle('acs_wide_strings.pickle')
acs

6.3.1 Subsetting

Selecting a column or row and column now works just like before (no tuple required):

acs['age_1']
acs.loc[37, 'age_1']

Selecting all the columns for a variable or person gets more interesting. How can well tell Python to select all the variables that start with age or all the variables that end with 1? The answer is the DataFrame filter() function.

The filter() function allows you to subset based on patterns in the column names. A very simple pattern is like=. This will select all the columns that contain the string you pass in. Select all the income columns by passing in like='income':

acs.filter(like='income')

This works fine for now, but if our data set had a household_income variable it would be included too.

You can select all the columns for person 1 by passing in `like=’_1’:

acs.filter(like='_1')

Again, this works fine for now. But remember the original data had 16 columns per variable. like='_1' would also match the columns for persons 10 through 16.

We need the ability to be much more specific about the columns to be selected. The solution is regular expressions. Regular expressions are a syntax for describing a pattern in text so that a computer can detect text that matches it. They’ve been around for more than 50 years and most programming languages and statistical packages implement them. They can be used for a variety of purposes; we won’t discuss working with text data (strings) but we’re laying the foundation by introducing you to regular expressions. Regular expressions can be extremely complex. We’ll introduce you to just a few of the most important features but you can find many, many tutorials online.

You use a regular expression with filter() by passing in regex=. The simplest regular expression is just some text, and a string matches it if it contains that text. In other words, it does the same thing as like=:

acs.filter(regex='income')

But a regular expression can do much more. Here are the tools we’ll use:

  • If a regular expression starts with ^, then the string only matches if it starts with the pattern.

  • If a regular expression ends with $, then the string only matches if it ends with the pattern.

  • \d matches any digit (number).

  • A * after something means the string can contain any number of that thing at that point. So \d* means the string can contain any number of digits at that point.

Putting this all together, the regular expression ^income_\d*$ will match strings that start with ‘income_’ followed by any number of digits and then end. This will allow us to match the columns containing the income for each person (income_1, income_2, etc.) without also matching columns like household_income:

acs.filter(regex='^income_\d*$')

The regular expression _1$ will match strings that end with ’_1’. This excludes variables that end with ’_10’, ’_11’, etc. and thus allows us to select just the columns describing person 1:

acs.filter(regex='_1$')
Exercise

Use a regular expression to select all the race columns. Then use one to select all the variables for person 4.

acs.filter(regex='^race_\d*$')
acs.filter(regex='_4$')

6.3.2 Creating Level 1 Variables

Creating level 1 variables calls for a for loop just like with a MultiIndex. Recall that adding strings concatenates them in Python, so the equivalent of the tuple ('race', person), where person is a numeric variable, is the string 'race_' + str(person). Thus the code to create indicator variables for ‘this person is Black’ becomes:

for person in range(1, 5):
    acs['black_' + str(person)] = np.where(
        acs['race_' +str(person)].isna(),
        np.NaN, 
        (acs['race_' + str(person)]=='Black').astype('boolean')
    )

acs.filter(regex='^black_\d*$')
Exercise

Create an indicator variable for ‘this person is a child’.

for person in range(1, 5):
    acs['child_' + str(person)] = np.where(
        acs['age_' +str(person)].isna(),
        np.NaN, 
        (acs['age_' + str(person)]<18).astype('boolean')
    )

acs.filter(regex='^child_\d*$')

6.3.3 Creating Level 2 Variables

Creating level 2 variables is just a matter of replacing the xs() we used previously with the appropriate filter(). To calculate the household income, use:

acs['household_income'] = acs.filter(regex='^income_\d*$').sum(axis=1)

acs.filter(like='income')
Exercise

Create an indicator for ‘this household consists entirely of children’.

acs['all_children'] = acs.filter(regex='^child_\d*$').min(axis=1)

acs.filter(like='child').loc[73731]

6.3.4 Subsetting With Aggregate Functions

Finally, subsetting with an aggregate function is again a combination of creating level 1 variables containing just the values of interest with a loop and the aggregating them.

for person in range(1, 5):
    acs['income_if_child_' + str(person)] = acs['income_' + str(person)].where(
        acs['age_' + str(person)]<18,
        other=np.NaN
    )

acs.filter(regex='^income_if_child_\d*$').loc[8787]
acs['child_income'] = acs.filter(regex='^income_if_child_\d*$').sum(axis=1)

acs.filter(like='income').loc[8787]
Exercise

Create a level 2 variable containing the mean age of the adults in each household.

for person in range(1, 5):
    acs['age_if_adult_' + str(person)] = acs['age_' + str(person)].where(
        acs['age_' + str(person)]>=18,
        other=np.NaN
    )

acs['mean_adult_age'] = acs.filter(regex='^age_if_adult_\d*$').mean(axis=1)
acs.filter(like='age')