6 Working With Hierarchical Data in Wide Form

Wide form data almost always looks uglier than long form, and some tasks are always harder to do in wide form than in long form. Others can be done just as easily in wide form if you get comfortable working across columns rather than down rows. Some Python functions work exactly the same either way.

The Pythonic way to work with wide form data is with a column MultiIndex that reflects the structure of the data, just like we used a row MultiIndex in long form. However, column MultiIndexes are unique to Python and don’t have a great way to handle level 2 variables. So you’ll more commonly see wide form data where the variable name reflects both levels of the hierarchy, like age_1 for ‘age of person 1’. We’ll demonstrate both methods.

This chapter is meant to be read after Working with Hierarchical Data in Long Form. That chapter introduces the core concepts of hierarchical data (in any form), contains more example tasks, and explains them in more detail. This chapter mostly translates what you learned in the previous chapter to wide form.

6.1 Setting Up

Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Hierarchical_Wide_Practice.ipynb. Have it import Pandas and NumPy, then load acs_wide_mi.pickle as acs:

import pandas as pd
import numpy as np

acs = pd.read_pickle('acs_wide_mi.pickle')
acs

	age				race				income				female
person	1	2	3	4	1	2	3	4	1	2	3	4	1	2	3	4
household
37	20.0	19.0	19.0	NaN	White	White	Black	NaN	10000.0	5300.0	4700.0	NaN	True	True	True	NaN
241	50.0	NaN	NaN	NaN	White	NaN	NaN	NaN	32500.0	NaN	NaN	NaN	True	NaN	NaN	NaN
242	29.0	NaN	NaN	NaN	White	NaN	NaN	NaN	30000.0	NaN	NaN	NaN	True	NaN	NaN	NaN
377	69.0	NaN	NaN	NaN	White	NaN	NaN	NaN	51900.0	NaN	NaN	NaN	True	NaN	NaN	NaN
418	59.0	NaN	NaN	NaN	White	NaN	NaN	NaN	12200.0	NaN	NaN	NaN	True	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1236119	51.0	51.0	NaN	NaN	White	White	NaN	NaN	62200.0	31000.0	NaN	NaN	False	True	NaN	NaN
1236287	41.0	42.0	23.0	4.0	American Indian	White	White	White	15000.0	27000.0	11400.0	NaN	False	True	True	False
1236624	29.0	26.0	NaN	NaN	White	White	NaN	NaN	50100.0	12000.0	NaN	NaN	False	True	NaN	NaN
1236756	58.0	61.0	NaN	NaN	White	White	NaN	NaN	69800.0	40800.0	NaN	NaN	True	False	NaN	NaN
1236779	30.0	NaN	NaN	NaN	American Indian	NaN	NaN	NaN	22110.0	NaN	NaN	NaN	False	NaN	NaN	NaN

10565 rows × 16 columns

This data set is a subset of the ACS data you’ve been working with. It contains fewer variables and only the first four people in each household. However, it is in wide form, meaning that there’s one row per level 2 unit (household) and one column per variable per level one unit (person). That makes it quite wide (which is why it’s a subset of the original).

The number of columns is determined by the number of people in the largest household. Household 1236287, for example, has four people in it and thus four columns are needed to store all their information. (The full data set has a household with 16 people in it, so it requires 16 columns per variable.) However, unlike long form where different households can have different numbers of rows, in wide form every household must have the same number of columns. Households with fewer people will have NaN in the columns for people that don’t exist in their household. For example, household 37 has three people, so the columns for person 4 are always NaN.

6.2 Wide Form With a Column MultiIndex

In order to pick out a particular number from this data set, I need a household number, a person number, and a variable. For example, 20 is the age of person 1 in household 37. In the long form, household and person identify a row, and a variable identifies a column. A row MultIndex allowed Python to understand the relationship between household and person.

In this version of wide format, household identifies a row and a combination of a variable and person identify a column. The column MultiIndex allows Python to understand the relationship between variable and person. Note that we have not officially named the first level of the MultiIndex variable (we could) so we’ll just refer to it as level 0. The second level is named person, or level 1.

6.2.1 Subsetting

Up to this point, all of our column indexes have been simple strings like age or income. Now, to identify a column you need to identify both the variable and the person of interest. This is done as a tuple, for example ('age', 1) for ‘age of person 1.’ Tuples like this can be used as column names in square brackets just like the strings we’ve used before:

acs[('age',1)]

household
37         20.0
241        50.0
242        29.0
377        69.0
418        59.0
           ... 
1236119    51.0
1236287    41.0
1236624    29.0
1236756    58.0
1236779    30.0
Name: (age, 1), Length: 10565, dtype: float64

They can also be used along with a row index in loc to pick out a single number:

acs.loc[37, ('age', 1)]

20.0

Note how the household number 37 now identifies a single row, since in wide form there’s just one observation per household.

The xs() function gives you more flexibility. To select all the age variables, use:

acs.xs('age', axis=1)

person	1	2	3	4
household
37	20.0	19.0	19.0	NaN
241	50.0	NaN	NaN	NaN
242	29.0	NaN	NaN	NaN
377	69.0	NaN	NaN	NaN
418	59.0	NaN	NaN	NaN
...	...	...	...	...
1236119	51.0	51.0	NaN	NaN
1236287	41.0	42.0	23.0	4.0
1236624	29.0	26.0	NaN	NaN
1236756	58.0	61.0	NaN	NaN
1236779	30.0	NaN	NaN	NaN

10565 rows × 4 columns

The argument axis=1 clarifies that you’re talking about the column indexes. (The rows are axis 0.) You’ll be using axis=1 a lot in this chapter. xs() will assume you’re specifying level 0 (variable) unless you say otherwise.

Since all the columns are age, by default xs() drops that level from the results. You can tell it not to with drop_level=False:

acs.xs('age', axis=1, drop_level=False)

	age
person	1	2	3	4
household
37	20.0	19.0	19.0	NaN
241	50.0	NaN	NaN	NaN
242	29.0	NaN	NaN	NaN
377	69.0	NaN	NaN	NaN
418	59.0	NaN	NaN	NaN
...	...	...	...	...
1236119	51.0	51.0	NaN	NaN
1236287	41.0	42.0	23.0	4.0
1236624	29.0	26.0	NaN	NaN
1236756	58.0	61.0	NaN	NaN
1236779	30.0	NaN	NaN	NaN

10565 rows × 4 columns

You can also use xs() to select all the variables for a given person. In doing so, you need to specify that you’re selecting based on level 1 of the MultiIndex, which you can do by number or by name (person):

acs.xs(1, level=1, axis=1)

	age	race	income	female
household
37	20.0	White	10000.0	True
241	50.0	White	32500.0	True
242	29.0	White	30000.0	True
377	69.0	White	51900.0	True
418	59.0	White	12200.0	True
...	...	...	...	...
1236119	51.0	White	62200.0	False
1236287	41.0	American Indian	15000.0	False
1236624	29.0	White	50100.0	False
1236756	58.0	White	69800.0	True
1236779	30.0	American Indian	22110.0	False

10565 rows × 4 columns

This looks a lot like the long form data, but there are far fewer rows because only the first person in each household is included. If we had specified drop_level=False we’d be reminded of that.

6.2.2 Creating Level 1 Variables

In long form, creating level 1 variables is straightforward. For example, to create an indicator variable for ‘this person is Black’ you’d run something like:

acs['black'] = (acs['race']=='Black')

This acts on all the rows in the data set automatically. Since in long form the people in a household are rows, it thus takes care of all the people in a household automatically. The simple wide-form equivalent:

acs[('black', 1)] = (acs[('race', 1)]=='Black')

would only take care of person 1 in each household. Clearly this is a job for a for loop. But there’s another wrinkle: we now have lots of missing values for race, because that’s how wide form tells us a person doesn’t exist.

Let’s start by creating a single column properly, including the missing values. Creating column 4 will make it easy to see if we’ve succeeded. Recall that the np.where() function takes a condition, the result if the condition is true, and the result if the condition is false. That makes it easy to check if ('race', 4) is missing, set the new column to np.NaN if it is, and set it to the result of ('race', 4)=='Black' if it is not. Just remember that we must specify that the result is boolean with astype('boolean') or NumPy will return 1/0 instead of True/False.

acs[('black', 4)] = np.where(
    acs[('race', 4)].isna(), 
    np.NaN, 
    (acs[('race', 4)]=='Black').astype('boolean'))
acs.xs(4, level='person', axis=1, drop_level=False)

	age	race	income	female	black
person	4	4	4	4	4
household
37	NaN	NaN	NaN	NaN	NaN
241	NaN	NaN	NaN	NaN	NaN
242	NaN	NaN	NaN	NaN	NaN
377	NaN	NaN	NaN	NaN	NaN
418	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...
1236119	NaN	NaN	NaN	NaN	NaN
1236287	4.0	White	NaN	False	False
1236624	NaN	NaN	NaN	NaN	NaN
1236756	NaN	NaN	NaN	NaN	NaN
1236779	NaN	NaN	NaN	NaN	NaN

10565 rows × 5 columns

The result is NaN for all the rows we can see except 1236287, where it is correctly False, so now we’re ready to wrap that in a for loop. Before you do, drop ('black', 4) just because if you don’t it will be out of order.

acs = acs.drop(columns=('black', 4))

In order to make the code that created ('black', 4) do a different column, what needs to change is the person number in the tuples ('black', 4) and ('race', 4). That tells us we need to loop over person numbers. We can get the numbers 1 through 4 with range(1, 5) since range() leaves off the endpoint.

for person in range(1, 5):
    acs[('black', person)] = np.where(
        acs[('race', person)].isna(),
        np.NaN, 
        (acs[('race', person)]=='Black').astype('boolean')
    )

acs.xs('black', axis=1, drop_level=False)

	black
person	1	2	3	4
household
37	False	False	True	NaN
241	False	NaN	NaN	NaN
242	False	NaN	NaN	NaN
377	False	NaN	NaN	NaN
418	False	NaN	NaN	NaN
...	...	...	...	...
1236119	False	False	NaN	NaN
1236287	False	False	False	False
1236624	False	False	NaN	NaN
1236756	False	False	NaN	NaN
1236779	False	NaN	NaN	NaN

10565 rows × 4 columns

Exercise

Create an indicator variable called child for each person. Be sure it’s NaN for people who don’t exist.

Solution

for person in range(1,5):
    acs[('child', person)] = np.where(
        acs[('age', person)].isna(),
        np.NaN,
        (acs[('age', person)]<18).astype('boolean')
    )

acs.xs('child', axis=1, drop_level=False)

	child
person	1	2	3	4
household
37	False	False	False	NaN
241	False	NaN	NaN	NaN
242	False	NaN	NaN	NaN
377	False	NaN	NaN	NaN
418	False	NaN	NaN	NaN
...	...	...	...	...
1236119	False	False	NaN	NaN
1236287	False	False	False	True
1236624	False	False	NaN	NaN
1236756	False	False	NaN	NaN
1236779	False	NaN	NaN	NaN

10565 rows × 4 columns

6.2.3 Creating Level 2 Variables

Next we’ll learn how to make level 2 variables like household_income. In long form, you had to use groupby('household') to specify which rows should be aggregated. That’s not an issue in wide form, since each household has its own row. Instead you need to specify which columns should be aggregated. You can do that with xs(). For example, to calculate household_income you need all the income columns:

acs.xs('income', axis=1)

person	1	2	3	4
household
37	10000.0	5300.0	4700.0	NaN
241	32500.0	NaN	NaN	NaN
242	30000.0	NaN	NaN	NaN
377	51900.0	NaN	NaN	NaN
418	12200.0	NaN	NaN	NaN
...	...	...	...	...
1236119	62200.0	31000.0	NaN	NaN
1236287	15000.0	27000.0	11400.0	NaN
1236624	50100.0	12000.0	NaN	NaN
1236756	69800.0	40800.0	NaN	NaN
1236779	22110.0	NaN	NaN	NaN

10565 rows × 4 columns

So how do you add them up? With the sum() function as usual, but pass in axis=1 so it adds them up across columns instead of rows:

acs.xs('income', axis=1).sum(axis=1)

household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Length: 10565, dtype: float64

In long form, we needed transform() to broadcast the result of sum() to all the rows in the household. That’s not necessary in wide form, since each household only has one row.

The trick is the column name. The columns in this dataset are identified by a MultiIndex–no exceptions. So even though household income is not a property of a person, the tuple that identifies the column must have two values. We’ll use ’‘, an empty string, to denote ’this is a level 2 variable.’ Thus, to create a variable for household income, run:

acs[('household_income', '')] = acs.xs('income', axis=1).sum(axis=1)
acs[('household_income', '')]

household
37          20000.0
241         32500.0
242         30000.0
377         51900.0
418         12200.0
             ...   
1236119     93200.0
1236287     53400.0
1236624     62100.0
1236756    110600.0
1236779     22110.0
Name: (household_income, ), Length: 10565, dtype: float64

Exercise

Create an indicator variable for ‘this household contains children.’

Solution

acs[('has_children', '')] = acs.xs('child', axis=1).max(axis=1)
acs[('has_children', '')]

household
37         False
241        False
242        False
377        False
418        False
           ...  
1236119    False
1236287     True
1236624    False
1236756    False
1236779    False
Name: (has_children, ), Length: 10565, dtype: object

6.2.4 Subsetting With Aggregate Functions

So how do you calculate a level 2 variable based on a subset of the level 1 variables, like child_income? The same way as in long form: first create a level 1 variable that contains actual income for children and NaN for adults, then aggregate it. The difference is that in wide form creating a level 1 variable requires a loop over columns. It can still be easily done using the Series where() function, which takes a condition and then an other argument that specifies what the value should be changed to for rows that don’t meet the condition.

for person in range(1, 5):
    acs[('income_if_child', person)] = acs[('income', person)].where(
        acs[('age', person)]<18,
        other=np.NaN
    )

acs.xs('income_if_child', axis=1, drop_level=False).loc[8787]

With an income_if_child variable in hand for each person, all you need to do is add them up:

acs[('child_income', '')] = acs.xs('income_if_child', axis=1).sum(axis=1)

acs.xs('child_income', axis=1, drop_level=False).loc[8787]

Exercise

Create a level 2 variable containing the mean age of the adults in each household.

Solution

for person in range(1, 5):
    acs[('age_if_adult', person)] = acs[('age', person)].where(
        acs[('age', person)]>=18,
        other=np.NaN
    )

acs[('mean_adult_age', '')] = acs.xs('age_if_adult', axis=1).mean(axis=1)
acs[('mean_adult_age', '')]

6.3 Wide Form Data With Compound Variable Names

While a column MultiIndex has a certain elegance, in practice it’s more common to stick with strings for column identifiers. The string contains both levels of the index, such as age_1 (or age1) for ‘age of person 1’. But that structure is not enforced, so there’s no problem creating a variable like household_income.

The disadvantage is that it’s not as easy to specify subsets like ‘all the income variables.’ The good news is the tools you’ll learn for doing that also apply to data that’s stored as strings (i.e. text data).

Load the example data set acs_wide_strings as the new acs to see what this looks like in practice:

acs = pd.read_pickle('acs_wide_strings.pickle')
acs

6.3.1 Subsetting

Selecting a column or row and column now works just like before (no tuple required):

acs['age_1']

acs.loc[37, 'age_1']

Selecting all the columns for a variable or person gets more interesting. How can well tell Python to select all the variables that start with age or all the variables that end with 1? The answer is the DataFrame filter() function.

The filter() function allows you to subset based on patterns in the column names. A very simple pattern is like=. This will select all the columns that contain the string you pass in. Select all the income columns by passing in like='income':

acs.filter(like='income')

This works fine for now, but if our data set had a household_income variable it would be included too.

You can select all the columns for person 1 by passing in `like=’_1’:

acs.filter(like='_1')

Again, this works fine for now. But remember the original data had 16 columns per variable. like='_1' would also match the columns for persons 10 through 16.

We need the ability to be much more specific about the columns to be selected. The solution is regular expressions. Regular expressions are a syntax for describing a pattern in text so that a computer can detect text that matches it. They’ve been around for more than 50 years and most programming languages and statistical packages implement them. They can be used for a variety of purposes; we won’t discuss working with text data (strings) but we’re laying the foundation by introducing you to regular expressions. Regular expressions can be extremely complex. We’ll introduce you to just a few of the most important features but you can find many, many tutorials online.

You use a regular expression with filter() by passing in regex=. The simplest regular expression is just some text, and a string matches it if it contains that text. In other words, it does the same thing as like=:

acs.filter(regex='income')

But a regular expression can do much more. Here are the tools we’ll use:

If a regular expression starts with ^, then the string only matches if it starts with the pattern.
If a regular expression ends with $, then the string only matches if it ends with the pattern.
\d matches any digit (number).
A * after something means the string can contain any number of that thing at that point. So \d* means the string can contain any number of digits at that point.

Putting this all together, the regular expression ^income_\d*$ will match strings that start with ‘income_’ followed by any number of digits and then end. This will allow us to match the columns containing the income for each person (income_1, income_2, etc.) without also matching columns like household_income:

acs.filter(regex='^income_\d*$')

The regular expression _1$ will match strings that end with ’_1’. This excludes variables that end with ’_10’, ’_11’, etc. and thus allows us to select just the columns describing person 1:

acs.filter(regex='_1$')

Exercise

Use a regular expression to select all the race columns. Then use one to select all the variables for person 4.

Solution

acs.filter(regex='^race_\d*$')

acs.filter(regex='_4$')

6.3.2 Creating Level 1 Variables

Creating level 1 variables calls for a for loop just like with a MultiIndex. Recall that adding strings concatenates them in Python, so the equivalent of the tuple ('race', person), where person is a numeric variable, is the string 'race_' + str(person). Thus the code to create indicator variables for ‘this person is Black’ becomes:

for person in range(1, 5):
    acs['black_' + str(person)] = np.where(
        acs['race_' +str(person)].isna(),
        np.NaN, 
        (acs['race_' + str(person)]=='Black').astype('boolean')
    )

acs.filter(regex='^black_\d*$')

Exercise

Create an indicator variable for ‘this person is a child’.

Solution

for person in range(1, 5):
    acs['child_' + str(person)] = np.where(
        acs['age_' +str(person)].isna(),
        np.NaN, 
        (acs['age_' + str(person)]<18).astype('boolean')
    )

acs.filter(regex='^child_\d*$')

6.3.3 Creating Level 2 Variables

Creating level 2 variables is just a matter of replacing the xs() we used previously with the appropriate filter(). To calculate the household income, use:

acs['household_income'] = acs.filter(regex='^income_\d*$').sum(axis=1)

acs.filter(like='income')

Exercise

Create an indicator for ‘this household consists entirely of children’.

Solution

acs['all_children'] = acs.filter(regex='^child_\d*$').min(axis=1)

acs.filter(like='child').loc[73731]

6.3.4 Subsetting With Aggregate Functions

Finally, subsetting with an aggregate function is again a combination of creating level 1 variables containing just the values of interest with a loop and the aggregating them.

for person in range(1, 5):
    acs['income_if_child_' + str(person)] = acs['income_' + str(person)].where(
        acs['age_' + str(person)]<18,
        other=np.NaN
    )

acs.filter(regex='^income_if_child_\d*$').loc[8787]

acs['child_income'] = acs.filter(regex='^income_if_child_\d*$').sum(axis=1)

acs.filter(like='income').loc[8787]

Exercise

Create a level 2 variable containing the mean age of the adults in each household.

Solution

for person in range(1, 5):
    acs['age_if_adult_' + str(person)] = acs['age_' + str(person)].where(
        acs['age_' + str(person)]>=18,
        other=np.NaN
    )

acs['mean_adult_age'] = acs.filter(regex='^age_if_adult_\d*$').mean(axis=1)
acs.filter(like='age')