15 Reshaping
In a data frame, each row represents a unit of observation and each column represents a measurement. For example, each row might represent a respondent in a survey, and each column might represent their answers to a specific survey question.
Now suppose you re-surveyed the same respondents a year later, asking the same questions. You might consider your unit of observation to be the person, and record their new answers as a new set of columns in your data frame.
ID | income2020 | income2021 |
---|---|---|
01 | 33000 | 34000 |
02 | 25000 | 29000 |
But you might also consider the unit of observation to be the person-time combination, that is, person 01 at time one is organized as a distinct observation from person 01 at time two.
ID | year | income |
---|---|---|
01 | 2020 | 33000 |
02 | 2020 | 25000 |
01 | 2021 | 34000 |
02 | 2021 | 29000 |
This sort of data is often described as repeated measures, nested, or hierarchical data, and occurs when one conceptual unit of observation includes multiple other units of observation. Often the larger unit (“higher level”) is termed an “observation”, and we discuss measures that vary “within” these units and measure that vary “between” these units.
In data wrangling terms, the first example above is often described as wide form data, while the second example is long form. Whether you need your data in wide or long form often depends on what you want to do next with your data - some statistical and graphing functions expect data in wide form, while others expect data in long form.
Converting a data frame from long to wide, or vice versa, is called reshaping.
15.1 Reshape Wider
To reshape data from long form to wide form we need to identify three or four groups of variables:
- ID variable(s) - identify the units which are repeated
- Repetition - a variable that identifies the repetition within ID
- Varying - variables that differ within an ID
- Repeated - variables that are constant within an ID, but vary between IDs.
15.1.1 Simple Case
A simple example of long form data is the sleep
data.
sleep
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6
7 3.7 1 7
8 0.8 1 8
9 0.0 1 9
10 2.0 1 10
11 1.9 2 1
12 0.8 2 2
13 1.1 2 3
14 0.1 2 4
15 -0.1 2 5
16 4.4 2 6
17 5.5 2 7
18 1.6 2 8
19 4.6 2 9
20 3.4 2 10
Individuals in the sleep study are given an ID
, and each appears
twice. The repetitions are given by group
, the two drugs being
studied. The variable extra
is varying within ID. (There are
no “between” variables here.)
The reshape looks like this:
wide0 <-
reshape(
sleep,
direction = "wide",
idvar = "ID", # obs unit, "i"
timevar = "group", # repetition, "j"
v.names = "extra"
)
wide0
ID extra.1 extra.2
1 1 0.7 1.9
2 2 -1.6 0.8
3 3 -0.2 1.1
4 4 -1.2 0.1
5 5 -0.1 -0.1
6 6 3.4 4.4
7 7 3.7 5.5
8 8 0.8 1.6
9 9 0.0 4.6
10 10 2.0 3.4
Notice that the two variables, extra.1
and extra.2
are a combination
of the varying variable name, extra
, and the values of the repetition
variable, group
. This is an important naming pattern when reshaping
from wide to long.
Be aware, too, that once reshaped, it is easy to reshape the data in the other direction. Try:
reshape(wide0) # reverse a reshape
15.1.2 Unbalanced Data
Now consider what happens if the repetitions are not balanced, that is, not every repeated unit has the same number of repetitions.
sleep1 <- sleep[-2,] # missing a rep in ID==2
wide1 <-
reshape(
sleep1,
direction = "wide",
idvar = "ID", # obs unit, "i"
timevar = "group", # repetition, "j"
v.names = "extra"
)
wide1
ID extra.1 extra.2
1 1 0.7 1.9
3 3 -0.2 1.1
4 4 -1.2 0.1
5 5 -0.1 -0.1
6 6 3.4 4.4
7 7 3.7 5.5
8 8 0.8 1.6
9 9 0.0 4.6
10 10 2.0 3.4
12 2 NA 0.8
Missing rows in long form become missing values in wide form.
15.1.3 Multiple “Within” Variables
Often, more than one variable varies within a repeated unit.
Specify these with a vector of v.names
.
sleep2 <- sleep
sleep2$other <- sample(20)
reshape(
sleep2, # more than one "within" var
direction = "wide",
idvar = "ID", # obs unit, "i"
timevar = "group", # repetition, "j"
v.names = c("extra", "other")
)
ID extra.1 other.1 extra.2 other.2
1 1 0.7 13 1.9 11
2 2 -1.6 8 0.8 12
3 3 -0.2 20 1.1 2
4 4 -1.2 15 0.1 7
5 5 -0.1 17 -0.1 18
6 6 3.4 9 4.4 5
7 7 3.7 14 5.5 16
8 8 0.8 4 1.6 1
9 9 0.0 19 4.6 3
10 10 2.0 6 3.4 10
15.1.4 Additional Repeated Variables
You may also have additional variables that are repeated at the ID level, but do not vary within ID. You do not need to specify this final group of “between” variables - anything not already named is assumed to vary “between”.
sleep3 <- sleep2
sleep3$site <- rep(sample(5),4)
wide3 <-
reshape(
sleep3,
direction = "wide",
idvar = "ID", # obs unit, "i"
timevar = "group", # repetition, "j"
v.names = c("extra", "other")
)
wide3
ID site extra.1 other.1 extra.2 other.2
1 1 5 0.7 13 1.9 11
2 2 2 -1.6 8 0.8 12
3 3 1 -0.2 20 1.1 2
4 4 4 -1.2 15 0.1 7
5 5 3 -0.1 17 -0.1 18
6 6 5 3.4 9 4.4 5
7 7 2 3.7 14 5.5 16
8 8 1 0.8 4 1.6 1
9 9 4 0.0 19 4.6 3
10 10 3 2.0 6 3.4 10
However, you do have to be careful to declare all of your “within” varying variables.
# WARNING, "time"-varying not declared
reshape( # more than one "within" var
sleep3, # missing obs
direction = "wide",
idvar = "ID", # obs unit, "i"
timevar = "group", # repetition, "j"
v.names = c("extra")
)
Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying = varying, : some
constant variables (other) are really varying
ID other site extra.1 extra.2
1 1 13 5 0.7 1.9
2 2 8 2 -1.6 0.8
3 3 20 1 -0.2 1.1
4 4 15 4 -1.2 0.1
5 5 17 3 -0.1 -0.1
6 6 9 5 3.4 4.4
7 7 14 2 3.7 5.5
8 8 4 1 0.8 1.6
9 9 19 4 0.0 4.6
10 10 6 3 2.0 3.4
It might be nice if this produced a clear ERROR rather than a WARNING and a resulting data frame!
15.2 Reshape Longer
To go from wide to long we need to invert all of these concepts.
- ID variable - identify the units which are repeated
- Repetition - a new variable name that will identify repetitions within ID
- Varying - variables that differ within an ID (groups of columns)
- Repeated - variables that are constant within an ID, but vary between IDs (single columns).
Here again, we will only actually specify the first three types of variables, one of which will be created. An important prerequisite is that varying columns have names which can be split into a shared part and a repetition value.
reshape(wide0,
direction = "long",
idvar = "ID", # i
timevar = "group", # j
varying = c("extra.1", "extra.2")
)
ID group extra
1.1 1 1 0.7
2.1 2 1 -1.6
3.1 3 1 -0.2
4.1 4 1 -1.2
5.1 5 1 -0.1
6.1 6 1 3.4
7.1 7 1 3.7
8.1 8 1 0.8
9.1 9 1 0.0
10.1 10 1 2.0
1.2 1 2 1.9
2.2 2 2 0.8
3.2 3 2 1.1
4.2 4 2 0.1
5.2 5 2 -0.1
6.2 6 2 4.4
7.2 7 2 5.5
8.2 8 2 1.6
9.2 9 2 4.6
10.2 10 2 3.4
reshape(wide3,
direction = "long",
idvar = "ID", # i
timevar = "group", # j
varying = c("extra.1", "extra.2",
"other.1", "other.2")
)
ID site group extra other
1.1 1 5 1 0.7 13
2.1 2 2 1 -1.6 8
3.1 3 1 1 -0.2 20
4.1 4 4 1 -1.2 15
5.1 5 3 1 -0.1 17
6.1 6 5 1 3.4 9
7.1 7 2 1 3.7 14
8.1 8 1 1 0.8 4
9.1 9 4 1 0.0 19
10.1 10 3 1 2.0 6
1.2 1 5 2 1.9 11
2.2 2 2 2 0.8 12
3.2 3 1 2 1.1 2
4.2 4 4 2 0.1 7
5.2 5 3 2 -0.1 18
6.2 6 5 2 4.4 5
7.2 7 2 2 5.5 16
8.2 8 1 2 1.6 1
9.2 9 4 2 4.6 3
10.2 10 3 2 3.4 10
15.3 Reshaping Exercises
Reshape the
WorldPhones
dataset into a long format. Be sure to name the new columns appropriately.Reshape
ChickWeight
into a wide format with columns created fromTime
.
15.4 Data Wrangling Exercises
Putting everything together now,
Reshape
us_rent_income
(from thetidyr
package) so that it has one line per state, and two new columns namedestimate_income
andestimate_rent
that contain values fromestimate
.Merge this with
state.x77
, and keep all rows. Then, drop rows where any values are missing. You can do this in one or two steps.Add a column containing
state.division
.Add a column with the proportion of income spent on rent (rent / income).
Drop rows where
Area
is not greater than ten timesFrost
.Replace all spaces in all column names with dashes (e.g.,
HS Grad
toHS-Grad
).Without removing any rows, add a column with the population-weighted mean rent by geographic division.
- Which division has the highest mean rent?
- Save the resulting dataframe as a CSV file, a tab-delimited text file, and an .RData file.