17 Merging
Merging is the process of combining multiple datasets into a single dataset. Examples include adding inflation factors to panel data to adjust income to today’s rates, or adding county-level statistics to individual-level data.
Load dplyr
and the airquality
dataset.
library(dplyr)
air <- airquality
The actual act of merging two datasets usually involves only one or two lines of code. Most of the work when merging happens before this, in preparing dataframes for merging and in deciding which rows we want in the output.
17.1 Preparing Dataframes for Merging
The air
dataframe has a numeric vector of month numbers stored in the Month
column.
air$Month
[1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6
[45] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
[89] 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9
[133] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
As an exercise in merging, we can create a small dataframe with month names and then merge that with the air
dataframe.
month.name
is a character vector of month names included with R.
month.name
[1] "January" "February" "March" "April" "May" "June" "July"
[8] "August" "September" "October" "November" "December"
How can we add this information to the air
dataframe? We need a key, an identifier that we can use to match one set of information to another. A common example of a key in the social sciences is a person ID, which can be used to link information across multiple datasets.
We first need to turn this character vector into a dataframe:
myMonths <- as.data.frame(month.name)
Now, we can add a key column. A good choice would be the month number, since this is already in the air
dataframe, and it is quite easy to add to myMonths
since the months are already sorted chronologically. Create a new column with the name Month
. Having the same name for the same data in two dataframes makes merging easier.
myMonths <-
myMonths %>%
mutate(Month = 1:12)
Sometimes you will have data that is in one dataframe but not in another. Maybe an individual dropped out of a study, or their data was lost, or they could not be contacted for a follow-up survey. Whatever the reason, this is a common problem, so for practice, drop June from myMonths
with filter()
. (See the chapter on Subsetting for more about filter()
.)
In other words, the case just described is when a row in one dataset has no matches in another. Another case is when a row in one dataset has multiple matches in the other. To illustrate this, we can add a duplicate of May called “May2” with add_row()
.
myMonths <-
myMonths %>%
filter(!month.name %in% "June") %>%
add_row(month.name = "May2", Month = 5)
Finally, rename the month.name
column. It can be confusing if month.name
and myMonths$month.name
have the same name but are different vectors with different lengths.
myMonths <-
myMonths %>%
rename(Month_Name = month.name)
17.2 Merging Two Dataframes
Now that we have a dataframe with month names, we can merge it with our air
dataframe.
Base R and the tidyverse have their own functions for merging. Both work well in pipes and provide you options for which rows to keep from the first (“x” or “left”) and second (“y” or “right”) dataframes.
Below are four pairs of equivalent functions, where the first in each pair uses merge()
from base R and the second uses one of the _join()
functions from dplyr
. Note that the default settings of merge()
make it equivalent to inner_join()
.
We do not always need to explicitly name our key column (i.e., we can choose to not type by = "Month"
) and the functions will figure out the key column on their own. However, it is good practice to do so, to maintain knowledge and control of your code.
Remember, air
only has months May through September, while myMonths
has everything but June and it has “May” and “May2”. Each dataframe has at least one month that the other does not. Try out each of the functions below and note how they handle:
- June, which is in
air
but not inmyMonths
- January-April and October-December, which are in
myMonths
but not inair
- May and May2, both of which are recorded as the fifth month in
myMonths
# rows in x AND y
merge(air, myMonths, by = "Month")
inner_join(air, myMonths, by = "Month")
# rows in x OR y
merge(air, myMonths, by = "Month", all = T)
full_join(air, myMonths, by = "Month")
# rows in x
merge(air, myMonths, by = "Month", all.x = T)
left_join(air, myMonths, by = "Month")
# rows in y
merge(air, myMonths, by = "Month", all.y = T)
right_join(air, myMonths, by = "Month")
dplyr
has three other joining functions not offered by merge()
.
Two of these are semi_join()
and anti_join()
, which merge and subset in one step, do not include columns that exist only in the second dataframe. semi_join()
returns rows from the first dataframe with a match in the second, and anti_join()
returns rows from the first that do not have a match in the second.
These two functions are useful when we work with multiple datasets and we want to quickly limit one by the other. We might have one dataset with individual-level survey data, and another with county-level data, but we do not have county-level data for all individuals. We can quickly find individuals with data in both datasets with semi_join()
. Or, we might have longitudinal data stored in multiple datasets, one for each time point. If we want to analyze individuals who dropped out between time points, we can easily find them with anti_join()
.
Of course, we could also do these in two steps, first using one of the other _join()
functions and then using filter()
to find rows that do or do not have values from the second datasets. (For two steps, first use a _join()
function and then filter(!is.na(var_name))
or filter(is.na(var_name))
. See chapter on Subsetting for using filter()
.)
Using our datasets, the Month_Name
column is not in the output. If any rows in the first dataframe have multiple matches in the second (“5” matches to both “May” and “May2”), only one row is returned in the output, unlike the other _join()
functions above where we have one copy with “May” and another copy with “May2”.
air
has months 5-9 and myMonths
has months 1-5 and 7-12. Which months will be in the output of each line below? Make predictions, and then run each line on your own computer.
semi_join(air, myMonths, by = "Month")
semi_join(myMonths, air, by = "Month")
anti_join(air, myMonths, by = "Month")
anti_join(myMonths, air, by = "Month")
The final merging function offered by dplyr
is nest_join()
, which returns a dataframe with an additional list-column of dataframes. Each value in this new column is a dataframe of all matching rows from the second dataframe.
test <- nest_join(air, myMonths, by = "Month")
test[[7]][[1]] # May - two matches
# A tibble: 2 x 1
Month_Name
<chr>
1 May
2 May2
test[[7]][[32]] # June - no matches
# A tibble: 0 x 1
# ... with 1 variable: Month_Name <chr>
test[[7]][[62]] # July - one match
# A tibble: 1 x 1
Month_Name
<chr>
1 July
The column of dataframes can be unpacked, or “unnested”, with the unnest()
function from the tidyr
package.
The cols
argument needs to be given the name of our list-column, which is the name of the second dataframe in our original nest_join()
call.
If keep_empty = F
(the default), unnest()
will drop rows without matches in the second dataframe, so we will lose June. This makes the output the same as we got with inner_join()
earlier.
If keep_empty = T
, the rows without matches will be retained, and we will end up with the dataframe we got with full_join()
.
library(tidyr)
unnest(test, cols = myMonths)
unnest(test, cols = myMonths, keep_empty = T)
17.2.1 Merging Two Dataframes with Different Key Column Names
Sometimes your key variable will have different names in your dataframes. One option would be to rename one to match the other, as we did above when we created a Month
column in myMonths
. Another choice is to use the by
arguments of each function, as shown below. Change the name of the key variable in myMonths
from Month
to month_number
.
myMonths <-
myMonths %>%
rename(month_number = Month)
merge(air, myMonths, by.x = "Month", by.y = "month_number")
inner_join(air, myMonths, by = c("Month" = "month_number"))
17.3 Appending Rows
Merging combines the columns of two dataframes. To append (add) rows from one or more dataframes to another, use the bind_rows()
function from dplyr
.
This function is especially useful in combining survey responses from different individuals. bind_rows()
will match columns by name, so the dataframes can have different numbers and names of columns and rows.
# fake survey responses
survey001 <- as.data.frame(t(c(1, 1, "Sometimes", 6, 7)))
survey002 <- as.data.frame(t(c(4, 3, 1, 1)))
survey003 <- as.data.frame(t(c(6, 1, 2, 2, 4, "Never")))
# question names, in different orders
colnames(survey001) <- c("Q1", "Q2", "Q2a", "Q3", "Q4")
colnames(survey002) <- c("Q3", "Q4", "Q1", "Q2")
colnames(survey003) <- c("Q3", "Q2", "Q2a", "Q1", "Q4", "Q4a")
bind_rows(survey001, survey002, survey003)
Error: Can't combine `..1$Q1` <character> and `..2$Q1` <double>.
Uh-oh. When combining columns from different dataframes, you generally will not see errors like this, since columns can contain different types. However, each column can contain only one type of data, so rows must be the same data type.
You will commonly get errors when adding rows from one dataframe to another. If you have a manageable number of dataframes, as we do here, you can manually convert the columns before combining. Here, if we run str()
for the three dataframes, we see that all columns are characters in survey001
and survey003
, while all columns in survey002
are numeric.
str(survey001)
'data.frame': 1 obs. of 5 variables:
$ Q1 : chr "1"
$ Q2 : chr "1"
$ Q2a: chr "Sometimes"
$ Q3 : chr "6"
$ Q4 : chr "7"
str(survey002)
'data.frame': 1 obs. of 4 variables:
$ Q3: num 4
$ Q4: num 3
$ Q1: num 1
$ Q2: num 1
str(survey003)
'data.frame': 1 obs. of 6 variables:
$ Q3 : chr "6"
$ Q2 : chr "1"
$ Q2a: chr "2"
$ Q1 : chr "2"
$ Q4 : chr "4"
$ Q4a: chr "Never"
Coerce all columns in survey002
to character, and then try bind_rows()
again.
survey002 <- apply(survey002, 2, as.character)
surveys <- bind_rows(survey001, survey002, survey003)
surveys
Q1 Q2 Q2a Q3 Q4 Q4a
1 1 1 Sometimes 6 7 <NA>
2 1 1 <NA> 4 3 <NA>
3 2 1 2 6 4 Never
Even though the columns of the individual dataframes were in different orders, they were re-ordered to match one another. Wherever a column did not exist in one of the dataframes (see Q2a and Q4a), the column was populated with NA
.
While coercing the columns to characters did allow us to bind the rows of all our dataframes, we probably want some columns to be different types for a later analysis. Right now, we cannot calculate mean(surveys$Q1)
since it is a character vector.
mean(surveys$Q1)
Warning in mean.default(surveys$Q1): argument is not numeric or logical: returning NA
[1] NA
We could go through each column and convert the type (surveys$Q1 <- as.numeric(surveys$Q1)
, etc.), or we could make use of type_convert()
from the readr
package. This function guesses the “true” type of each column and converts it for us, letting us know which type it chose for each column.
library(readr)
surveys <- type_convert(surveys)
-- Column specification ----------------------------------------------------------------------
cols(
Q1 = col_double(),
Q2 = col_double(),
Q2a = col_character(),
Q3 = col_double(),
Q4 = col_double(),
Q4a = col_character()
)
mean(surveys$Q1)
[1] 1.333333
17.4 Merging Exercises
- Merge the
beaver1
andbeaver2
datasets on thetime
column, and include all rows.
- Bonus: Add a column called
warmer
that has a 1 whenever beaver1’s temperature was higher, and a 2 whenever beaver2’s temperature was higher. Ignore missing values. (If beaver1 is missing and beaver2 is not, the new column should say 2.) If the two beavers’ temperatures are equal, make the value ofwarmer
missing.
Append the rows of
beaver2
tobeaver1
. Make sure there is a column that specifies the beaver number (1 or 2) for each observation.Combine
state.abb
,state.division
, andstate.name
into a single dataframe.
- Drop all rows where the state name is more than one word.
- Merge this dataframe with
state.x77
. - Bonus: Add two columns corresponding to the two objects in the list
state.center
.