Introduction

This page has exercises and summaries that supplement the materials in Data Wrangling with R.

Each section of this page corresponds to a chapter in Data Wrangling with R and has four subsections:

  • Warmup: Exercises that introduce you to some of the concepts in the chapter. Some ask you to use your current skills to solve a problem that is more easily solved with that chapter’s materials, while others illustrate situations where the chapter’s materials become useful.
  • Outcomes: An overview of the objectives, purpose, skills, and functions for each chapter. These are also accessible as a standalone document.
  • Materials: A link to the chapter of Data Wrangling with R.
  • Exercises: Opportunities to practice the essential skills from each chapter. The skills required for the “Fundamental” exercises are necessary for most data wrangling tasks. The “Extended” exercises require skills that are either quicker ways of completing “Fundamental” tasks or less commonly needed in data wrangling. Some exercises match those found at the end of each chapter in the materials, and others are unique to this page.

Data Objects

Warmup

sample(1:10, 1) gives us a random number between 1 and 10.

Run the two blocks of code below in R.

  1. In the first block, after you print x to the console, is x + 1 what you expect?

  2. In the second block, after you print sample(1:10, 1), is sample(1:10, 1) + 1 what you expect?

Run both blocks several more times. Does either of your answers change? Why?

x <- sample(1:10, 1)
x
x + 1

sample(1:10, 1)
sample(1:10, 1) + 1

Outcomes

Objective: To create, modify, and remove data objects.

Why it matters: Almost all of your work in R will involve data objects, everything from importing datasets to creating plots to fitting statistical models. An understanding of basic object operations is foundational for all your work in R.

Learning outcomes:

Fundamental Skills

  • Create and modify objects through assignment.

  • List all objects in the environment.

  • Remove individual objects from the environment.

Extended Skills

  • Apply style guidelines when naming objects.

  • Remove all objects from the environment.

Key functions and operators:

<-
ls()
rm()

Materials

Chapter 1 of Data Wrangling with R

Exercises

Fundamental

  1. Give x the value 3. Then give it the value 5. Print x after each command to check its value.

  2. List all objects in the environment.

  3. Remove x from the environment.

Extended

  1. Make these object names syntactic, consistent, and meaningful. There is no one right answer. You can apply your own style and make decisions about what a name might mean.

    income1
    INCOME2
    income2
    3income
    birth date
    y
    year_of_birth
    state$of$residence
  2. Run this code, which will create 26 objects in your environment. Then, remove all of them.

    for (i in seq_along(letters)) {
      # Create an object named after each letter, holding its position
      assign(letters[i], i)
    }

Data Types

Warmup

Add together values of the same and different types: character + character, character + numeric, etc. For example, try "a" + 1.

  • Character: quoted values, such as "a" or "rstudio"
  • Numeric: numbers with or without decimals, such as 10 or 2.72
  • Logical: TRUE, FALSE, and NA
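
As a quick reference for the three types listed above, here is a minimal sketch that checks one value of each type with typeof(). Note that typeof() reports numbers as "double" unless they are written as integers (for example, 10L); is.numeric() is TRUE in both cases.

```r
# One example value of each type from the list above
typeof("rstudio")  # "character"
typeof(2.72)       # "double": a numeric value with decimals
typeof(10)         # "double": numbers are doubles unless written as 10L
typeof(TRUE)       # "logical"
typeof(NA)         # "logical": a bare NA is logical by default

# is.numeric() is TRUE for both doubles and integers
is.numeric(10)     # TRUE
is.numeric(10L)    # TRUE
```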

Which combinations return errors? Which combinations return expected results? Which combinations return unexpected results?

Outcomes

Objective: To find and modify an object’s type.

Why it matters: Type is a fundamental property of data objects in R, and understanding how R handles various data types will help you identify sources of errors and perform useful operations, such as summarizing indicator variables.

Learning outcomes:

Fundamental Skills

  • Name the three fundamental data types in R.

  • Change an object’s type through coercion.

Extended Skills

  • Understand how information is preserved or lost when moving up or down the hierarchy of data types.

Key functions and operators:

[ ]
as.numeric()
as.character()
as.logical()
typeof()

Materials

Chapter 2 of Data Wrangling with R

Exercises

Fundamental

  1. What is the type of each of these objects?

    a <- mtcars[1, 1]
    b <- letters[5]
    d <- (WorldPhones[4, 3] > 50000)
    e <- names(airquality)[3]
    f <- max(airquality$Day) == 31
    g <- mean(airquality$Temp)
  2. Use the six objects created in exercise 1. Coerce each one to the other two types with as.logical(), as.numeric(), and as.character().

Extended

  1. Review the hierarchy of data types in the details in ?c. Then, revisit exercise 2 above. When is information preserved or lost: when moving up or down the hierarchy?
  • Can you find any values that “survive” coercions up and down the hierarchy of logical-numeric-character?

Data Structures

Warmup

The code below creates three objects, x, y, and z.

x <- y <- 1:16
dim(y) <- c(4, 4)
z <- as.data.frame(y)

Print each one to the console.

  1. Which two objects are most similar to each other?

  2. What is unique about each object?

Outcomes

Objective: To explain the difference between the four basic data structures and the data types they can contain.

Why it matters: Data wrangling, modeling, plotting, and programming all require you to construct, manipulate, and use different data structures.

Learning outcomes:

Fundamental Skills
  • Name how many data types different structures can contain.

  • Create data objects of various structures.

  • View the structure of a data object.

Key functions and operators:

:
matrix()
array()
data.frame()
list()
str()

Materials

Chapter 3 of Data Wrangling with R

Exercises

Fundamental

  1. Structures and data types:

    • What structures are mtcars and chickwts?
    • Use str() to find the type of each column.
    • Convert each one to a matrix with as.matrix().
    • Use str() to find the type of each column. Did anything change? Why?
  2. Creating structures:

    • What structure is letters?
    • Create a two-column matrix from letters.
    • Create a dataframe from letters.
    • Create a list of letters and LETTERS.
  3. Use str() to explore the structure of each object you created in exercise 2.


Data Class

Warmup

Run this code from the previous section on Data Structures.

x <- y <- 1:16
dim(y) <- c(4, 4)
z <- as.data.frame(y)

Use plot() to plot x, y, and z.

What is shown in each plot?

Outcomes

Objective: To check an object’s class and understand how generic functions behave differently with objects of different classes.

Why it matters: Generic functions such as print(), summary(), and plot() produce different output depending on the class of the object they are given, so knowing an object’s class helps you anticipate and interpret that output.
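
To make the idea of a generic function concrete, here is a minimal sketch that calls summary() on two small objects of different classes. The object names nums and grp are made up for illustration.

```r
# Two illustrative objects of different classes
nums <- c(2, 4, 6)
grp <- factor(c("a", "b", "a"))

class(nums)    # "numeric"
class(grp)     # "factor"

# The same generic function returns different kinds of output
summary(nums)  # minimum, quartiles, mean, and maximum
summary(grp)   # counts per level: a = 2, b = 1
```

In both calls, summary() looks at the class of its argument and hands the work to a method written for that class.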

Learning outcomes:

Fundamental Skills
  • Find the class(es) of an object.

  • Use generic functions with objects of different classes.

  • Find the object classes that a function supports.

Key functions:

class()
print()
summary()
plot()
methods()

Materials

Chapter 4 of Data Wrangling with R

Exercises

Fundamental

  1. Run the code below.

    mod <- lm(mpg ~ wt * vs, mtcars)
    mod_summary <- summary(mod)
    • What are the classes of mod and mod_summary?
    • Print each one. Which one prints more useful information?
    • Use plot() and summary() with each object. Explain why the output differs for the two objects.
  2. What object classes does plot() support? (Hint: see ?plot.)


Numeric Vectors

Warmup

What would be the result of adding together these pairs of vectors?

Make predictions before running the code.

1 + 3

2 + c(1, 3, 5)

c(2, 5) + c(1, 3)

c(2, 5) + c(1, 3, 5)

Outcomes

Objective: To create, perform mathematical operations with, and reference elements of numeric vectors.

Why it matters: Numeric vectors are used in everyday R tasks, such as creating or manipulating variables in a dataset, plotting predicted values from a statistical model, or writing loops that repeat lines of code.

Learning outcomes:

Fundamental Skills

  • Create simple numeric vectors.

  • Explain elementwise operations and recycling in the context of vector operations.

  • Reference vector elements by position.

Extended Skills

  • Name the elements of a vector.

  • Reference vector elements by name.

Key functions and operators:

c()
rep()
seq()
:
sample()
runif()
rnorm()
[ ]

Materials

Chapter 5 of Data Wrangling with R

Exercises

Fundamental

  1. Reproduce the following vector with rep():

    [1] 1 1 1 1 1
  2. Reproduce the following vector with seq():

    [1]  0  2  4  6  8 10
  3. Reproduce the following vector in at least two ways:

    [1] 1 3 5 1 3 5
  4. Revisit the warm-up exercises and explain what, if anything, is being recycled in each case.

    1 + 3
    2 + c(1, 3, 5)
    c(2, 5) + c(1, 3)
    c(2, 5) + c(1, 3, 5)
  5. Make a vector with the numbers 1 to 10. What is its type? What is its mean? Replace the first five elements with your name. What is its type now? What is its mean now?

Extended

  1. Make a vector with the numbers 1 to 26. Assign the names A-Z to the elements (see LETTERS).

    • Get the 10th element by its name.
    • Assign the value 50 to the element whose name is “X”.
  2. Make a vector with the numbers 2 to 100, counting by 2s (2, 4, 6, …, 100). Assign the names A-J to elements 1-10, then again to elements 11-20, 21-30, 31-40, and 41-50.


Logical Vectors

Warmup

Run each line of code. What does each line do?

x <- runif(5)
mean(x)
x > mean(x)
mean(x > mean(x))

Outcomes

Objective: To compare vectors with logical operators and use logical-to-numeric coercion for summarizing data.

Why it matters: Logical vectors are used in common data wrangling applications, such as creating new variables, recoding existing variables, and subsetting datasets.

Learning outcomes:

Fundamental Skills Extended Skills
  • Compare two vectors.

  • Combine multiple comparisons with Boolean operators.

  • Coerce logical values to numeric values to summarize data.

  • Index a vector with a logical vector.

Key values and operators:

TRUE
FALSE
NA
>
>=
<
<=
==
!=
&
|
!
%in%
ifelse()

Materials

Chapter 6 of Data Wrangling with R

Exercises

Fundamental

  1. Create a vector, x, with 15 random numbers between 0 and 1 (see runif()). Create a logical vector that indicates whether a value is greater than 0.5.

  2. Find elements of x that are greater than 0.3 and less than 0.6, or that are greater than 0.9.

  3. Create a new vector, y, with 1000 random numbers between 0 and 1. About 50% of values should be less than 0.5. Verify this.

Extended

  1. A typical use of logical comparison is to create an indicator variable. Create an object called high_mpg that indicates whether a given value of mtcars$mpg has a value greater than the mean of mtcars$mpg.

  2. What proportion of values in mtcars$mpg are greater than the mean?

  3. Using high_mpg, create a vector of the weights (mtcars$wt) of cars with high MPG.


Character Vectors

Warmup

You inherited a secondary dataset that has times formatted as HHMM, without the colon (:) separator (e.g., 1234 instead of 12:34). How would you go about converting these numbers into times?

952
956

How about these numbers?

1000
1004

Did your answer differ for the two sets? What would you do if they were all in a single variable?

952
956
1000
1004

Outcomes

Objective: To combine, separate, and substitute values in character vectors.

Why it matters: In your datasets, you may find that a single variable is spread across multiple columns (vectors), or that a single column contains multiple variables. At other times, you may need to clean character values by removing symbols or standardizing capitalization.

Learning outcomes:

Fundamental Skills
  • Combine multiple values into a single character value.

  • Separate a single character value into multiple values.

  • Substitute one value for another value.

Key functions:

paste()
paste0()
substr()
nchar()
sub()
gsub()

Materials

Chapter 7 of Data Wrangling with R

Exercises

Fundamental

  1. Combine the single-letter vectors letters and LETTERS into a two-letter vector that goes “Aa”, “Bb”, etc.

  2. Combine the vectors state.abb and state.name to underscore-separated values: “AL_Alabama”, “AK_Alaska”, etc.

  3. Separate this vector of four-digit years into centuries and two-digit years. For example, “2021” would become “20” and “21”.

    yrs <- c("1993", "1980", "1992", "1997", "2010", "1995")
  4. Separate this vector of prices into dollars and cents. For example, “$12.34” would become “12” and “34”.

    prices <- c("$225.59", "$15.95", "$958.39", "$679.69", "$941.42", "$737.33")

Extended

  1. Currency is sometimes denoted with both a currency symbol and commas. Convert these to numeric values.

    x <- c("$10", "$11.99", "$1,011.01")
  2. Some countries use a comma rather than a period as the decimal separator, and a period as the thousands delimiter. For example, instead of writing one thousand two hundred thirty-four dollars and fifty-six cents as $1,234.56, they may write it as $1.234,56. The currency symbol may also be placed after the amount, such as 20$ rather than $20. Convert these alternative currency expressions into numeric values:

    currency <- c("$1.234,56", "20$", "$12,99", "5.555 $")
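
One thing to keep in mind when cleaning currency strings: sub() and gsub() treat their pattern as a regular expression by default, and both "." and "$" have special meanings there. The sketch below uses made-up values (not the exercise data) to show the difference between the default behavior and literal matching.

```r
# "." is a regular-expression wildcard, so this removes every character
gsub(".", "", "1.5")                # ""

# fixed = TRUE treats the pattern as literal text
gsub(".", "", "1.5", fixed = TRUE)  # "15"

# "$" anchors the end of a string, so the dollar sign is left in place
sub("$", "", "$5")                  # "$5"
sub("$", "", "$5", fixed = TRUE)    # "5"

# Escaping with a double backslash is an alternative to fixed = TRUE
sub("\\$", "", "$5")                # "5"
```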

Date Vectors

Warmup

What would you expect from these date operations?

  1. Today + 1

  2. Today - 1

  3. Today - Tomorrow

  4. Today + Tomorrow

  5. The mean of Yesterday and Today

  6. The standard deviation of Yesterday and Today

Outcomes

Objective: To convert strings into dates and get date components from dates.

Why it matters: Dates can be encoded in a wide variety of formats, so a first step in using dates is converting them into a format that R recognizes. Some research questions involve a specific component of a date, such as weekday versus weekend, or before or after a certain year, so extracting date components is an important skill in working with dates.

Learning outcomes:

Fundamental Skills

  • Convert character vectors to date vectors.

  • Extract components of a date.

Extended Skills

  • Calculate the difference between two dates.

  • Increment or decrement a date.

Key functions:

as.Date()
Sys.Date()
mdy(), ymd(), etc.
year()
month()
day()
wday()
interval()
time_length()

Materials

Chapter 8 of Data Wrangling with R

Exercises

Fundamental

  1. Other software uses other conventions for labeling date values. SAS and Stata both print dates as “10apr2004” by default. Convert the following SAS/Stata dates to R Dates:

     10apr2004
     18jun2005
     21sep2006
     12jan2007
  2. Occasionally you will work with data where the month, day, and year components of dates are stored as separate variables. To convert these to dates, first paste them together. (Recall that, to reference a column in a dataframe, use $, as in df$day.)

     df <- data.frame(day = c(10, 18, 22),
                      month = c(4, 6, 9),
                      year  = c(2004, 2005, 2006))
  3. Using the extract vector of dates below, extract the years, months, days, and days of the week. How many are Wednesdays?

    library(lubridate)  # provides ymd()
    extract <- ymd("2013-06-11", "2015-03-10", "2017-08-13", "2011-05-29", "2010-12-13")
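
The "paste them together" step mentioned in exercise 2 can be sketched as follows. This is only one possible approach, using base R's as.Date() and hypothetical vectors named yr, mo, and dy; lubridate functions such as ymd() are an alternative.

```r
# Hypothetical date parts stored as separate numeric vectors
yr <- c(1999, 2020)
mo <- c(2, 11)
dy <- c(1, 15)

# Step 1: paste the parts into one character value per date
date_strings <- paste(yr, mo, dy, sep = "-")
date_strings   # "1999-2-1" "2020-11-15"

# Step 2: tell as.Date() the order of the parts with a format string
dates <- as.Date(date_strings, format = "%Y-%m-%d")
dates          # "1999-02-01" "2020-11-15"
class(dates)   # "Date"
```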

Extended

  1. Calculate your age in years, months, and days, as of today (use Sys.Date()). Be sure to account for irregular month and year lengths.

  2. Using the last day of this month, add one, two, and three months. If the day does not exist in a month, make it roll back to the last day of the month. Then, make it roll forward to the first day of the next month.


Categorical Vectors

Warmup

Use this line of code to create a categorical variable object, x.

x <- factor(c("a", "c", "c", "c", "d", "b", "c", "b", "a", "d"))

Use the functions str() and plot() to explore the object.

  1. What do you notice about the order of the letters we gave to factor() and the order returned by str() and plot()?

  2. Change the order of the letters in the code. Then use str() and plot() again. What do you notice now?

Outcomes

Objective: To create factor vectors from other vectors and manipulate factors by changing their orders, labels, and levels.

Why it matters: Factors provide a way to include character data in modeling and plotting, and manipulating factors can change things like intercept and interaction terms in models, and axis orders in plots.

Learning outcomes:

Fundamental Skills

  • Create a factor vector from another vector.

  • Change the order of a factor’s levels.

  • Change the labels of a factor’s levels.

  • Reduce the number of levels in a factor by combining them.

Extended Skills

  • Apply a function to change the labels of a factor’s levels.

Key functions:

factor()
levels()
fct_relevel()
fct_recode()
fct_relabel()
fct_collapse()

Materials

Chapter 9 of Data Wrangling with R

Exercises

Fundamental

  1. Create a random sample (with sample()) of state names (state.name) of size 1000. Convert it to a factor, and then make a table() of counts by state. Which is most common in your sample?

  2. Releveling: Using the iris dataset, plot counts by factor level with plot(iris$Species). Now, relevel Species so that versicolor is the reference (first) category. Plot it again. What do you notice?

  3. Recoding: In the mtcars data, all the variables are numeric. Convert vs to a factor, where 0 has the label “V-shaped” and 1 has the label “Straight”.

  4. Collapsing: mtcars$cyl has three different values: 4, 6, and 8. Convert it into a two-level factor, where 4 and 6 share the label “Few” and 8 has the label “Many”.

Extended

  1. Create a vector of 100 random numbers from 1 to 10. Convert it to a factor. Then, rename its levels to start with “id” and end with “x”, like “id1x”, “id2x”, etc.

Reading Text Data

Warmup

What kinds of datasets have you used in R or other statistical software?

  • Built-in datasets, such as mtcars in R or auto in Stata
  • Text files (.txt)
  • Comma-separated values files (.csv)
  • Excel workbooks (.xlsx)
  • Statistical software binary data files (.rdata, .rds, .dta, .sas7bdat, .sav, etc.)
  • Other formats?

Outcomes

Objective: To import text data into R.

Why it matters: The first step in most research projects is importing a dataset into R. Data files vary in the data they include and how it is organized, so it is necessary for you to have strategies to import data in various formats.

Learning outcomes:

Fundamental Skills Extended Skills
  • Specify relative paths when importing datasets.

  • Read in datasets with comma- or space-separated values.

  • Read in datasets from Excel or Stata.

  • Save datasets as CSV and RDS files.

  • Read in datasets with missing values.

  • Read in datasets in fixed-width format.

Key functions:

read.csv()
write.csv()
saveRDS()
readRDS()
getwd()
setwd()
list.files()

Materials

Chapter 10 of Data Wrangling with R

Files for in-chapter exercises

Exercises

Fundamental

  1. Run this code to save two files to your working directory. Then, read them back into R.

    write.table(ChickWeight, "chick1.txt", sep = ",", row.names = FALSE)
    write.table(ChickWeight, "chick2.txt", row.names = FALSE)
  2. Save airquality as a CSV file.

  3. Save chickwts as an RDS file.

  4. Print your working directory to the console.

  5. Create a folder called “read_practice” on your computer and move the files you saved in exercises 2 and 3 into this folder. Place this folder either in your working directory or as a sibling folder to your working directory. Without changing your working directory, read them into R.
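
A relative path describes a file's location starting from the working directory. The sketch below illustrates the idea in a temporary directory so that none of your own files are touched; the folder and file names (demo_folder, women_demo.csv) are made up for this example.

```r
# Work in a temporary directory so your own files are not touched
old_wd <- setwd(tempdir())

# Create a child folder and save a small CSV inside it
dir.create("demo_folder", showWarnings = FALSE)
write.csv(head(women), file.path("demo_folder", "women_demo.csv"),
          row.names = FALSE)

# The path below is relative: it starts from the working directory
women_demo <- read.csv(file.path("demo_folder", "women_demo.csv"))
women_demo

# ".." means "one level up", which is how a sibling folder is reached:
# file.path("..", "some_sibling_folder", "some_file.csv")

# Return to the original working directory
setwd(old_wd)
```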

Extended

  1. Run this code, which creates a CSV in your working directory called “class_scores.csv”, where a value of -99 means missing. Read it back into R and ensure that the missing data is coded as NA.

    data.frame(id = sample(1:10, 100, replace = TRUE),
               score = sample(c(90:100, -99), 100, replace = TRUE)) |>
      write.csv("class_scores.csv", row.names = FALSE)
  2. A fixed-width version of mtcars is available at https://sscc.wisc.edu/sscc/pubs/data/dwr/mtcars_fwf.txt. Read it into R, following the information in the codebook at https://sscc.wisc.edu/sscc/pubs/data/dwr/mtcars_fwf_codebook.txt. Compare it to mtcars to ensure it matches.


First Steps with Dataframes

Warmup

Once you get a dataset, what are the first things you do with it? This could be data you or your lab collected, or a secondary dataset you downloaded.

Outcomes

Objective: To examine and modify datasets’ variables and values.

Why it matters: Secondary datasets may not have all the variables you need for your research question, the values they contain usually need to be modified, and the variable names should be changed for ease of use.

Learning outcomes:

Fundamental Skills
  • Create and save scripts and datasets for reproducibility.

  • Examine a dataset’s metadata and univariate summary statistics.

  • Rename individual variables.

  • Rename multiple variables.

  • Create new variables.

  • Change values to other values.

Key functions and operators:

|>
nrow()
ncol()
rownames()
colnames()
summary()
str()
rename()
mutate()
ifelse()

Materials

Chapter 11 of Data Wrangling with R

Exercises

Fundamental

  1. Start a script that loads dplyr and the sleep dataset.

    • If at any point you overwrite a built-in R dataset and want to retrieve the original, run rm(objectname), where objectname is the name of the data object. For example, if you run sleep <- 1 by accident, run rm(sleep) to delete the object from your environment so R can find the original again.

  2. Read the documentation at help(sleep).

  3. Examine the data. What type is each column? How are the data distributed? Is any data missing?

  4. Add a new column that says “One” if group is 1, and “Two” if group is 2.

  5. Replace extra with NA if it is below zero.

  6. Multiply extra by 60 so that it is minutes rather than hours.

  7. Change the name of extra to Extra_Minutes.

  8. Make all variable names lowercase.

  9. Change extra_minutes to missing if id is 7.

  10. Save the dataset as an RDS file, and save your script.
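
The note in exercise 1 about recovering an overwritten built-in dataset can be illustrated with a short sketch. It assumes you are working interactively in a fresh session; the original dataset lives in the datasets package and is never modified.

```r
# Accidentally create an object with the same name as a built-in dataset
sleep <- 1
sleep            # 1: your object masks the built-in dataset

# Remove your object; R then finds the original in the datasets package
rm(sleep)
head(sleep, 3)   # the built-in dataframe is visible again

# The original can also be reached directly at any time
head(datasets::sleep, 3)
```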


Subsetting Dataframes

Warmup

Imagine you have a dataset with these column names:

id race edu a11 b11 c11 a12 b12 c12 a13 b13 c13 a14 b14 c14 income2022 income2024

You realize you only need some of these for your research questions. To systematically select these columns, we need to think of “rules” that would return the column names we want.

For example, if we wanted a11 a12 a13 a14, our rule would be “starts with ‘a’.”
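
As a rough illustration, the “starts with ‘a’” rule can be written in base R as a test applied to every column name. The vector cols below simply holds the names from the example. On a dataframe, select() with starts_with() expresses the same rule.

```r
# Column names from the example above
cols <- c("id", "race", "edu",
          "a11", "b11", "c11", "a12", "b12", "c12",
          "a13", "b13", "c13", "a14", "b14", "c14",
          "income2022", "income2024")

# The rule is a TRUE/FALSE test for each name
startsWith(cols, "a")

# Keeping only the names that pass the test
a_cols <- cols[startsWith(cols, "a")]
a_cols   # "a11" "a12" "a13" "a14"
```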

What rules would return these sets of columns?

  1. a11 b11 c11 a12 b12 c12

  2. id race edu

  3. a12 b12 c12 a14 b14 c14 income2022 income2024

Outcomes

Objective: To systematically subset datasets by selecting columns and rows.

Why it matters: When working with secondary datasets that have hundreds of variables and millions of observations, this skill is critical for creating a manageable dataset suited to your research question. Subsetting is also useful in the analysis of any dataset in order to exclude certain cases or to perform subgroup analyses.

Learning outcomes:

Fundamental Skills

  • Select columns by name or by pattern.

  • Select rows with simple criteria.

Extended Skills

  • Select columns by range or type.

Key functions and operators:

select()
starts_with()
ends_with()
contains()
filter()

Materials

Chapter 16 of Data Wrangling with R

Exercises

Fundamental

Use the state.x77 dataset for 1 and 2. You will need to first convert it into a dataframe with as.data.frame().
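
The conversion step might look like the sketch below. The object name states is arbitrary, and the class() output shown for the matrix assumes R 4.0 or later.

```r
# state.x77 is stored as a matrix, not a dataframe
class(state.x77)   # "matrix" "array" in R 4.0 and later

# Convert it; the state names are kept as row names
states <- as.data.frame(state.x77)
class(states)      # "data.frame"
dim(states)        # 50 rows, 8 columns

# Some column names contain spaces, so they are not syntactic
names(states)
```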

  1. Drop the Population column.

  2. Drop columns with spaces in their names.

Use the airquality dataset for 3 and 4.

  3. Drop rows from September.

  4. Drop rows where Ozone or Solar.R are missing.

Extended

  1. Select columns of type factor from ToothGrowth, and then columns of type numeric. Repeat with warpbreaks.

  2. Run the code below to create a sample dataset. Then, select all columns whose names start with a letter of your choice and end with a year between 2010 and 2015.

    dat <- 
      matrix(rnorm(1e5), ncol = 500) |> 
      as.data.frame()
    
    colnames(dat) <-
      paste0(rep(letters, each = 22),
             rep(2000:2021, times = 26)) |> 
      sample(500)
  3. Using the dataset from (2), drop rows where a column of your choice only has values in the range [-3, 3].


Merging Dataframes

Warmup

Using Excel (or Google Sheets, Word, a text editor, a piece of actual paper, etc.), combine these three tables together into one table.

id year income
32 2000 42000
32 2001 43000
32 2002 49000
32 2003 50000
38 2002 36000
38 2003 36000
39 2001 18000
39 2002 18500
42 2000 76000

year income_adj
2000 1.53
2001 1.47
2002 1.46
2003 1.42

person_id state
30 MN
32 WI
38 IA
39 WI

Outcomes

Objective: To combine two or more datasets into a single dataset.

Why it matters: Data for your research question may be spread across multiple datasets, and it is necessary to first consolidate all the data in order to model and visualize it.

Learning outcomes:

Fundamental Skills

  • Merge two datasets on one or more key columns.

  • Append two or more datasets.

Extended Skills

  • Merge two datasets with different key column names.

  • Append two or more datasets with columns of different types.

Key functions and operators:

full_join()
inner_join()
left_join()
right_join()
bind_rows()

Materials

Chapter 17 of Data Wrangling with R

Exercises

Fundamental

  1. Use the code below to make two subsets of mtcars. Then merge them together, and include all rows. You should have 32 rows and 9 columns in the output.

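    library(dplyr)  # provides mutate(), filter(), and select()
    # Keep the car names as an id column, then make two overlapping subsets
    mtcars <- mtcars |> mutate(id = row.names(mtcars))
    mtcars1 <- mtcars |> filter(mpg >= 21) |> select(id, cyl:wt)
    mtcars2 <- mtcars |> filter(mpg < 25) |> select(id, drat, vs, am, gear)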
  2. Append the rows of beaver2 to beaver1. Make sure there is a column that specifies the beaver number (1 or 2) for each observation.

Extended

  1. Read in the three datasets shown in the warm-up exercises, and merge them into a single dataset.

    merge1 <- read.csv("https://sscc.wisc.edu/sscc/pubs/data/dwr/merge1.csv")
    merge2 <- read.csv("https://sscc.wisc.edu/sscc/pubs/data/dwr/merge2.csv")
    merge3 <- read.csv("https://sscc.wisc.edu/sscc/pubs/data/dwr/merge3.csv")
  2. Append dat2 to dat1.

    dat1 <- 
      data.frame(id = letters[1:13],
                 score = rnorm(13, 100, 15))
    
    dat2 <- 
      data.frame(id = letters[14:26],
                 score = c(rnorm(12, 100, 15), "missing"))

Aggregating Dataframes

Warmup

Four datasets are presented below. Dataset 1 was used to create Datasets 2-4. Discuss the relationships between the four datasets:

  1. How do the four differ in terms of the numbers of their rows and columns?
  2. How were the columns mean1 and mean2 calculated?
  3. Which columns have repeated values?

Dataset 1:

id wave value
a 1 5
a 2 4
a 3 9
b 1 3
b 2 10
b 3 8

Dataset 2:

id wave value mean1 mean2
a 1 5 6 4
a 2 4 6 7
a 3 9 6 8.5
b 1 3 7 4
b 2 10 7 7
b 3 8 7 8.5

Dataset 3:

id mean1
a 6
b 7

Dataset 4:

wave mean2
1 4
2 7
3 8.5

Outcomes

Objective: To produce summary statistics along one or more grouping variables.

Why it matters: Much of social science data is inherently multilevel: individuals in families, counties in states, gross domestic product within countries across years, and so on. We need to create variables like the number of individuals in a family, or the mean vote share by county for some political party.

Learning outcomes:

Fundamental Skills
  • Summarize variables of interest across one or more grouping variables.

  • Add group-level information to a dataframe without removing rows.

  • Find and remove a dataframe’s grouping variables.

Key functions and operators:

group_by()
summarize()
ungroup()
group_vars()

Materials

Chapter 18 of Data Wrangling with R

Exercises

Fundamental

Use the chickwts dataset.

  1. Is this a balanced experiment (same number of individuals/chickens in each condition/feed)?

  2. Which feed was associated with the largest variation (standard deviation) in weight?

  3. Without reducing the number of rows, add a column with the range (maximum - minimum) of weight for each feed.

Extended

Run this code.

library(dplyr)  # provides group_by() and summarize()

dat <-
  mtcars |>
  group_by(gear, cyl, vs) |>
  summarize(mpg_avg = mean(mpg))
  1. What are the grouping variables of dat?

  2. Add two new columns, vs_count1 and vs_count2:

    • Run this code to add a column vs_count1 with the number of unique values of vs.

      dat <- mutate(dat, vs_count1 = length(unique(vs)))
    • Remove the grouping variables, and then create the column vs_count2 using the code below.

      dat <- mutate(dat, vs_count2 = length(unique(vs)))
  3. Why do vs_count1 and vs_count2 differ?


Reshaping Dataframes

Warmup

Load these two datasets into R:

long <- read.csv("https://www.sscc.wisc.edu/sscc/pubs/data/dwr/reshape_exercise_long.csv")
wide <- read.csv("https://www.sscc.wisc.edu/sscc/pubs/data/dwr/reshape_exercise_wide.csv")

long and wide have the same data but are organized differently.

Work with one or two partners. At least one of you should attempt #1 with both long and wide, and at least one of you should attempt #2 with both long and wide. If you get stuck, try the task with the other dataset. Then discuss with your partner(s): was your task easier with one of the two datasets? Which one?

  1. Calculate each ID’s average across years, resulting in this output:
 id   avg
  1 13.75
  2  6.25
  3  7.25
  4  9.50
  2. Add a variable with each ID’s sum of values from 2000 and 2001.

Outcomes

Objective: To change the shape of data, from long to wide or from wide to long.

Why it matters: Different data wrangling operations and analyses are easier done in either long or wide format, so reshaping data is important in preparing and modeling data.

Learning outcomes:

Fundamental Skills

  • Determine the shape of a dataframe.

  • Reshape a dataframe from long to wide.

  • Reshape a dataframe from wide to long.

Extended Skills

  • Customize column names when reshaping from long to wide.

  • Drop missing data when reshaping from wide to long.

Key functions:

pivot_wider()
pivot_longer()

Materials

Chapter 19 of Data Wrangling with R

Exercises

Fundamental

  1. Convert the WorldPhones dataset into a dataframe, and add a column called “Year” from its row names. (Try something like WorldPhones |> as.data.frame() |> mutate(Year = row.names(WorldPhones)).) Then reshape it into a long format.

  2. Reshape ChickWeight into a wide format with columns created from Time.

Extended

  1. Reshape sleep to wide with columns created from group and values from extra. Make sure that the new column names are syntactic.

  2. Run this code to create sample data. Reshape dat to long, and drop rows where the value is missing.

    library(dplyr)  # provides mutate()
    dat <-
      matrix(sample(c(1:10, NA), 100, replace = TRUE),
             nrow = 10) |>
      as.data.frame() |>
      mutate(id = letters[1:10])

Finale

Exercises

These two exercises require skills from multiple chapters, and they are intended to (1) be challenging and (2) resemble real-life data wrangling tasks.

  1. How many observations in airquality have a value of Temp at least two standard deviations above the mean value of Temp, where the mean and SD are calculated separately for each Month?

  2. Reshape us_rent_income (from the tidyr package) so that it has one line per state, and two new columns named estimate_income and estimate_rent that contain values from estimate. Add a column with the proportion of income spent on rent (12 * rent / income). Merge it with the dataframe created by this code: data.frame(state.name, state.abb, state.division). Which division has the lowest average rent/income ratio? (Do not weight states when averaging by division.)