11 First Steps with Dataframes
The majority of your statistical work will revolve around data sets. For statistical work, a data set is a rectangular (matrix-like) arrangement of measurements (variables, columns) collected on units of observation that are arranged in rows. This structure is crucial to keep the correct measurements (data values) connected with each observation.
The typical data analysis project begins by reading a data set into R from some external file.
In R this structure is called a dataframe. A dataframe is an ordered list of vectors, where the vectors are all the same length. A dataframe also always has column names (variable names) and row names (often just the observation number as a character value). When indexing the elements of a dataframe we always have all three methods available: by position, by name, or by condition.
A variation on the dataframe is the tidyverse tibble class.
Tibbles have their own methods for some generic functions
(like print
). They also have data.frame
as a secondary
class - any tibble can be used as a dataframe.
The following examples use a subset of the American Community Survey (ACS) from 2000. This is the same data set used in the Data Wrangling in Stata curriculum. To follow along, download the dataset here.
When you start working with a data set, especially if it was created by somebody else (that includes past-you!), resist the temptation to start running models immediately. First, take time to understand the data. What information does it contain? What is the structure of the data set? What is the data type of each column? Is there anything strange in the data set? It’s better to find out now, and not when you’re in the middle of modeling!
11.1 Start Your Script
Now, create a new script with File - New File - New R Script or by clicking on the New File icon in the toolbar. Save this script with a sensible name, such as “01_cleaning.R”. We can imagine a series of scripts we might run after this one, such as “02_descriptive_statistics.R”, “03_regression.R”, “04_plots.R”, and so on.
The first few lines of a script should load libraries and read in our data. the dplyr
package includes a wide range of functions for manipulating dataframes that are essential for basic and advanced data wrangling.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
acs <- read.csv("2000_acs_sample.csv")
11.1.1 Piping
dplyr
loads the basic pipe operator %>%
(shortcut: ctrl + shift + m). (Other pipe operators are available in the magrittr
package.) The pipe uses the result of an expression as the first argument of the following expression.
We can write our code without pipes in a nested manner:
sqrt(mean(1:5))
[1] 1.732051
Or we can use pipes to take the 1:5
, give it to mean()
to calculate the average, and then take that average and pass it to sqrt()
to calculate the square root:
1:5 %>% mean() %>% sqrt()
[1] 1.732051
Using pipes makes writing and reading code easier. As we write strings of functions, we are less likely to misplace or miscount parentheses. As we read ours and others’ code, we see the operations in the order they are done. The nested example above starts with sqrt()
, which is the last operation we carry out, while the piped code reveals that the first thing we do is create a vector of one through five.
11.2 Look at the Data
A dataframe consists of rows called observations and columns called variables. The data recorded for an individual observation are stored as values in the corresponding variable.
Variable | Variable | Variable | … | |
---|---|---|---|---|
Observation | Value | Value | Value | |
Observation | Value | Value | Value | |
Observation | Value | Value | Value | |
… |
If you have a dataset already in this format, you are in luck. However, we might run into datasets that need a little work before we can use them. A single row might have multiple observations, or a single variable might be spread across multiple columns. Organizing, or tidying, datasets is the focus of the remainder of this book.
Now that we have the acs
dataset loaded, a first step in looking at our data is checking its dimensions, row names (if it has any), and column names.
nrow(acs)
[1] 28172
row.names(acs)[1:10] # just look at the first 10
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
ncol(acs)
[1] 16
colnames(acs)
[1] "year" "datanum" "serial" "hhwt"
[5] "gq" "us2000c_serialno" "pernum" "perwt"
[9] "us2000c_pnum" "us2000c_sex" "us2000c_age" "us2000c_hispan"
[13] "us2000c_race1" "us2000c_marstat" "us2000c_educ" "us2000c_inctot"
dim(acs) # nrow() and ncol() together
[1] 28172 16
Next, the summary()
function provides simple summary statistics for numeric vectors, and str()
will, in the case of dataframes, tell us the data type and the first few values of each column.
summary(acs)
year datanum serial hhwt gq
Min. :2000 Min. :4 Min. : 37 Min. :100 Length:28172
1st Qu.:2000 1st Qu.:4 1st Qu.: 323671 1st Qu.:100 Class :character
Median :2000 Median :4 Median : 617477 Median :100 Mode :character
Mean :2000 Mean :4 Mean : 624234 Mean :100
3rd Qu.:2000 3rd Qu.:4 3rd Qu.: 937528 3rd Qu.:100
Max. :2000 Max. :4 Max. :1236779 Max. :100
us2000c_serialno pernum perwt us2000c_pnum us2000c_sex
Min. : 92 Min. : 1.000 Min. :100 Min. : 1.000 Min. :1.000
1st Qu.:2395745 1st Qu.: 1.000 1st Qu.:100 1st Qu.: 1.000 1st Qu.:1.000
Median :4905730 Median : 2.000 Median :100 Median : 2.000 Median :2.000
Mean :4951676 Mean : 2.208 Mean :100 Mean : 2.208 Mean :1.512
3rd Qu.:7444248 3rd Qu.: 3.000 3rd Qu.:100 3rd Qu.: 3.000 3rd Qu.:2.000
Max. :9999402 Max. :16.000 Max. :100 Max. :16.000 Max. :2.000
us2000c_age us2000c_hispan us2000c_race1 us2000c_marstat us2000c_educ
Min. : 0.00 Min. : 1.00 Min. :1.000 Min. :1.000 Min. : 0.000
1st Qu.: 17.00 1st Qu.: 1.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 4.000
Median : 35.00 Median : 1.00 Median :1.000 Median :3.000 Median : 9.000
Mean : 35.92 Mean : 1.77 Mean :1.935 Mean :2.973 Mean : 7.871
3rd Qu.: 51.00 3rd Qu.: 1.00 3rd Qu.:1.000 3rd Qu.:5.000 3rd Qu.:11.000
Max. :933.00 Max. :24.00 Max. :9.000 Max. :5.000 Max. :16.000
us2000c_inctot
Length:28172
Class :character
Mode :character
str(acs)
'data.frame': 28172 obs. of 16 variables:
$ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
$ datanum : int 4 4 4 4 4 4 4 4 4 4 ...
$ serial : int 37 37 37 241 242 296 377 418 465 465 ...
$ hhwt : int 100 100 100 100 100 100 100 100 100 100 ...
$ gq : chr "Households under 1970 definition" "Households under 1970 definition" "Households under 1970 definition" "Households under 1970 definition" ...
$ us2000c_serialno: int 365663 365663 365663 2894822 2896802 3608029 4720742 5176658 5856346 5856346 ...
$ pernum : int 1 2 3 1 1 1 1 1 1 2 ...
$ perwt : int 100 100 100 100 100 100 100 100 100 100 ...
$ us2000c_pnum : int 1 2 3 1 1 1 1 1 1 2 ...
$ us2000c_sex : int 2 2 2 2 2 2 2 2 1 2 ...
$ us2000c_age : int 20 19 19 50 29 20 69 59 55 47 ...
$ us2000c_hispan : int 1 1 1 1 1 1 1 1 1 1 ...
$ us2000c_race1 : int 1 1 2 1 1 6 1 1 2 2 ...
$ us2000c_marstat : int 5 5 5 5 5 5 5 2 4 5 ...
$ us2000c_educ : int 11 11 11 14 13 9 1 8 12 1 ...
$ us2000c_inctot : chr "10000" "5300" "4700" "32500" ...
We can learn a few things about the data:
- it seems like year
, hhwt
, and perwt
are always the same values (note that the summary statistics are all a single number)
- several variables, such as us2000c_sex
and us2000c_race1
, are integers but their names suggest categorical variables
- gq
and us2000c_inctot
are character vectors, while all others are numeric
- us2000c_age
has a maximum value of 933, which seems implausibly high
To check that year
and the other variables are always the same, use the table()
function to reveal that the value 2000 occurs 28172 times.
table(acs$year)
2000
28172
Alternatively, the unique()
function in conjunction with the length()
function to find the number of unique values in a vector. One unique value means that every value is identical.
unique(acs$year) # 2000 is the only value
[1] 2000
acs$year %>% unique() %>% length() # 1 unique value
[1] 1
We may choose to drop these columns later on, and this can be done with subset(acs, select = -year)
or select(acs, -year)
(see the chapters on Subsetting).
11.3 Renaming Variables
11.3.1 Rename Individual Columns
The rename()
function in dplyr
allows for easy renaming of individual columns. The pattern is new_name = old_name
. We can change pernum
to “person” and serial
to “household”.
acs <-
acs %>%
rename(person = pernum,
household = serial)
11.3.2 Rename Multiple Columns
Several columns have the prefix “us2000c_”, which is a bit redundant since the data is all from the US and from the year 2000. Instead of renaming them one-by-one, we can rename several columns at once according to a pattern.
A useful function here is sub()
which allows us to substitute one character string for another. In this case, we can substitute “us2000c_” for ""(nothing), effectively deleting the prefix.
sub()
contains three arguments. It looks through the third and replaces the first with the second. Here, go through the column names of acs
and replace the first occurrence of “us2000c_” with "".
colnames(acs)
[1] "year" "datanum" "household" "hhwt"
[5] "gq" "us2000c_serialno" "person" "perwt"
[9] "us2000c_pnum" "us2000c_sex" "us2000c_age" "us2000c_hispan"
[13] "us2000c_race1" "us2000c_marstat" "us2000c_educ" "us2000c_inctot"
colnames(acs) <- sub("us2000c_", "", colnames(acs))
colnames(acs)
[1] "year" "datanum" "household" "hhwt" "gq" "serialno" "person"
[8] "perwt" "pnum" "sex" "age" "hispan" "race1" "marstat"
[15] "educ" "inctot"
11.4 Creating Variables
mutate()
is a variable creation and replacement function. If a new variable name is supplied, a new variable is created. If an existing variable name is supplied, that variable is replaced without any warning.
11.4.0.1 Numeric
One way we can create a numeric variable is by multiplying a single existing column by a constant. Multiplying education, assumed to be in years, by 12 results in education in months (assuming year-round school!). The variable educ_months
does not currently exist in acs
, so a new variable is created.
acs <-
acs %>%
mutate(educ_months = educ * 12)
Variables can also be created from multiple existing columns, through addition, multiplication, averages, minimums, or any other number and combination of functions.
acs <-
acs %>%
mutate(nonsense1 = educ * marstat,
nonsense2 = (educ + marstat) ^ person,
nonsense3 = ifelse(educ %% 36 == 0, marstat, person))
11.4.1 Character
Currently, the identifier is spread out across two variables: household and person. We can put these two together with paste()
so that we have a single variable that uniquely identifies observations.
acs <-
acs %>%
mutate(id = paste(household, person, sep = "_"))
If we are working with multiple datasets which we plan to join together, it might be good to add a note identifying the source of the data. We can supply a single value to mutate()
, and this is recycled down the column. We can also supply a character vector with length greater than one (c("odd", "even")
), one or more numbers to be repeated down a column (1:2
, 2000
), or even today’s date (Sys.Date()
).
Recycling a vector with length greater than one requires the use of rep()
with the argument len = nrow(.)
, where .
is shorthand for the object piped into the current function. In other words, odd_even = rep(c("odd", "even"), len = nrow(.))
means, “Create (or replace) a column called ‘odd_even’ that repeats the vector c("odd", "even")
until it reaches a length equal to the number of rows in our acs
dataframe.”
acs <-
acs %>%
mutate(data_source = "acs",
odd_even = rep(c("odd", "even"), len = nrow(.)),
one_two = rep(1:2, len = nrow(.)),
year_again = 2000,
last_edited = Sys.Date())
11.4.1.1 Categorical
Dummy coded variables can be created from other variables with two or more values. Dummy coded variables have one where the variable has a certain value, and a zero for all other values.
The sex
column is 1s and 2s. Let’s assume these correspond to male and female, respectively. We can create a new column called female
that contains 0 for male and 1 for female. One way we might do this is by simply subtracting 1 from the sex
column, but this only works in cases like this where we have two levels. A slightly longer but more adaptable way is to use a logical test. Check if sex %in% 2
, and then convert this to numeric. Remember: TRUE
is equal to 1, and FALSE
is equal to 0.
acs <-
acs %>%
mutate(female = as.numeric(sex %in% 2))
acs <-
acs %>%
mutate(female = ifelse(sex %in% 2, 1, 0))
We can also recode the gq
column. Let’s look at the values this variable can take.
table(acs$gq)
Additional households under 1990 definition Group quarters--Institutions
71 406
Households under 1970 definition Other group quarters
27339 356
Perhaps we do not need all of these categories. Maybe we only care if the individual lives in a household under the 1970 definition. We can adapt the above code to create a new column:
acs <-
acs %>%
mutate(households1970 = as.numeric(gq %in% "Households under 1970 definition"))
Unlike the discrete variables of sex
and gq
, age
is a continuous variable, but we can adapt the above code to dummy code it. Create a dichotomous variable of whether an individual is an adult (age is 18 or more).
acs <-
acs %>%
mutate(adult = as.numeric(age >= 18))
A multiple-level categorical variable can be created with case_when()
. Each argument within case_when()
is condition ~ value
, where if a condition is met, a value is assigned. Anything not covered by any of the conditions we provide is assigned a value of NA
.
We might want to turn a continuous variable like educ
into distinct categories of less than high school (educ < 12
), high school (educ == 12
), some college (educ > 12 & educ < 16
), and college graduate (educ >= 16
). We can assign these the values 0-3.
acs <-
acs %>%
mutate(educ_categories = case_when(educ < 12 ~ 0,
educ == 12 ~ 1,
educ > 12 & educ < 16 ~ 2,
educ >= 16 ~ 3))
We can imagine using this educ_categories
variable as a predictor in a statistical model or for creating histograms of income by educational attainment.
11.5 Changing Values
Instead of acting at the whole-variable level, we can also use mutate()
to change some values within a variable.
11.5.1 Change Values to Missing
In our exploration of the data, recall that inctot
is a character vector, but the first few values shown by str()
appear to be numbers.
Open the data set with View(acs)
to see why it is a character vector. Some of the values are BBBBBBB
. The Census uses this code for missing data. We can recode the B’s as missing values with the na_if()
function while leaving the other values as they are.
At this step, the column is still a character vector, so we need to convert it into a numeric vector.
acs <-
acs %>%
mutate(inctot = na_if(inctot, "BBBBBBB"))
At this step, the column is still a character vector, so we need to convert it into a numeric vector.
class(acs$inctot)
[1] "character"
acs <-
acs %>%
mutate(inctot = as.numeric(inctot))
Another approach we could take if we knew our missing code in advance is to specify this when reading in the data. If you take this approach, you will need to re-run the above code of renaming columns in order to follow along for the remainder of this chapter.
acs <- read.csv("2000_acs_sample.csv", na.strings = "BBBBBBB")
11.5.1.1 Quantify Missing Data
We should now check how much data is missing from the dataframe.
We can calculate how much data is missing from acs
as a whole. To do so, first use is.na()
as a test of whether the data is missing. This will turn the entire dataframe into TRUE
and FALSE
values, where TRUE
means the data is missing. Then, take the sum or the mean. In doing so, TRUE
and FALSE
will be coerced into 1 and 0, respectively.
acs %>% is.na() %>% sum()
[1] 6157
acs %>% is.na() %>% mean()
[1] 0.007285011
A total of 6157 values are missing, 0.7% of our dataset.
To calculate missingness by individual columns, first turn the dataframe into logical values with is.na()
as above, and then take column sums or means.
acs %>% is.na() %>% colSums()
year datanum household hhwt gq
0 0 0 0 0
serialno person perwt pnum sex
0 0 0 0 0
age hispan race1 marstat educ
0 0 0 0 0
inctot educ_months nonsense1 nonsense2 nonsense3
6157 0 0 0 0
id data_source odd_even one_two year_again
0 0 0 0 0
last_edited female households1970 adult educ_categories
0 0 0 0 0
acs %>% is.na() %>% colMeans()
year datanum household hhwt gq
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
serialno person perwt pnum sex
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
age hispan race1 marstat educ
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
inctot educ_months nonsense1 nonsense2 nonsense3
0.2185503 0.0000000 0.0000000 0.0000000 0.0000000
id data_source odd_even one_two year_again
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
last_edited female households1970 adult educ_categories
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
We now see that all of the missing values are in inctot
, and 21.9% of this variable is missing.
If you work with missing data and imputation, be sure to check out this excellent resource by Stef van Buren on using the mice package.
11.5.2 Change Values to Other Values
Earlier we saw that age
had a maximum value of 933. If we assume this variable is in years, this value seems way too high. Use table()
to see the values age
takes.
table(acs$age)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
371 391 364 423 390 388 376 392 426 450 444 403 461 437 441 396 400 367 375 407 415 362 363
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
412 345 352 376 378 374 397 441 397 373 402 387 440 466 454 462 435 466 473 454 421 431 410
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
440 425 395 363 361 365 357 351 285 279 275 288 266 225 234 226 227 230 236 185 195 187 182
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 92 93
203 175 187 188 186 183 155 150 138 141 121 118 109 109 98 90 74 71 67 54 39 1 149
933
1
Only one observation has a value of 933, and the next highest value is 93. We could take at least three approaches to deal with this number. We might think it should be 93 and whoever entered the data made a typo, we could change this value to missing because we are not certain what the value should have been, or we could drop this case altogether.
(Note: The original ACS dataset did not have this value of 933 for age. The value was intentionally edited from 93 to 933 for this exercise.)
If we take the third approach, we can drop this particular row by taking a subset, either with subset(acs, subset = !age %in% 933)
or filter(acs, !age %in% 933)
(see chapters on Subsetting for more). If we want to make it missing, we can use the na_if()
approach above. If we think, rather, if we know, it should be 93, we can use mutate()
in conjunction with ifelse()
.
The arguments of ifelse()
are condition, value if TRUE, and value if FALSE. The code below checks if a value of age
is 933. If it is, it changes it to 93. If not, it uses the value of that observation from age
.
acs <-
acs %>%
mutate(age = ifelse(age %in% 933, 93, age))
11.6 Save Your Dataframe and Script
Now that we have cleaned up the ACS data set, it is a good idea to end the script by saving the cleaned data set.
write.csv(acs, "2000c_acs_cleaned.csv", row.names = F)
By saving the resulting data set, you can now begin the next script (“02_…”) with read.csv("2000c_acs_cleaned.csv")
. This first script is your record of how you made changes to the raw data. It serves as a record to future-you, to remind you of what you did, and to colleagues and journal reviewers who have questions.
11.7 First Steps Exercises
Start a script that loads
dplyr
and thesleep
dataset.Read the documentation at
help(sleep)
.Examine the data. What type is each column? How are the data distributed? Is any data missing?
Add a new column that says “One” if
group
is 1, and “Two” ifgroup
is 2.Replace
extra
withNA
if it is below zero.Multiply
extra
by 10.Create a new column that pastes together all of the other columns, separating them with
-
.Capitalize the first letter of each column name (change
extra
toExtra
, etc.).Save the dataset as a csv file, and save your script.