16 Subsetting
Subsetting involves making a dataset smaller by selecting a portion of the data that we are interested in.
We can take subsets of a dataset either by columns or by rows. For example, we might have a dataset with extra columns we do not care about, or we might want to look at rows where a certain condition is true (such as age greater than or equal to 18).
16.1 Extractors
We have previously seen that we can index vectors by position, name, or
condition to extract and replace elements. Now we extend this to
data frames. In addition to the square bracket extractors, [ ]
, we
also have two “list” extractors, the dollar sign , $
, and the
doubled square brackets, [[ ]]
. Here it is relevant that a data frame
is a special kind of list.
The square brackets,
[ ]
, allow us to index both the rows and the columns of a data frame. We can index by position, name, or condition.When used with data frames (or matrices), we index by two positions, rows and columns. The square brackets should contain two objects, a vector indexing the rows and a vector indexing the columns. To select all the elements along one dimension, omit that vector but include the comma.
Try these examples:
mtcars["Valiant", "mpg"] # a single element mtcars[ , c("mpg", "wt")] # all rows, two columns mtcars[1:6, ] # rows 1-6, all columns mtcars[ , ] # everything
The next example illustrates indexing rows by position and columns by name.
mtcars[1:6, c("mpg", "wt")] # [rows, columns]
mpg wt Mazda RX4 21.0 2.620 Mazda RX4 Wag 21.0 2.875 Datsun 710 22.8 2.320 Hornet 4 Drive 21.4 3.215 Hornet Sportabout 18.7 3.440 Valiant 18.1 3.460
The dollar sign extractor,
$
, allows us to index a single named element of a list. In a data frame, that would be a column, a single variable. (We’ll use thehead
function to show just the first few values.)head( mtcars$mpg )
[1] 21.0 21.0 22.8 21.4 18.7 18.1
The double brackets,
[[ ]]
, allow us to index a single element of a list by either name or position. This can be useful for programming repeated tasks.head( mtcars[[3]] ) # by position, displacement
[1] 160 160 108 258 360 225
head( mtcars[["disp"]] ) # by name
[1] 160 160 108 258 360 225
In addition to these extractors, we also have a
subset
function that provides a very useful variation on the square bracket extractor.subset(mtcars, subset=carb >=6, # rows select=c("mpg", "wt")) # columns
mpg wt Ferrari Dino 19.7 2.77 Maserati Bora 15.0 3.57
16.2 Subsetting by Columns
Now, load the dplyr
package, which contains the functions we need to subset.
library(dplyr)
The select()
function from dplyr
gives us a number of ways to take a subset of columns. Four ways to select columns are covered below: By Name, By Pattern, By Range, and By Type. For a full list of the ways to select a subset of columns, run ?tidyselect::language
.
16.2.1 By Name
The simplest and most straightforward way to select columns is to pass their names to select()
. The output of select()
will be our original dataframe, but it will only include the columns we name, and they will be in the specified order. In the mtcars
dataset, cyl
is before hp
, but if we specify hp
before cyl
in select()
, they will appear in that order.
(Note that the code below uses head()
to only show the first six rows for space reasons.)
colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
mtcars %>%
select(mpg, hp, cyl) %>%
head()
mpg hp cyl
Mazda RX4 21.0 110 6
Mazda RX4 Wag 21.0 110 6
Datsun 710 22.8 93 4
Hornet 4 Drive 21.4 110 6
Hornet Sportabout 18.7 175 8
Valiant 18.1 105 6
As it was used above, select()
only keeps the columns we name. We can also have it only drop the columns we name by using a -
in front of the column name.
mtcars %>%
select(-disp, -drat, -wt, -qsec, -vs, -am, -gear, -carb) %>%
head()
mpg cyl hp
Mazda RX4 21.0 6 110
Mazda RX4 Wag 21.0 6 110
Datsun 710 22.8 4 93
Hornet 4 Drive 21.4 6 110
Hornet Sportabout 18.7 8 175
Valiant 18.1 6 105
In some cases, we might want to select a certain subset of columns multiple times, and we do not want to have to retype them every time. In that case, we can pass a character vector to select()
. We can also add another column, hp
, without adding it to our list when running the command.
myCols <- c("mpg", "vs", "carb")
mtcars %>%
select(myCols, hp) %>%
head()
Note: Using an external vector in selections is ambiguous.
i Use `all_of(myCols)` instead of `myCols` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
mpg vs carb hp
Mazda RX4 21.0 0 4 110
Mazda RX4 Wag 21.0 0 4 110
Datsun 710 22.8 1 1 93
Hornet 4 Drive 21.4 1 1 110
Hornet Sportabout 18.7 0 2 175
Valiant 18.1 1 1 105
We can also simply reorder the columns by giving select()
the columns we would like first and then everything()
.
mtcars %>%
select(myCols, hp, everything()) %>%
head()
mpg vs carb hp cyl disp drat wt qsec am gear
Mazda RX4 21.0 0 4 110 6 160 3.90 2.620 16.46 1 4
Mazda RX4 Wag 21.0 0 4 110 6 160 3.90 2.875 17.02 1 4
Datsun 710 22.8 1 1 93 4 108 3.85 2.320 18.61 1 4
Hornet 4 Drive 21.4 1 1 110 6 258 3.08 3.215 19.44 0 3
Hornet Sportabout 18.7 0 2 175 8 360 3.15 3.440 17.02 0 3
Valiant 18.1 1 1 105 6 225 2.76 3.460 20.22 0 3
select()
includes the functionality of rename()
, so we can do in one line (second example) what normally takes two (first example). The pattern is new_name = old_name
.
mtcars %>%
rename(horsepower = hp) %>%
select(horsepower, mpg) %>%
head()
horsepower mpg
Mazda RX4 110 21.0
Mazda RX4 Wag 110 21.0
Datsun 710 93 22.8
Hornet 4 Drive 110 21.4
Hornet Sportabout 175 18.7
Valiant 105 18.1
mtcars %>%
select(horsepower = hp, mpg) %>%
head()
horsepower mpg
Mazda RX4 110 21.0
Mazda RX4 Wag 110 21.0
Datsun 710 93 22.8
Hornet 4 Drive 110 21.4
Hornet Sportabout 175 18.7
Valiant 105 18.1
16.2.2 By Pattern
Columns can be selected by a pattern, such as the letters they start with or contain. To do this, we need to use other functions inside of select()
.
starts_with()
returns columns that start with a character string. Both “disp” and “drat” start with “d”.
mtcars %>%
select(starts_with("d")) %>%
head()
disp drat
Mazda RX4 160 3.90
Mazda RX4 Wag 160 3.90
Datsun 710 108 3.85
Hornet 4 Drive 258 3.08
Hornet Sportabout 360 3.15
Valiant 225 2.76
ends_with()
does the opposite, and looks for a character string at the end of the column names. “disp” and “hp” end with “p”.
mtcars %>%
select(ends_with("p")) %>%
head()
disp hp
Mazda RX4 160 110
Mazda RX4 Wag 160 110
Datsun 710 108 93
Hornet 4 Drive 258 110
Hornet Sportabout 360 175
Valiant 225 105
contains()
is more general than both starts_with()
and ends_with()
since it looks for a character string anywhere in the column name. “ar” is found in both “gear” and “carb”.
mtcars %>%
select(contains("ar")) %>%
head()
gear carb
Mazda RX4 4 4
Mazda RX4 Wag 4 4
Datsun 710 4 1
Hornet 4 Drive 3 1
Hornet Sportabout 3 2
Valiant 3 1
The functions above all use literal character strings. dplyr
also allows for the use of regular expressions with the matches()
function. The code below, "^[a-g].a"
, looks for a string that starts with (^
) the letters a-g ([a-g]
), then has any character (.
), and then the letter a (a
). Both “drat” and “gear” fulfill these matching criteria.
mtcars %>%
select(matches("^[a-g].a"))
drat gear
Mazda RX4 3.90 4
Mazda RX4 Wag 3.90 4
Datsun 710 3.85 4
Hornet 4 Drive 3.08 3
Hornet Sportabout 3.15 3
Valiant 2.76 3
Duster 360 3.21 3
Merc 240D 3.69 4
Merc 230 3.92 4
Merc 280 3.92 4
Merc 280C 3.92 4
Merc 450SE 3.07 3
Merc 450SL 3.07 3
Merc 450SLC 3.07 3
Cadillac Fleetwood 2.93 3
Lincoln Continental 3.00 3
Chrysler Imperial 3.23 3
Fiat 128 4.08 4
Honda Civic 4.93 4
Toyota Corolla 4.22 4
Toyota Corona 3.70 3
Dodge Challenger 2.76 3
AMC Javelin 3.15 3
Camaro Z28 3.73 3
Pontiac Firebird 3.08 3
Fiat X1-9 4.08 4
Porsche 914-2 4.43 5
Lotus Europa 3.77 5
Ford Pantera L 4.22 5
Ferrari Dino 3.62 5
Maserati Bora 3.54 5
Volvo 142E 4.11 4
For more on regular expressions, see the chapter on Character Vectors or RStudio’s Regular Expressions cheatsheet.
16.2.3 By Range
The :
operator can be used to select a range of variables, either by position or by number. Be careful to check that your columns are in the order you think they are!
mtcars %>%
select(2:4) %>%
head()
cyl disp hp
Mazda RX4 6 160 110
Mazda RX4 Wag 6 160 110
Datsun 710 4 108 93
Hornet 4 Drive 6 258 110
Hornet Sportabout 8 360 175
Valiant 6 225 105
mtcars %>%
select(cyl:hp) %>%
head()
cyl disp hp
Mazda RX4 6 160 110
Mazda RX4 Wag 6 160 110
Datsun 710 4 108 93
Hornet 4 Drive 6 258 110
Hornet Sportabout 8 360 175
Valiant 6 225 105
If we have multiple columns with a common prefix and a numeric suffix, we can select a range with num_range()
. First, let’s change the column names of mtcars
to create a fictional dataset with individuals’ incomes in different years, where each row is an individual and each column is their reported income for a different year.
mtcars_income <- mtcars
colnames(mtcars_income) <- paste0("income", 1960:1970)
colnames(mtcars_income)
[1] "income1960" "income1961" "income1962" "income1963" "income1964" "income1965"
[7] "income1966" "income1967" "income1968" "income1969" "income1970"
mtcars_income %>%
select(num_range("income", 1962:1964)) %>%
head()
income1962 income1963 income1964
Mazda RX4 160 110 3.90
Mazda RX4 Wag 160 110 3.90
Datsun 710 108 93 3.85
Hornet 4 Drive 258 110 3.08
Hornet Sportabout 360 175 3.15
Valiant 225 105 2.76
mtcars_income %>%
select(num_range("income", seq(1960, 1970, 2))) %>%
head()
income1960 income1962 income1964 income1966 income1968 income1970
Mazda RX4 21.0 160 3.90 16.46 1 4
Mazda RX4 Wag 21.0 160 3.90 17.02 1 4
Datsun 710 22.8 108 3.85 18.61 1 1
Hornet 4 Drive 21.4 258 3.08 19.44 0 1
Hornet Sportabout 18.7 360 3.15 17.02 0 2
Valiant 18.1 225 2.76 20.22 0 1
To use num_range()
, give it the prefix ("income"
) and a numeric vector with the desired numeric suffixes. The first example specifies the range 1962:1964
, giving us 1962, 1963, and 1964. The second example uses seq(1960, 1970, 2)
, so we will get every two years from 1960 to 1970.
mtcars_income %>%
select(num_range("income", 1962:1964)) %>%
head()
income1962 income1963 income1964
Mazda RX4 160 110 3.90
Mazda RX4 Wag 160 110 3.90
Datsun 710 108 93 3.85
Hornet 4 Drive 258 110 3.08
Hornet Sportabout 360 175 3.15
Valiant 225 105 2.76
mtcars_income %>%
select(num_range("income", seq(1960, 1970, 2))) %>%
head()
income1960 income1962 income1964 income1966 income1968 income1970
Mazda RX4 21.0 160 3.90 16.46 1 4
Mazda RX4 Wag 21.0 160 3.90 17.02 1 4
Datsun 710 22.8 108 3.85 18.61 1 1
Hornet 4 Drive 21.4 258 3.08 19.44 0 1
Hornet Sportabout 18.7 360 3.15 17.02 0 2
Valiant 18.1 225 2.76 20.22 0 1
If the numbers are out of order in the dataframe, this is not an issue, unlike with :
. The examples below first select the columns income1962
, income1964
, and income1963
, in that order. The second example with :
fails to select income1963
because it selects columns by position, not by name.
mtcars_income %>%
select(income1962, income1964, income1963) %>%
select(num_range("income", 1962:1964)) %>%
head()
income1962 income1963 income1964
Mazda RX4 160 110 3.90
Mazda RX4 Wag 160 110 3.90
Datsun 710 108 93 3.85
Hornet 4 Drive 258 110 3.08
Hornet Sportabout 360 175 3.15
Valiant 225 105 2.76
mtcars_income %>%
select(income1962, income1964, income1963) %>%
select(income1962:income1964) %>%
head()
income1962 income1964
Mazda RX4 160 3.90
Mazda RX4 Wag 160 3.90
Datsun 710 108 3.85
Hornet 4 Drive 258 3.08
Hornet Sportabout 360 3.15
Valiant 225 2.76
It is also okay if some columns are not in the dataset. Of course, select()
will not return these non-existent columns, but it will also not have an error. To see this, we can drop income1963
.
mtcars_income %>%
select(-income1963) %>%
select(num_range("income", 1962:1964)) %>%
head()
income1962 income1964
Mazda RX4 160 3.90
Mazda RX4 Wag 160 3.90
Datsun 710 108 3.85
Hornet 4 Drive 258 3.08
Hornet Sportabout 360 3.15
Valiant 225 2.76
16.2.4 By Type
which()
can help us select columns that are of a certain type, such as numeric or character. chickwts
has one numeric column and one character column.
chickwts %>%
head()
weight feed
1 179 horsebean
2 160 horsebean
3 136 horsebean
4 227 horsebean
5 217 horsebean
6 168 horsebean
chickwts %>%
select(where(is.numeric)) %>%
head()
weight
1 179
2 160
3 136
4 227
5 217
6 168
16.2.5 Multiple Criteria
For more complex tasks, multiple criteria for selecting columns can be combined with logical operators (&
, |
, !
).
mtcars %>%
select((starts_with("d") & !where(is.character)) | contains("ar"), mpg) %>%
head()
disp drat gear carb mpg
Mazda RX4 160 3.90 4 4 21.0
Mazda RX4 Wag 160 3.90 4 4 21.0
Datsun 710 108 3.85 4 1 22.8
Hornet 4 Drive 258 3.08 3 1 21.4
Hornet Sportabout 360 3.15 3 2 18.7
Valiant 225 2.76 3 1 18.1
16.3 Subsetting by Rows
To select a subset of rows from a dataframe, we can provide one or more conditions to the filter()
function.
If we want to see rows from mtcars
with mpg
greater than 25, we can do this as follows:
mtcars %>%
filter(mpg > 25)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Multiple conditions can be given to filter()
, and these are separated by commas. If we give multiple conditions to filter()
, it will only return rows that meet all of our conditions. In other words, we can think of each ,
like &
, and the rows in the output are those for which all conditions are TRUE
. Note how the two blocks of code below return the same output.
mtcars %>%
filter(mpg > 25,
hp < 100,
vs == 1)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
mtcars %>%
filter(mpg > 25 &
hp < 100 &
vs == 1)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Because filter()
runs logical tests on the rows of our dataset, we should make use of R’s logical operators (&
, |
, !
), relational operators (<
, >
, <=
, >=
, ==
, !=
), and other functions that return logical values (%in%
, is.na()
, etc.). If it has been a while, or if you jumped to this chapter, you will find the chapter on Logical Vectors helpful.
To illustrate the use of is.na()
, we first need some missing data. Since mtcars
has no missing data, we can use the mutate()
function to temporarily add a column with NA
values. The mutate()
code below gives us a new column called missingdata
that repeats the vector c(1, 2, 3, 4, 5, 6, 7, NA)
four times. (mtcars
has 32 rows, so we need to repeat a vector with 8 numbers 4 times.) !is.na()
will then return rows that do not (!
) have NA
in missingdata
.
mtcars %>%
mutate(missingdata = rep(c(1:7, NA), 4)) %>%
filter(cyl %in% c(4, 6),
mpg > 17 & mpg <= 21,
am != 0 | gear == 4,
!is.na(missingdata))
mpg cyl disp hp drat wt qsec vs am gear carb missingdata
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 2
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 3
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 6
dplyr
has two other functions that are useful when using filter()
: between()
and near()
.
between()
is a shortcut for variable >= number & variable <= number
. It takes three arguments: the variable name, the lower boundary, and the upper boundary. The two blocks of code below are equivalent.
mtcars %>%
filter(between(mpg, 25, 30))
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.9 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
mtcars %>%
filter(mpg >= 25 & mpg <= 30)
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.9 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
near()
is an alternative to ==
that takes into account computer precision. Sometimes numbers that should be equal actually differ by a very small amount because of the way computers store data. near()
takes this into account, but ==
tests for exact equality.
sqrt(2) ^ 2
[1] 2
(sqrt(2) ^ 2) == 2
[1] FALSE
near(sqrt(2) ^ 2, 2)
[1] TRUE
The tolerance, or how much error we can accept, when testing equality is specified by the tol
argument. Its default is sqrt(.Machine$double.eps)
, or about 0.000000015. Changing the tolerance allows us to do more with the near()
function. Instead of specifying lower and upper boundaries like with between()
, with near()
, we can specify the middle value and how much lower or higher we want. Again, the two blocks of code below are equivalent.
mtcars %>%
filter(near(mpg, 27.5, 2.5))
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.9 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
mtcars %>%
filter(mpg >= (27.5 - 2.5) & mpg <= (27.5 + 2.5))
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.9 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Instead of specifying the values directly, we can compute them from the data. The code below filters to get rows with mpg
within one standard deviation of the mean.
mtcars %>%
filter(near(mpg, mean(mpg), sd(mpg)))
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Any calculations we have filter()
do are done on whatever grouping exists in the dataframe. This can cause problems if we are forget our dataframe is grouped, but we can also use it for more advanced filtering. The code below is similar to the code above, except the mean and standard deviation are calculated separately for each value of cyl
(4, 6, 8). (Four-cylinder cars have their own mean and standard deviation for mpg
, six-cylinder cars have theirs, and so do eight-cylinder cars. See the chapter on Aggregating for more on group_by()
.)
mtcars %>%
group_by(cyl) %>%
filter(near(mpg, mean(mpg), sd(mpg)))
# A tibble: 21 x 11
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
6 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
7 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
9 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
10 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
# ... with 11 more rows
16.4 Subsetting Exercises
Use the state.x77
dataset for 1 and 2.
Keep all columns except
Population
. (Hint: If you received an error message aboutUseMethod
, what kind of object doesselect()
require?)Drop columns with spaces in their names.
Use the iris
dataset for 3 and 4.
Drop any columns that are not numeric or do not start with “S”.
Keep rows where
Sepal.Length
is divisible by 1.0. (Hint: see%%
in?Arithmetic
.)
Use the airquality
dataset for 5 and 6.
Find days where the value of
Ozone
is in the range 30-50.Find days where the value of
Ozone
is more than two standard deviations away from the mean, with mean and standard deviation calculated by month. (Hint: If you got zero rows in the output, did you check if Ozone is fully observed?)
Bonus:
- Use the
mtcars
dataset. Create a 3x3 scatterplot matrix (pass a dataframe toplot()
) of the mpg, weight, and horsepower of cars that have four or six cylinders, at least four forward gears, and manual transmission. Seehelp(mtcars)
to figure out which column is which.