Very Brief Intro to R

R vs. RStudio

R is the language.

RStudio is the interface we use to interact with R.

You can run R by itself (command line), but RStudio is much more popular.

Update R at least once per year, and RStudio whenever you like.

Starting Settings

In Tools > Global Options > General:

Uncheck all boxes under R Sessions, Workspace, and History.

Set “Save workspace to .RData on exit” to Never.

In Code, check “Soft-wrap R source files”.

Check out options in Appearance.

Objects

R is an object-oriented language. Objects can contain various kinds of data depending on their type.

A dataframe is one kind of object. It contains many other objects. Unlike Stata, you can have many dataframes at the same time. This means you always have to specify which dataframe you are talking about.
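A minimal sketch of what that looks like in practice, using two small made-up dataframes. The names class_a and class_b and their values are invented for illustration, and <- is the assignment operator described just below. The $ operator picks a column out of one specific dataframe:

```r
# Two hypothetical dataframes that exist at the same time
class_a <- data.frame(id = 1:3, score = c(90, 85, 77))
class_b <- data.frame(id = 1:2, score = c(60, 70))

# Both have a column called score, so you must say which dataframe you mean
mean_a <- mean(class_a$score)
mean_b <- mean(class_b$score)
mean_a  # 84
mean_b  # 65
```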

The assignment operator in R is <- (Alt+- in RStudio, that is, Alt plus the minus key). It is similar to = in Stata (not ==).

Objects are created using the assignment operator:

x <- 1

If you reference an object without assigning it to anything, R will assume you want to print it (implicit print).

x

[1] 1

You can use an object in an expression without assigning the result. The result will be printed instead.

x + 1

[1] 2

The object remains unchanged:

x

[1] 1

You change an object by assigning something else to it. It does not have to be the same type of object:

x <- "Hello World"
x

[1] "Hello World"

Be careful: there’s no gen or replace to warn you that you are creating or changing an object.
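If you want a safety check, one option is to ask R whether a name is already in use before assigning to it. This is only a sketch: my_result is a made-up name, and exists() is a base R function that is not used elsewhere in these notes.

```r
# my_result is a hypothetical object name used only for illustration
before <- exists("my_result")   # FALSE if nothing by that name has been created yet
my_result <- 10
after <- exists("my_result")    # TRUE: a second assignment would silently overwrite it
before
after
```

The Environment pane in RStudio shows the same information at a glance.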

Functions

Functions do most of the work.

x <- rnorm(10, mean=5)
x

 [1] 3.518141 1.609850 5.785935 5.179979 6.611007 5.696884 6.101362 4.132358
 [9] 4.246165 4.066455

The rnorm() function draws random numbers from the normal distribution.

The first argument (10) is how many numbers to draw. This is a positional argument. rnorm() knows what it means by its position (first). The first argument is usually a “most important” argument. Often it’s the dataframe or other object the function will act on. Hopefully it will be obvious and easy to remember. Sometimes the second argument is pretty obvious too.

The second argument is the mean of the normal distribution to draw from. This is a named argument, so it’s always clear what it means.

Positional arguments should go before named arguments. (R does not strictly enforce this, but mixing them makes calls hard to read.) Named arguments can go in any order (position doesn’t matter).

All arguments have names and you are welcome to use them. (The first argument for rnorm() is called n.) Or you can use all positional arguments if you can remember the order. (The mean argument is second.)

x <- rnorm(n=10, mean=5)
x

 [1] 2.817088 5.213108 3.249033 4.570739 5.218907 4.893723 3.569444 3.816337
 [9] 5.623607 3.809000

x <- rnorm(10, 5)
x

 [1] 6.353828 6.734600 3.939010 5.260093 4.858859 5.687071 3.732832 4.675296
 [9] 3.030180 6.102467

My suggestion is to use one or two positional arguments if they’re easy to remember and use names for everything after that.
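To check that these call styles really are equivalent, you can fix the random seed before each call so the draws are repeatable. The sketch below assumes set.seed() (a base R function not used elsewhere in these notes) and an arbitrary seed value of 42; the object names are made up.

```r
# Reset the seed before each call so every call draws the same random numbers
set.seed(42)
draws_mixed <- rnorm(10, mean = 5)       # one positional, one named

set.seed(42)
draws_named <- rnorm(mean = 5, n = 10)   # all named, in a different order

set.seed(42)
draws_positional <- rnorm(10, 5)         # all positional

identical(draws_mixed, draws_named)       # TRUE
identical(draws_mixed, draws_positional)  # TRUE
```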

The rnorm() function returns a vector object. Vectors contain multiple objects of the same type. In this case it is a vector of numbers. (Dataframes are a collection of vectors.)
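A short sketch of what “same type” means in practice. The object names and values are invented for illustration, and c(), class(), and data.frame() are base R functions. If you mix types in one vector, R converts everything to a single common type:

```r
nums <- c(1, 2, 3)
class(nums)     # "numeric"

# Mixing a number and a string: the number is converted to a string
mixed <- c(1, "a")
class(mixed)    # "character"

# Each column of a dataframe is a vector
scores <- data.frame(id = 1:3, score = nums)
is.numeric(scores$score)   # TRUE
length(scores$score)       # 3
```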

Packages & Libraries

Base R comes with some functions, but many more (and many you’ll need) are defined in packages. Packages need to be installed on your computer before you can use them. This is done using the install.packages() function.

#install.packages("tidyverse")

This takes a long time! Don’t put install.packages() in your working R scripts. You can run it from the console, or create a separate script that installs all the packages you need. This can be useful because every time you update R you have to re-install your packages. RStudio will also notice if you try to use a package you haven’t installed and ask to install it for you.
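One possible shape for that separate install script is sketched below. The helper name missing_packages() is made up, and the package list is just the packages used in these notes; adjust it to your own needs. It relies on requireNamespace(), a base R function that reports whether a package is available. The install line is commented out, as above, so nothing is installed by accident.

```r
# Hypothetical helper: return the packages in pkgs that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "polite", "rvest")
to_install <- missing_packages(needed)
to_install

# Uncomment to install whatever is missing
# install.packages(to_install)
```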

Once a package is installed, you load it into your current R session with the library() function.

library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.4     v readr     2.1.5
v forcats   1.0.0     v stringr   1.5.1
v ggplot2   3.5.1     v tibble    3.2.1
v lubridate 1.9.3     v tidyr     1.3.1
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Sometimes packages use the same function names. When you call a function, you’ll get the version from the most recently loaded package. Other functions with the same name are masked.
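If you need a masked function, or just want to be explicit about which version you are calling, you can prefix the function with its package name and two colons. The sketch below calls the stats version of filter(), which computes a moving average and is unrelated to dplyr::filter(); the input values are made up.

```r
# Explicitly call filter() from the stats package, even if dplyr is loaded
# This computes a 3-point moving average of the numbers 1 to 5
smoothed <- stats::filter(1:5, rep(1/3, 3))
smoothed
```

The first and last values are NA because there are not enough neighbors to average at the ends.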

Pipes

Often you’ll want to use an object that is the result of a function as an argument for another function:

log(mean(1:5), base=10)

[1] 0.4771213

This command starts with 1:5, the vector of numbers from 1 to 5. This is the first (and only) argument of the mean() function, which returns the mean value of the vector. This then becomes the first argument of the log() function, which returns its base-10 log.

This is not easy to read!

The pipe (|>) takes whatever comes before it and inserts it as the first argument of the following function (in other words, the pipe “passes it” to the function). So the piped equivalent is:

1:5 |>
  mean() |>
  log(base=10)

[1] 0.4771213

1:5 is passed to mean() and the result is passed to log(). Much clearer! Unfortunately it only works when the object to be passed in is the first argument of the function, but the tidyverse is built to do that whenever possible.
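For the occasional function where the object needs to go somewhere other than the first argument, recent versions of R offer a workaround: an underscore placeholder that marks where the piped value should go. This sketch assumes R 4.2 or later, and the placeholder must be used with a named argument. The numbers are made up for illustration.

```r
# Requires R 4.2 or later: _ marks where the piped value is inserted
# Here 10 is passed as the base argument rather than the first argument
result <- 10 |> log(x = 100, base = _)
result  # 2
```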

If you start with an assignment, then the result of the whole chain of function calls is stored in the object:

x <- 
  1:5 |>
  mean() |>
  log(base=10)
x

[1] 0.4771213

Learn More

See the SSCC KB, or better yet take Jason’s classes.

A First Example of Web Scraping

Take a look at this page of Dane County election results. Here’s a simple attempt to scrape it into R:

library(polite)
library(rvest)


Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

# download webpage
dat <- 
  bow("https://elections.countyofdane.com/Election-Result/135") |> 
  scrape()

Warning in rt_request_handler(request = request, on_redirect = on_redirect, :
strings not representable in native encoding will be translated to UTF-8

# scrape tables
results <- 
  dat |> 
  html_table()

# name tables
names(results) <-
  dat |> 
  html_elements("h4") |> 
  html_text2()

results

$`Representative to the Assembly District 37 - Official Canvass`
# A tibble: 4 x 3
  Candidate                          `Vote Percentage` `Number of Votes`
  <chr>                              <chr>             <chr>            
1 Dem Pete Adams (Dem)               59.2%             1,161            
2 Rep William Penterman (Rep)        38.7%             758              
3 IND Stephen W. Ratzlaff, Jr. (IND) 2.1%              41               
4 Write-in (Non)                     0.0%              0                

$`County Supervisor District 19 - Official Canvass`
# A tibble: 3 x 3
  Candidate               `Vote Percentage` `Number of Votes`
  <chr>                   <chr>                         <int>
1 Timothy Rockwell (Non)  52.4%                           610
2 Kristen M. Morris (Non) 47.5%                           553
3 Write-in (Non)          0.1%                              1

$`PERCENTAGE OF REGISTERED VOTERS THAT CAST BALLOTS`
# A tibble: 3 x 2
  X1                      X2    
  <chr>                   <chr> 
1 Registered Voters Total 21,787
2 Ballots Cast Total      3,127 
3 Turnout Percentage      14.4%

Discuss with a partner (or two). Talk through the structure of the script even though you don’t know what the functions do yet. (What gets passed to what?) What else do you notice about the script? The results?