Regression Formulas and Building Models

Doug Hemken

June 2018

First, load some example data and look at it.

load(url("http://www.ssc.wisc.edu/~hemken/Rworkshops/hsb.RData"))

head(hsb)
##    id gender  race    ses schtyp     prog read write math science socst
## 1  70   male white    low public  general   57    52   41      47    57
## 2 121 female white middle public vocation   68    59   53      63    61
## 3  86   male white   high public  general   44    33   54      58    31
## 4 141   male white   high public vocation   63    44   47      53    56
## 5 172   male white middle public academic   47    52   57      53    61
## 6 113   male white middle public academic   44    52   51      63    61

Many commands in R are specified in terms of formulas. A formula contains a tilde (~), and the terms on its left-hand and right-hand sides are composed of object names (usually existing vectors, that is, variables or columns in a data frame, but also matrices or the results of embedded functions such as log()). Terms are connected by a variety of math-like symbols that have their own algebra.

Formulas have their own distinct class, and can even be saved as formula objects.

For example, a scatterplot can be specified as a formula within the plot() function. Wherever a function accepts a formula to describe the relations between variables, it almost always also accepts a data= argument that points to the source of those variables.

plot(math ~ read, data=hsb)

class(math ~ read)
## [1] "formula"
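
Because a formula is an ordinary R object, it can be stored and inspected before any model is fit. The sketch below is only illustrative: it uses the built-in mtcars data rather than hsb so that it runs without the download, and the names f and fit are made up for the example.

```r
# Store a formula in an object; nothing is evaluated until a function uses it.
f <- mpg ~ wt + hp                 # mtcars columns, used here only for illustration
class(f)                           # the class of the stored object
all.vars(f)                        # every variable name mentioned in the formula
attr(terms(f), "term.labels")      # the right-hand-side terms only

# A saved formula works anywhere a literal formula does.
fit <- lm(f, data = mtcars)
formula(fit)                       # recover the formula from the fitted model
```

Saving the formula first is convenient when the same specification is passed to several functions, for example plot() and lm().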

Formulas are the central element in specifying regression models, using lm() and a variety of other modeling functions.

# add a regression line to the plot, using the coefficients from the lm() fit shown next
plot(math ~ read, data=hsb)
abline(21.0382, 0.6051)

lm(math ~ read, data=hsb)
## 
## Call:
## lm(formula = math ~ read, data = hsb)
## 
## Coefficients:
## (Intercept)         read  
##     21.0382       0.6051

ANOVA models work in the same way. Note that prog is stored as a factor; this is crucial to getting the correct model, because a factor is expanded into indicator variables rather than being treated as a number.

plot(math ~ prog, data=hsb)

lm(math ~ prog, data=hsb)
## 
## Call:
## lm(formula = math ~ prog, data = hsb)
## 
## Coefficients:
##  (Intercept)   proggeneral  progvocation  
##       56.733        -6.711       -10.313

Notice, too, that the jargon can get confusing here. The formula is written with one term on the right-hand side, and a second term, the intercept, is assumed/implied. However, the model has three terms: the intercept and one for each of the two non-reference levels of prog.
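
The difference between formula terms and model terms can be checked directly. This is a minimal sketch using the built-in mtcars data (an assumption made so the code runs on its own); cyl is converted to a factor to play the role that prog plays above.

```r
# cyl has three distinct values (4, 6, 8); make it a factor in a copy of the data
d <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ cyl, data = d)

attr(terms(fit), "term.labels")    # formula terms: just "cyl"
colnames(model.matrix(fit))        # model terms: intercept plus two indicators

# Left as a number, cyl would instead get a single slope
fit_num <- lm(mpg ~ cyl, data = mtcars)
length(coef(fit_num))
```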

Sometimes we want to rearrange the levels in a factor. If we just want a different reference category, we can use relevel(). If we want to reorder all the levels, we call factor() again with the levels listed in the desired order.

str(hsb$ses)
##  Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ...
head(hsb$ses)
## [1] low    middle high   high   middle middle
## Levels: high low middle
lm(math ~ ses, data=hsb)
## 
## Call:
## lm(formula = math ~ ses, data = hsb)
## 
## Coefficients:
## (Intercept)       seslow    sesmiddle  
##      56.172       -7.002       -3.962
hsb$ses <- relevel(hsb$ses, ref="low")
lm(math ~ ses, data=hsb)
## 
## Call:
## lm(formula = math ~ ses, data = hsb)
## 
## Coefficients:
## (Intercept)      seshigh    sesmiddle  
##      49.170        7.002        3.040
hsb$ses <- factor(hsb$ses, levels=c("low", "middle", "high"))
lm(math ~ ses, data=hsb)
## 
## Call:
## lm(formula = math ~ ses, data = hsb)
## 
## Coefficients:
## (Intercept)    sesmiddle      seshigh  
##      49.170        3.040        7.002
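
The effect of the two functions on the level order can be seen without fitting a model. The small vector below is made up for illustration.

```r
# A toy factor; by default the levels are sorted alphabetically
ses <- factor(c("low", "middle", "high", "high", "middle"))
levels(ses)

# relevel() moves one level to the front and leaves the rest in place
ses_ref <- relevel(ses, ref = "low")
levels(ses_ref)

# factor() with an explicit levels= argument sets the whole order
ses_ord <- factor(ses, levels = c("low", "middle", "high"))
levels(ses_ord)
```

Only the first level matters for the reference category, which is why the two relevelled models above have the same intercept.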

A model with more than one term in the formula:

plot(math ~ read, data=hsb)
abline(27.9952, 0.5117)
abline(27.9952-3.4330, 0.5117)
abline(27.9952-5.2158, 0.5117)

lm(math ~ prog + read, data=hsb)
## 
## Call:
## lm(formula = math ~ prog + read, data = hsb)
## 
## Coefficients:
##  (Intercept)   proggeneral  progvocation          read  
##      27.9952       -3.4330       -5.2158        0.5117
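
The abline() calls above add each group's coefficient to the intercept by hand. The same arithmetic can be done from the fitted object; this sketch assumes the built-in mtcars data, with cyl standing in for prog and wt for read.

```r
d <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ cyl + wt, data = d)
b <- coef(fit)

# One intercept per group: the reference intercept plus that group's offset
intercepts <- c(b["(Intercept)"],
                b["(Intercept)"] + b["cyl6"],
                b["(Intercept)"] + b["cyl8"])
names(intercepts) <- levels(d$cyl)
intercepts

# All groups share one slope in this model, so the lines are parallel
slope <- b["wt"]
slope
```

With a plot open, abline(intercepts["6"], slope) would draw the line for that group without retyping any numbers.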

Interactions may be specified a couple of different ways. An asterisk, *, means to include the higher order term plus all the related lower order terms. Alternatively, specific interaction terms may be specified with the colon, :.

plot(math ~ read, data=hsb)
abline(21.3612, 0.6298)
abline(21.3612+12.8386, 0.6298-0.3118)
abline(21.3612+6.2033, 0.6298-0.2217)

lm(math ~ prog * read, data=hsb)
## 
## Call:
## lm(formula = math ~ prog * read, data = hsb)
## 
## Coefficients:
##       (Intercept)        proggeneral       progvocation  
##           21.3612            12.8386             6.2033  
##              read   proggeneral:read  progvocation:read  
##            0.6298            -0.3118            -0.2217
# the same model
lm(math ~ prog + read + prog:read, data=hsb)
## 
## Call:
## lm(formula = math ~ prog + read + prog:read, data = hsb)
## 
## Coefficients:
##       (Intercept)        proggeneral       progvocation  
##           21.3612            12.8386             6.2033  
##              read   proggeneral:read  progvocation:read  
##            0.6298            -0.3118            -0.2217
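
That the two specifications are the same can be confirmed from their expanded terms, and the interaction coefficients can be turned into one slope per group. A minimal sketch, again assuming mtcars with cyl as the factor:

```r
d <- transform(mtcars, cyl = factor(cyl))

f_star  <- mpg ~ cyl * wt
f_colon <- mpg ~ cyl + wt + cyl:wt
labels_star  <- attr(terms(f_star), "term.labels")
labels_colon <- attr(terms(f_colon), "term.labels")
identical(labels_star, labels_colon)

# Slope for each group: the reference slope plus that group's interaction term
b <- coef(lm(f_star, data = d))
slopes <- c(b["wt"], b["wt"] + b["cyl6:wt"], b["wt"] + b["cyl8:wt"])
names(slopes) <- levels(d$cyl)
slopes
```

With a full interaction, each group's line is the same as the line from a separate regression within that group.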

Models may be modified - terms added or removed - through the use of the update() function. In this context, a period, ., represents all the terms included in the previous model.

m1 <- lm(math ~ prog * read, data=hsb)
summary(m1)
## 
## Call:
## lm(formula = math ~ prog * read, data = hsb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.2340  -5.1950  -0.1676   4.8836  21.7235 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       21.36122    3.88854   5.493 1.23e-07 ***
## proggeneral       12.83861    6.74579   1.903   0.0585 .  
## progvocation       6.20329    6.36169   0.975   0.3307    
## read               0.62982    0.06826   9.227  < 2e-16 ***
## proggeneral:read  -0.31182    0.12858  -2.425   0.0162 *  
## progvocation:read -0.22170    0.12696  -1.746   0.0824 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.675 on 194 degrees of freedom
## Multiple R-squared:  0.5051, Adjusted R-squared:  0.4924 
## F-statistic:  39.6 on 5 and 194 DF,  p-value: < 2.2e-16
m2 <- update(m1, .~.-prog:read) # removes two model terms
summary(m2)
## 
## Call:
## lm(formula = math ~ prog + read, data = hsb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.7994  -4.6484  -0.8686   4.8846  19.9834 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.99519    2.96929   9.428  < 2e-16 ***
## proggeneral  -3.43297    1.24908  -2.748  0.00655 ** 
## progvocation -5.21581    1.27015  -4.106  5.9e-05 ***
## read          0.51170    0.05155   9.927  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.761 on 196 degrees of freedom
## Multiple R-squared:  0.487,  Adjusted R-squared:  0.4792 
## F-statistic: 62.03 on 3 and 196 DF,  p-value: < 2.2e-16

In a formula, a minus sign, -, is used to remove a term from a model.
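
A few more update() patterns, sketched on the built-in mtcars data (the model and variable choices are for illustration only):

```r
fit <- lm(mpg ~ wt * hp, data = mtcars)

# Remove the interaction, keeping the main effects
fit_main <- update(fit, . ~ . - wt:hp)
formula(fit_main)

# Add a term to what is already there
fit_more <- update(fit_main, . ~ . + qsec)
formula(fit_more)

# The period also works on the left-hand side, e.g. to transform the response
fit_log <- update(fit_main, log(.) ~ .)
formula(fit_log)
```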

The anova() function gives us a way to make tables of F tests. Given a single model, it returns an anova decomposition of our model. Given two or more models, it compares the models. But one caveat is that it is up to the user (you) to be sure that the models can be compared meaningfully.

anova(m1)
## Analysis of Variance Table
## 
## Response: math
##            Df Sum Sq Mean Sq  F value  Pr(>F)    
## prog        2 4002.1  2001.1  44.9129 < 2e-16 ***
## read        1 4504.3  4504.3 101.0973 < 2e-16 ***
## prog:read   2  315.9   158.0   3.5453 0.03074 *  
## Residuals 194 8643.5    44.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m2,m1)
## Analysis of Variance Table
## 
## Model 1: math ~ prog + read
## Model 2: math ~ prog * read
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    196 8959.4                              
## 2    194 8643.5  2    315.91 3.5453 0.03074 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
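
The F test that anova() reports for two nested models can be reproduced by hand from the residual sums of squares, which makes clear what is being compared. A sketch assuming the built-in mtcars data:

```r
m_small <- lm(mpg ~ wt, data = mtcars)
m_big   <- lm(mpg ~ wt + hp + qsec, data = mtcars)   # m_small is nested in m_big

rss0 <- deviance(m_small);    rss1 <- deviance(m_big)      # residual sums of squares
df0  <- df.residual(m_small); df1  <- df.residual(m_big)   # residual degrees of freedom

# Change in RSS per added parameter, relative to the larger model's error variance
F_stat <- ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)
p_val  <- pf(F_stat, df0 - df1, df1, lower.tail = FALSE)

tab <- anova(m_small, m_big)
c(by_hand = F_stat, from_anova = tab$F[2])
```

The calculation only makes sense when the smaller model is a special case of the larger one and both are fit to the same observations.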

Another point to bear in mind is that the sums of squares reported (and therefore the F tests based on them) are the "type 1" (sequential or experimentalist's) sums of squares. For the "type 2" or "type 3" sums of squares more commonly used in the analysis of observational data, use the Anova() function (capitalized) from the car package (Companion to Applied Regression).

library(car)
Anova(m1)
## Anova Table (Type II tests)
## 
## Response: math
##           Sum Sq  Df  F value    Pr(>F)    
## prog       845.6   2   9.4900 0.0001169 ***
## read      4504.3   1 101.0973 < 2.2e-16 ***
## prog:read  315.9   2   3.5453 0.0307446 *  
## Residuals 8643.5 194                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m1, type=3)
## Anova Table (Type III tests)
## 
## Response: math
##             Sum Sq  Df F value    Pr(>F)    
## (Intercept) 1344.5   1 30.1772 1.228e-07 ***
## prog         166.1   2  1.8640   0.15781    
## read        3793.1   1 85.1356 < 2.2e-16 ***
## prog:read    315.9   2  3.5453   0.03074 *  
## Residuals   8643.5 194                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
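
Because anova() uses sequential sums of squares, the result for a term depends on where it appears in the formula. The sketch below shows this with base R only, assuming the built-in mtcars data. For a model without interactions, drop1() gives each term's sum of squares adjusted for all the others, which is what Type II tests report in that case.

```r
a1 <- anova(lm(mpg ~ wt + hp, data = mtcars))   # wt entered first
a2 <- anova(lm(mpg ~ hp + wt, data = mtcars))   # wt entered after hp

ss_wt_first <- a1["wt", "Sum Sq"]
ss_wt_last  <- a2["wt", "Sum Sq"]
c(first = ss_wt_first, last = ss_wt_last)       # same term, different sums of squares

# Each term adjusted for the other, regardless of order
d1 <- drop1(lm(mpg ~ wt + hp, data = mtcars))
d1["wt", "Sum of Sq"]
```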

In formulas, the plus sign, +, means to add terms, but we may also want to represent simple elementwise addition of vectors in a formula. For example, one way to constrain the coefficients of several model terms to be equal is to combine the variables into a single model term. To do this, we use the inhibit function, I(). Expressions within the I() function are interpreted as general R expressions, and not as terms within the formula.

m3 <- lm(math ~ read+write+science+socst, hsb)
coefficients(m3) # to see why we might think they are equal
## (Intercept)        read       write     science       socst 
##  8.96274058  0.27068453  0.22616140  0.25389639  0.08480508
m4 <- lm(math ~ I(read+write+science)+socst, hsb)
anova(m4, m3)
## Analysis of Variance Table
## 
## Model 1: math ~ I(read + write + science) + socst
## Model 2: math ~ read + write + science + socst
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    197 7699.2                           
## 2    195 7691.3  2    7.9174 0.1004 0.9046
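
To see that I() really does ordinary arithmetic, we can compare it with a model that uses a sum computed beforehand. This sketch assumes the built-in mtcars data; the column name total is made up for the example, and summing disp and hp has no substantive meaning here.

```r
# Version 1: do the addition inside the formula
m_I <- lm(mpg ~ I(disp + hp) + wt, data = mtcars)

# Version 2: compute the sum first, then use it as a plain variable
d <- transform(mtcars, total = disp + hp)
m_sum <- lm(mpg ~ total + wt, data = d)

cbind(I_version = coef(m_I), precomputed = coef(m_sum))

# The constrained model is nested in the model with separate coefficients
m_free <- lm(mpg ~ disp + hp + wt, data = mtcars)
cmp <- anova(m_I, m_free)
cmp
```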

The expression within I() becomes a single formula term, which may expand into multiple model terms.

m5 <- lm(math ~ I(read+write+science)*socst, hsb)
summary(m5)
## 
## Call:
## lm(formula = math ~ I(read + write + science) * socst, data = hsb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.5873  -4.0466  -0.1618   4.3809  15.1761 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                     34.960338  12.925968   2.705  0.00744 **
## I(read + write + science)        0.079293   0.086220   0.920  0.35888   
## socst                           -0.430386   0.254115  -1.694  0.09192 . 
## I(read + write + science):socst  0.003310   0.001598   2.072  0.03962 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.2 on 196 degrees of freedom
## Multiple R-squared:  0.5686, Adjusted R-squared:  0.562 
## F-statistic: 86.12 on 3 and 196 DF,  p-value: < 2.2e-16

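Polynomials

In a formula, read:read simplifies to read, so the first model below has no squared term. Arithmetic products and powers must be wrapped in I().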

lm(math ~ read + read:read, data=hsb)
## 
## Call:
## lm(formula = math ~ read + read:read, data = hsb)
## 
## Coefficients:
## (Intercept)         read  
##     21.0382       0.6051
lm(math ~ read + I(read*read), data=hsb)
## 
## Call:
## lm(formula = math ~ read + I(read * read), data = hsb)
## 
## Coefficients:
##    (Intercept)            read  I(read * read)  
##      24.056044        0.487359        0.001106
m6 <- lm(math ~ read + I(read^2), data=hsb)
summary(m6)
## 
## Call:
## lm(formula = math ~ read + I(read^2), data = hsb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.1513  -5.1513  -0.3595   4.7302  16.5695 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 24.056044  11.592589   2.075   0.0393 *
## read         0.487359   0.443661   1.098   0.2733  
## I(read^2)    0.001106   0.004142   0.267   0.7897  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.054 on 197 degrees of freedom
## Multiple R-squared:  0.4388, Adjusted R-squared:  0.4331 
## F-statistic: 77.02 on 2 and 197 DF,  p-value: < 2.2e-16
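
How a formula will be expanded can be checked with terms() before fitting anything. The helper name labels_of and the variables y and x are hypothetical; no data are needed because the expansion is purely symbolic.

```r
# Return the expanded right-hand-side terms of a formula
labels_of <- function(f) attr(terms(f), "term.labels")

t_colon <- labels_of(y ~ x + x:x)       # x:x collapses to x
t_power <- labels_of(y ~ x + x^2)       # ^ is a formula operator, so x^2 is just x
t_I     <- labels_of(y ~ x + I(x^2))    # I() protects the arithmetic
t_Iprod <- labels_of(y ~ x + I(x * x))

list(t_colon, t_power, t_I, t_Iprod)
```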

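Orthogonal polynomials, built with poly(), are easier to estimate but harder to interpret: the fit is identical, but the coefficients no longer refer directly to read and its square.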

m7 <- lm(math ~ poly(read, 2), data=hsb)
summary(m7)
## 
## Call:
## lm(formula = math ~ poly(read, 2), data = hsb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.1513  -5.1513  -0.3595   4.7302  16.5695 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     52.6450     0.4988 105.550   <2e-16 ***
## poly(read, 2)1  87.5258     7.0536  12.409   <2e-16 ***
## poly(read, 2)2   1.8841     7.0536   0.267     0.79    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.054 on 197 degrees of freedom
## Multiple R-squared:  0.4388, Adjusted R-squared:  0.4331 
## F-statistic: 77.02 on 2 and 197 DF,  p-value: < 2.2e-16
anova(m6, m7)
## Analysis of Variance Table
## 
## Model 1: math ~ read + I(read^2)
## Model 2: math ~ poly(read, 2)
##   Res.Df    RSS Df   Sum of Sq F Pr(>F)
## 1    197 9801.5                        
## 2    197 9801.5  0 -1.6371e-11
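
The relation between the raw and orthogonal versions can be checked on simulated data. Everything below (the seed, the sample size, the coefficients used to generate y) is made up for illustration.

```r
set.seed(1)
x <- runif(50, 30, 70)
y <- 20 + 0.5 * x + rnorm(50, sd = 5)

raw  <- lm(y ~ x + I(x^2))
orth <- lm(y ~ poly(x, 2))

# Same fitted values, different parameterization
same_fit <- all.equal(fitted(raw), fitted(orth))
same_fit

# The poly() columns are orthonormal: their cross-product is the identity matrix
G <- crossprod(unclass(poly(x, 2)))
round(G, 10)

# raw = TRUE asks poly() for plain powers, which recovers the raw coefficients
raw2 <- lm(y ~ poly(x, 2, raw = TRUE))
cbind(coef(raw), coef(raw2))
```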

Centering the predictor gives coefficients that are easier to interpret, and the model is somewhat easier to estimate.

readc <- hsb$read - mean(hsb$read)
m8 <- lm(math ~ readc + I(readc^2), data=hsb)
summary(m8)
## 
## Call:
## lm(formula = math ~ readc + I(readc^2), data = hsb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.1513  -5.1513  -0.3595   4.7302  16.5695 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.529265   0.660686  79.507   <2e-16 ***
## readc        0.602942   0.049462  12.190   <2e-16 ***
## I(readc^2)   0.001106   0.004142   0.267     0.79    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.054 on 197 degrees of freedom
## Multiple R-squared:  0.4388, Adjusted R-squared:  0.4331 
## F-statistic: 77.02 on 2 and 197 DF,  p-value: < 2.2e-16
anova(m6,m7,m8)
## Analysis of Variance Table
## 
## Model 1: math ~ read + I(read^2)
## Model 2: math ~ poly(read, 2)
## Model 3: math ~ readc + I(readc^2)
##   Res.Df    RSS Df   Sum of Sq F Pr(>F)
## 1    197 9801.5                        
## 2    197 9801.5  0 -1.6371e-11         
## 3    197 9801.5  0  0.0000e+00
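
Centering changes the meaning of the linear coefficient but not the quadratic one. If the raw model is b0 + b1*x + b2*x^2, then the linear coefficient after centering is the slope of the curve at the mean, b1 + 2*b2*mean(x). A sketch on the built-in mtcars data (an assumption, with hp as the predictor and hpc as a made-up name for its centered version):

```r
fit_raw <- lm(mpg ~ hp + I(hp^2), data = mtcars)

d <- transform(mtcars, hpc = hp - mean(hp))
fit_c <- lm(mpg ~ hpc + I(hpc^2), data = d)

b <- coef(fit_raw)
slope_at_mean <- b["hp"] + 2 * b["I(hp^2)"] * mean(mtcars$hp)

c(derived = unname(slope_at_mean), centered_fit = unname(coef(fit_c)["hpc"]))
c(raw_quadratic = unname(b["I(hp^2)"]), centered_quadratic = unname(coef(fit_c)["I(hpc^2)"]))
```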

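Limiting higher order interactions

Raising a sum of terms to a power, as in (read+write+science+socst)^3, includes all main effects and interactions up to that order (here, three-way) and omits anything higher.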

m9 <- lm(math ~ read*write*science*socst, data=hsb)
summary(m9)
## 
## Call:
## lm(formula = math ~ read * write * science * socst, data = hsb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4737  -4.2876   0.1648   4.0328  14.4510 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               2.078e+02  3.182e+02   0.653    0.515
## read                     -4.123e+00  6.771e+00  -0.609    0.543
## write                    -7.674e+00  7.286e+00  -1.053    0.294
## science                  -3.305e-01  6.410e+00  -0.052    0.959
## socst                    -6.199e-01  6.558e+00  -0.095    0.925
## read:write                1.655e-01  1.496e-01   1.107    0.270
## read:science              2.055e-02  1.350e-01   0.152    0.879
## write:science             9.100e-02  1.410e-01   0.645    0.519
## read:socst                3.408e-02  1.367e-01   0.249    0.803
## write:socst               8.463e-02  1.379e-01   0.614    0.540
## science:socst            -4.744e-02  1.291e-01  -0.367    0.714
## read:write:science       -1.976e-03  2.877e-03  -0.687    0.493
## read:write:socst         -2.045e-03  2.762e-03  -0.740    0.460
## read:science:socst        5.008e-04  2.634e-03   0.190    0.849
## write:science:socst      -5.019e-04  2.630e-03  -0.191    0.849
## read:write:science:socst  1.785e-05  5.207e-05   0.343    0.732
## 
## Residual standard error: 6.227 on 184 degrees of freedom
## Multiple R-squared:  0.5915, Adjusted R-squared:  0.5582 
## F-statistic: 17.76 on 15 and 184 DF,  p-value: < 2.2e-16
m10 <- lm(math ~ (read+write+science+socst)^3, data=hsb)
summary(m10)
## 
## Call:
## lm(formula = math ~ (read + write + science + socst)^3, data = hsb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.3854  -4.1843   0.0599   4.0885  14.4857 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          1.022e+02  7.919e+01   1.290   0.1986  
## read                -1.906e+00  1.994e+00  -0.955   0.3406  
## write               -5.356e+00  2.703e+00  -1.982   0.0490 *
## science              1.779e+00  1.789e+00   0.995   0.3212  
## socst                1.488e+00  2.269e+00   0.656   0.5126  
## read:write           1.176e-01  5.313e-02   2.214   0.0281 *
## read:science        -2.365e-02  3.982e-02  -0.594   0.5532  
## read:socst          -9.768e-03  4.806e-02  -0.203   0.8392  
## write:science        4.569e-02  4.883e-02   0.936   0.3507  
## write:socst          3.964e-02  4.218e-02   0.940   0.3486  
## science:socst       -8.926e-02  4.223e-02  -2.114   0.0359 *
## read:write:science  -1.042e-03  9.172e-04  -1.136   0.2574  
## read:write:socst    -1.124e-03  6.389e-04  -1.759   0.0802 .
## read:science:socst   1.368e-03  7.314e-04   1.870   0.0631 .
## write:science:socst  3.725e-04  6.368e-04   0.585   0.5593  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.212 on 185 degrees of freedom
## Multiple R-squared:  0.5912, Adjusted R-squared:  0.5603 
## F-statistic: 19.11 on 14 and 185 DF,  p-value: < 2.2e-16
m11 <- lm(math ~ (read+write+science+socst)^2, data=hsb)
anova(m10,m11)
## Analysis of Variance Table
## 
## Model 1: math ~ (read + write + science + socst)^3
## Model 2: math ~ (read + write + science + socst)^2
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    185 7139.2                           
## 2    189 7375.3 -4   -236.09 1.5295 0.1954
summary(m11)
## 
## Call:
## lm(formula = math ~ (read + write + science + socst)^2, data = hsb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4705  -4.2089  -0.0629   4.1309  15.1892 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   45.9772403 16.0937548   2.857  0.00476 **
## read          -0.0041504  0.4262509  -0.010  0.99224   
## write         -0.6728462  0.4614207  -1.458  0.14644   
## science        0.0893936  0.3644276   0.245  0.80649   
## socst         -0.0434814  0.3869145  -0.112  0.91064   
## read:write     0.0063856  0.0090835   0.703  0.48293   
## read:science  -0.0054363  0.0071827  -0.757  0.45008   
## read:socst     0.0038652  0.0068955   0.561  0.57578   
## write:science  0.0107600  0.0082152   1.310  0.19187   
## write:socst    0.0008474  0.0069352   0.122  0.90288   
## science:socst -0.0022118  0.0078241  -0.283  0.77772   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.247 on 189 degrees of freedom
## Multiple R-squared:  0.5777, Adjusted R-squared:  0.5554 
## F-statistic: 25.86 on 10 and 189 DF,  p-value: < 2.2e-16
anova(m3,m11)
## Analysis of Variance Table
## 
## Model 1: math ~ read + write + science + socst
## Model 2: math ~ (read + write + science + socst)^2
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    195 7691.3                           
## 2    189 7375.3  6    315.98 1.3496 0.2372
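
The number of terms each specification produces can be counted symbolically. The helper labels_of and the variable names a, b, c, d are hypothetical.

```r
# Return the expanded right-hand-side terms of a formula
labels_of <- function(f) attr(terms(f), "term.labels")

full  <- labels_of(y ~ a * b * c * d)        # everything up to the four-way term
upto3 <- labels_of(y ~ (a + b + c + d)^3)    # stops at three-way interactions
upto2 <- labels_of(y ~ (a + b + c + d)^2)    # stops at two-way interactions

c(full = length(full), upto3 = length(upto3), upto2 = length(upto2))
setdiff(full, upto3)                         # the only term the cube leaves out
```

These counts match the models above: 15, 14, and 10 terms besides the intercept.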

The use of the poly() function brings up another sidelight. Matrices may be used as formula terms, where each matrix column becomes a model term. The data below are random, so the coefficients will differ from run to run.

m <- matrix(runif(200), ncol=5)
lm(m[,1] ~ m[,2:5])
## 
## Call:
## lm(formula = m[, 1] ~ m[, 2:5])
## 
## Coefficients:
## (Intercept)    m[, 2:5]1    m[, 2:5]2    m[, 2:5]3    m[, 2:5]4  
##     0.25144      0.27210      0.05635      0.16750     -0.06199
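
A reproducible variant, with a seed set and the matrix given a name, shows that a matrix is one formula term but several model terms. The seed and the object names X and y are arbitrary choices for this sketch.

```r
set.seed(42)
X <- matrix(runif(120), ncol = 3)   # 40 rows, 3 predictor columns
y <- runif(40)

fit_mat <- lm(y ~ X)
attr(terms(fit_mat), "term.labels") # one formula term
names(coef(fit_mat))                # one coefficient per column, plus the intercept

# Equivalent to listing the columns one at a time
fit_cols <- lm(y ~ X[, 1] + X[, 2] + X[, 3])
all.equal(fitted(fit_mat), fitted(fit_cols))
```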

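Step by step checks for a group of variables

add1() tests each candidate term added singly to a model, drop1() tests each term removed singly, and step() repeats such additions and deletions automatically, choosing by AIC.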

add1(m3, scope = ~ .+gender+race+ses+schtyp+prog, data=hsb, test="F")
## Single term additions
## 
## Model:
## math ~ read + write + science + socst
##        Df Sum of Sq    RSS    AIC F value    Pr(>F)    
## <none>              7691.3 739.91                      
## gender  1     40.22 7651.1 740.86  1.0199 0.3138022    
## race    3    182.35 7509.0 741.11  1.5542 0.2019249    
## ses     2     18.26 7673.1 743.43  0.2296 0.7950576    
## schtyp  1      5.08 7686.2 741.77  0.1282 0.7206863    
## prog    2    551.44 7139.9 729.03  7.4531 0.0007622 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
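
Each row of an add1() table is the same test we would get by fitting the larger model ourselves and comparing it with anova(). A sketch assuming the built-in mtcars data; the candidate variables are chosen only for illustration.

```r
base <- lm(mpg ~ wt, data = mtcars)

# Try each candidate on its own, always starting from the base model
cand <- add1(base, scope = ~ . + hp + qsec + drat, test = "F")
cand

# The hp row, reproduced by hand
by_hand <- anova(base, update(base, . ~ . + hp))
c(add1 = cand["hp", "F value"], by_hand = by_hand$F[2])
```
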
m12 <- lm(math ~ ., hsb)
anova(m12)
## Analysis of Variance Table
## 
## Response: math
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## id          1  839.5   839.5 22.6753 3.861e-06 ***
## gender      1    1.8     1.8  0.0496 0.8239642    
## race        3 1098.6   366.2  9.8913 4.418e-06 ***
## ses         2  740.6   370.3 10.0018 7.507e-05 ***
## schtyp      1    6.8     6.8  0.1833 0.6690919    
## prog        2 2934.7  1467.3 39.6352 4.720e-15 ***
## read        1 3665.5  3665.5 99.0110 < 2.2e-16 ***
## write       1  760.5   760.5 20.5431 1.044e-05 ***
## science     1  544.9   544.9 14.7177 0.0001712 ***
## socst       1   24.1    24.1  0.6509 0.4208275    
## Residuals 185 6848.9    37.0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(m12, test="F")
## Single term deletions
## 
## Model:
## math ~ id + gender + race + ses + schtyp + prog + read + write + 
##     science + socst
##         Df Sum of Sq    RSS    AIC F value    Pr(>F)    
## <none>               6848.9 736.71                      
## id       1     72.31 6921.2 736.81  1.9531 0.1639240    
## gender   1     11.35 6860.3 735.04  0.3067 0.5804072    
## race     3    251.00 7099.9 737.90  2.2599 0.0829589 .  
## ses      2      2.76 6851.7 732.79  0.0372 0.9634694    
## schtyp   1     41.84 6890.8 735.92  1.1301 0.2891458    
## prog     2    493.14 7342.1 746.61  6.6602 0.0016103 ** 
## read     1    499.61 7348.5 748.79 13.4954 0.0003130 ***
## write    1    225.92 7074.8 741.20  6.1026 0.0144042 *  
## science  1    548.59 7397.5 750.12 14.8184 0.0001629 ***
## socst    1     24.10 6873.0 735.41  0.6509 0.4208275    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
step(m12)
## Start:  AIC=736.71
## math ~ id + gender + race + ses + schtyp + prog + read + write + 
##     science + socst
## 
##           Df Sum of Sq    RSS    AIC
## - ses      2      2.76 6851.7 732.79
## - gender   1     11.35 6860.3 735.04
## - socst    1     24.10 6873.0 735.41
## - schtyp   1     41.84 6890.8 735.92
## <none>                 6848.9 736.71
## - id       1     72.31 6921.2 736.81
## - race     3    251.00 7099.9 737.90
## - write    1    225.92 7074.8 741.20
## - prog     2    493.14 7342.1 746.61
## - read     1    499.61 7348.5 748.79
## - science  1    548.59 7397.5 750.12
## 
## Step:  AIC=732.79
## math ~ id + gender + race + schtyp + prog + read + write + science + 
##     socst
## 
##           Df Sum of Sq    RSS    AIC
## - gender   1     12.87 6864.5 731.16
## - socst    1     28.62 6880.3 731.62
## - schtyp   1     39.66 6891.3 731.94
## <none>                 6851.7 732.79
## - id       1     71.99 6923.7 732.88
## - race     3    254.54 7106.2 734.08
## - write    1    223.57 7075.2 737.21
## - prog     2    501.88 7353.6 742.92
## - read     1    498.21 7349.9 744.82
## - science  1    555.39 7407.1 746.37
## 
## Step:  AIC=731.16
## math ~ id + race + schtyp + prog + read + write + science + socst
## 
##           Df Sum of Sq    RSS    AIC
## - socst    1     28.48 6893.0 729.99
## - schtyp   1     44.02 6908.6 730.44
## <none>                 6864.5 731.16
## - id       1     82.68 6947.2 731.56
## - race     3    262.00 7126.5 732.65
## - write    1    216.83 7081.4 735.38
## - prog     2    511.73 7376.3 741.54
## - read     1    530.98 7395.5 744.06
## - science  1    625.46 7490.0 746.60
## 
## Step:  AIC=729.99
## math ~ id + race + schtyp + prog + read + write + science
## 
##           Df Sum of Sq    RSS    AIC
## - schtyp   1     50.96 6944.0 729.46
## <none>                 6893.0 729.99
## - id       1     95.12 6988.1 730.73
## - race     3    253.37 7146.4 731.21
## - write    1    303.61 7196.6 736.61
## - prog     2    569.55 7462.6 741.87
## - science  1    625.17 7518.2 745.35
## - read     1    695.81 7588.8 747.22
## 
## Step:  AIC=729.46
## math ~ id + race + prog + read + write + science
## 
##           Df Sum of Sq    RSS    AIC
## - id       1     45.51 6989.5 728.77
## <none>                 6944.0 729.46
## - race     3    213.13 7157.1 729.51
## - write    1    288.99 7233.0 735.62
## - prog     2    552.13 7496.1 740.76
## - read     1    656.84 7600.8 745.54
## - science  1    698.22 7642.2 746.62
## 
## Step:  AIC=728.77
## math ~ race + prog + read + write + science
## 
##           Df Sum of Sq    RSS    AIC
## - race     3    173.13 7162.6 727.66
## <none>                 6989.5 728.77
## - write    1    290.10 7279.6 734.90
## - prog     2    637.13 7626.6 742.22
## - read     1    612.74 7602.2 743.58
## - science  1    762.98 7752.5 747.49
## 
## Step:  AIC=727.66
## math ~ prog + read + write + science
## 
##           Df Sum of Sq    RSS    AIC
## <none>                 7162.6 727.66
## - write    1    395.93 7558.6 736.42
## - prog     2    615.92 7778.5 740.16
## - read     1    581.00 7743.6 741.26
## - science  1    845.74 8008.4 747.98
## 
## Call:
## lm(formula = math ~ prog + read + write + science, data = hsb)
## 
## Coefficients:
##  (Intercept)   proggeneral  progvocation          read         write  
##      16.5056       -3.7924       -4.1233        0.2401        0.2015  
##      science  
##       0.2863
step(m12, direction="both")
## Start:  AIC=736.71
## math ~ id + gender + race + ses + schtyp + prog + read + write + 
##     science + socst
## 
##           Df Sum of Sq    RSS    AIC
## - ses      2      2.76 6851.7 732.79
## - gender   1     11.35 6860.3 735.04
## - socst    1     24.10 6873.0 735.41
## - schtyp   1     41.84 6890.8 735.92
## <none>                 6848.9 736.71
## - id       1     72.31 6921.2 736.81
## - race     3    251.00 7099.9 737.90
## - write    1    225.92 7074.8 741.20
## - prog     2    493.14 7342.1 746.61
## - read     1    499.61 7348.5 748.79
## - science  1    548.59 7397.5 750.12
## 
## Step:  AIC=732.79
## math ~ id + gender + race + schtyp + prog + read + write + science + 
##     socst
## 
##           Df Sum of Sq    RSS    AIC
## - gender   1     12.87 6864.5 731.16
## - socst    1     28.62 6880.3 731.62
## - schtyp   1     39.66 6891.3 731.94
## <none>                 6851.7 732.79
## - id       1     71.99 6923.7 732.88
## - race     3    254.54 7106.2 734.08
## + ses      2      2.76 6848.9 736.71
## - write    1    223.57 7075.2 737.21
## - prog     2    501.88 7353.6 742.92
## - read     1    498.21 7349.9 744.82
## - science  1    555.39 7407.1 746.37
## 
## Step:  AIC=731.16
## math ~ id + race + schtyp + prog + read + write + science + socst
## 
##           Df Sum of Sq    RSS    AIC
## - socst    1     28.48 6893.0 729.99
## - schtyp   1     44.02 6908.6 730.44
## <none>                 6864.5 731.16
## - id       1     82.68 6947.2 731.56
## - race     3    262.00 7126.5 732.65
## + gender   1     12.87 6851.7 732.79
## + ses      2      4.28 6860.3 735.04
## - write    1    216.83 7081.4 735.38
## - prog     2    511.73 7376.3 741.54
## - read     1    530.98 7395.5 744.06
## - science  1    625.46 7490.0 746.60
## 
## Step:  AIC=729.99
## math ~ id + race + schtyp + prog + read + write + science
## 
##           Df Sum of Sq    RSS    AIC
## - schtyp   1     50.96 6944.0 729.46
## <none>                 6893.0 729.99
## - id       1     95.12 6988.1 730.73
## + socst    1     28.48 6864.5 731.16
## - race     3    253.37 7146.4 731.21
## + gender   1     12.73 6880.3 731.62
## + ses      2      9.72 6883.3 733.71
## - write    1    303.61 7196.6 736.61
## - prog     2    569.55 7462.6 741.87
## - science  1    625.17 7518.2 745.35
## - read     1    695.81 7588.8 747.22
## 
## Step:  AIC=729.46
## math ~ id + race + prog + read + write + science
## 
##           Df Sum of Sq    RSS    AIC
## - id       1     45.51 6989.5 728.77
## <none>                 6944.0 729.46
## - race     3    213.13 7157.1 729.51
## + schtyp   1     50.96 6893.0 729.99
## + socst    1     35.41 6908.6 730.44
## + gender   1     17.43 6926.5 730.96
## + ses      2      4.87 6939.1 733.32
## - write    1    288.99 7233.0 735.62
## - prog     2    552.13 7496.1 740.76
## - read     1    656.84 7600.8 745.54
## - science  1    698.22 7642.2 746.62
## 
## Step:  AIC=728.77
## math ~ race + prog + read + write + science
## 
##           Df Sum of Sq    RSS    AIC
## - race     3    173.13 7162.6 727.66
## <none>                 6989.5 728.77
## + id       1     45.51 6944.0 729.46
## + socst    1     41.22 6948.3 729.59
## + gender   1     24.26 6965.2 730.07
## + schtyp   1      1.35 6988.1 730.73
## + ses      2      8.60 6980.9 732.52
## - write    1    290.10 7279.6 734.90
## - prog     2    637.13 7626.6 742.22
## - read     1    612.74 7602.2 743.58
## - science  1    762.98 7752.5 747.49
## 
## Step:  AIC=727.66
## math ~ prog + read + write + science
## 
##           Df Sum of Sq    RSS    AIC
## <none>                 7162.6 727.66
## + race     3    173.13 6989.5 728.77
## + socst    1     22.75 7139.9 729.03
## + gender   1     22.67 7140.0 729.03
## + id       1      5.51 7157.1 729.51
## + schtyp   1      2.91 7159.7 729.58
## + ses      2     13.79 7148.8 731.28
## - write    1    395.93 7558.6 736.42
## - prog     2    615.92 7778.5 740.16
## - read     1    581.00 7743.6 741.26
## - science  1    845.74 8008.4 747.98
## 
## Call:
## lm(formula = math ~ prog + read + write + science, data = hsb)
## 
## Coefficients:
##  (Intercept)   proggeneral  progvocation          read         write  
##      16.5056       -3.7924       -4.1233        0.2401        0.2015  
##      science  
##       0.2863
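
The step() output is long because it prints every table it considers. Setting trace = 0 suppresses the printing, and the returned model keeps a short record of the path taken. A minimal sketch assuming the built-in mtcars data, with an arbitrary set of candidate predictors:

```r
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward selection (the default when no scope is given), without the trace
sel <- step(full, trace = 0)
formula(sel)
sel$anova            # one row per step taken

# Forward selection needs a scope that says which terms may be added
null <- lm(mpg ~ 1, data = mtcars)
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)
formula(fwd)
```

Selection by AIC is automatic, so the caveat about anova() applies here too: it is up to the user to decide whether the selected model makes sense.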