Example: Lottery Sales — Continued – Actuarial Science and Analytics Resources

Figure 2.7 exhibits an outlier; the point in the upper left-hand side of the plot represents a zip code that includes Kenosha, Wisconsin. Sales for this zip code are unusually high given its population. Kenosha is close to the Illinois border; residents of Illinois probably participate in the Wisconsin lottery, effectively increasing the potential pool of sales in Kenosha. Table 2.7 summarizes the regression fit both with and without this zip code.

\begin{matrix}
\begin{array}{c}
\text{Table 2.7 Regression Results with and without Kenosha}
\end{array}\\ \small
\begin{array}{l|rrrrr} \hline \text{Data} & b_0 & b_1 & s & R^2(\%) & t(b_1) \\ \hline \text{With Kenosha} & 469.7 & 0.647 & 3,792 & 78.5 & 13.26 \\ \text{Without Kenosha} & -43.5 & 0.662 & 2,728 & 88.3 & 18.82 \\ \hline \end{array}
\end{matrix}

R Code and Output for Table 2.7

R-Code
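# Fit SALES on POP using all 50 zip codes (assumes the data frame Lot is loaded)
model.basiclinearreg <- lm(SALES ~ POP, Lot)
summary(model.basiclinearreg)
# Refit after dropping row 9, the zip code that includes Kenosha
model.Kenosha <- lm(SALES ~ POP, Lot, subset=-c(9))
summary(model.Kenosha)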

R-Code Output
> model.basiclinearreg <-lm(SALES ~ POP, Lot)
> summary(model.basiclinearreg)

Call:
lm(formula = SALES ~ POP, data = Lot)

Residuals:
   Min     1Q Median     3Q    Max 
 -6047  -1461   -670    486  18229 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 469.7036   702.9062    0.67     0.51    
POP           0.6471     0.0488   13.26   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3790 on 48 degrees of freedom
Multiple R-squared:  0.785,	Adjusted R-squared:  0.781 
F-statistic:  176 on 1 and 48 DF,  p-value: <2e-16

> model.Kenosha<-lm(SALES ~ POP, Lot, subset=-c(9))
> summary(model.Kenosha)

Call:
lm(formula = SALES ~ POP, data = Lot, subset = -c(9))

Residuals:
   Min     1Q Median     3Q    Max 
 -6089  -1001   -193    816   7878 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -43.4640   511.2931   -0.09     0.93    
POP           0.6621     0.0352   18.82   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2730 on 47 degrees of freedom
Multiple R-squared:  0.883,	Adjusted R-squared:  0.88 
F-statistic:  354 on 1 and 47 DF,  p-value: <2e-16
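
The `subset=-c(9)` argument in the second call drops the ninth row before fitting. As a minimal sketch of that behavior, the block below uses a small made-up data frame (`toy` and its values are hypothetical, not the lottery data) and compares the fit with one that removes the row by hand.

```r
# Hypothetical toy data; row 5 is deliberately far from the trend
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2.1, 3.9, 6.2, 8.1, 30, 12.2))

# A negative index in subset excludes that row from the fit
fit_subset <- lm(y ~ x, data = toy, subset = -5)

# Equivalent fit after removing the row from the data frame directly
fit_manual <- lm(y ~ x, data = toy[-5, ])

# The two approaches should give the same coefficients and use 5 rows
print(all.equal(coef(fit_subset), coef(fit_manual)))
print(nobs(fit_subset))
```

Using `subset` leaves the original data frame untouched, which makes it easy to refit with and without a suspect observation.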

Figure 2.7 Scatter plot of SALES versus POP, with the outlier corresponding to Kenosha marked.

R Code for Figure 2.7

R-Code
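# Scatter plot of SALES versus POP; the formula interface avoids needing attach(Lot)
par(mar=c(4.1,3.9,2,1),cex=1.1)
plot(SALES ~ POP, data=Lot, ylab="", las=1)
# Place the y-axis label horizontally beside the axis
mtext("SALES", side=2, at=36000,cex=1.1, las=1)
# Label the outlying point (coordinates chosen by hand)
text(5000, 24000, "Kenosha")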

For the purposes of inference about the slope, the presence of Kenosha does not alter the results dramatically. Both slope estimates are qualitatively similar (0.647 with Kenosha and 0.662 without), and the corresponding \(t\)-statistics (13.26 and 18.82) are very high, well above cut-offs for statistical significance. However, there are dramatic differences when assessing the quality of the fit. The coefficient of determination, \(R^2\), increased from 78.5% to 88.3% when deleting Kenosha. Moreover, our "typical deviation" \(s\) dropped by over $1,000, from $3,792 to $2,728. This is particularly important if we wish to tighten our prediction intervals.
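
The comparisons in this paragraph can be checked with a few lines of arithmetic. The sketch below copies rounded values from Table 2.7 and the R output rather than recomputing them from the data, so results may differ from the table in the last digit. The half-width calculation is only a rough lower bound, assuming a 95% level and ignoring the leverage term in the prediction interval formula.

```r
# Values copied from Table 2.7 and the R output (not recomputed from the data)
b1    <- c(with = 0.6471, without = 0.6621)  # slope estimates
se_b1 <- c(with = 0.0488, without = 0.0352)  # standard errors of the slopes
s     <- c(with = 3792,   without = 2728)    # residual standard deviations
df    <- c(with = 48,     without = 47)      # residual degrees of freedom

# t-ratios for the slope, compared with two-sided 5% critical values
t_ratio <- b1 / se_b1
t_crit  <- qt(0.975, df)
print(round(rbind(t_ratio, t_crit), 2))

# Drop in the "typical deviation" s after removing Kenosha
s_drop <- unname(s["with"] - s["without"])
print(s_drop)

# Rough lower bound on the 95% prediction interval half-width: t * s.
# The full formula multiplies by sqrt(1 + 1/n + leverage term), which is >= 1.
half_width <- t_crit * s
print(round(half_width))
```

Because the half-width scales with \(s\), the reduction in \(s\) carries over almost directly to narrower prediction intervals, while both \(t\)-ratios stay far above the critical values of roughly 2.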

To check the accuracy of our assumptions, it is also customary to examine the normality of the residuals. One way of doing this is the \(qq\) plot, introduced in Section 1.2. The two panels in Figure 2.8 are \(qq\) plots with and without the Kenosha zip code. Recall that points "close" to linear indicate approximate normality. In the right-hand panel of Figure 2.8, the sequence does appear to be linear, so the residuals are approximately normally distributed. This is not the case in the left-hand panel, where the sequence of points appears to climb dramatically for large quantiles. The interesting thing is that the non-normality of the distribution is due to a single outlier, not a pattern of skewness that is common to all the observations.
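
One way to put a number on "close to linear" is the correlation between the theoretical and sample quantiles of a \(qq\) plot. The sketch below illustrates the idea on simulated data only: the variables `pop` and `sales`, the coefficient values, and the planted outlier in row 9 are all assumptions chosen to loosely resemble this example, not the Wisconsin lottery data.

```r
# Simulated data (hypothetical): a linear trend with normal errors
set.seed(2024)
n     <- 50
pop   <- runif(n, 1000, 40000)
sales <- 0.65 * pop + rnorm(n, sd = 2700)
sales[9] <- sales[9] + 18000   # plant a single large outlier in row 9

# Fit with all points and with the planted outlier removed
fit_all  <- lm(sales ~ pop)
fit_drop <- lm(sales ~ pop, subset = -9)

# Correlation between theoretical and sample quantiles of the residuals;
# values near 1 correspond to a nearly straight qq plot
qq_cor <- function(fit) {
  qq <- qqnorm(residuals(fit), plot.it = FALSE)
  cor(qq$x, qq$y)
}
print(c(all = qq_cor(fit_all), dropped = qq_cor(fit_drop)))
```

In this simulated setting the correlation should be noticeably closer to 1 once the single outlier is removed, mirroring the contrast between the two panels.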

Figure 2.8 \(qq\) Plots of Wisconsin Lottery Residuals. The left-hand panel is based on all 50 points. The right-hand panel is based on 49 points, residuals from a regression after removing Kenosha.

R Code for Figure 2.8

R-Code

# Two panels side by side: all 50 points (left), Kenosha removed (right)
par(mfrow=c(1, 2), mar=c(4.1,3.9,1.7,1),cex=1.1)
qqnorm(residuals(model.basiclinearreg), main="", ylab="", las=1)
mtext("Sample Quantiles", side=2,at=20500,las=1,cex=1.1, adj=.5)
qqnorm(residuals(model.Kenosha), main="", ylab="", las=1)
mtext("Sample Quantiles", side=2,at=9050,las=1,cex=1.1, adj=.5)
