{"id":4792,"date":"2015-08-16T09:33:01","date_gmt":"2015-08-16T14:33:01","guid":{"rendered":"http:\/\/www.ssc.wisc.edu\/~jfrees\/?page_id=4792"},"modified":"2015-08-18T13:29:58","modified_gmt":"2015-08-18T18:29:58","slug":"example-outliers-and-high-leverage-points","status":"publish","type":"page","link":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/regression\/basic-linear-regression\/2-6-building-a-better-model-residual-analysis\/example-outliers-and-high-leverage-points\/","title":{"rendered":"Example: Outliers and High Leverage Points"},"content":{"rendered":"<p>Consider the fictitious data set of 19 points plus three points, labeled A, B, and C, given in Figure 2.6 and Table 2.5. Think of the first 19 points as &#8220;good&#8221; observations that represent some type of phenomenon. We want to investigate the effect of adding a single aberrant point. <\/p>\n<p> \\begin{matrix}<br \/>\n\\begin{array}{c}\\text{Table 2.5 19 Base Points Plus Three Types of Unusual Observations}<br \/>\n\\end{array}\\\\\\scriptsize<br \/>\n\\begin{array}{ccl}<br \/>\n\\hline \\text{Variables} &#038; \\phantom{XXXXXXXXXXXXX}\\text{19 Base Points}\\phantom{XXXXXXXXXXX} &#038; \\phantom{X}A\\phantom{XX} B\\phantom{XX} C\\phantom{X}<br \/>\n\\end{array}\\\\\\scriptsize<br \/>\n \\begin{array}{c|cccccccccc|ccc} \\hline \\phantom{Vari}x\\phantom{Vari}&#038; 1.5 &#038; 1.7 &#038; 2.0 &#038; 2.2 &#038; 2.5 &#038; 2.5 &#038; 2.7 &#038; 2.9 &#038; 3.0 &#038; 3.5 &#038; 4.3 &#038; 9.5 &#038; 9.5 \\\\ y &#038; 3.0 &#038; 2.5 &#038; 3.5 &#038; 3.0 &#038; 3.1 &#038; 3.6 &#038; 3.2 &#038; 3.9 &#038; 4.0 &#038; 4.0 &#038; 8.0 &#038; 8.0 &#038; 2.5 \\\\ \\hline x &#038; 3.8 &#038; 4.2 &#038; 4.3 &#038; 4.6 &#038; 4.0 &#038; 5.1 &#038; 5.1 &#038; 5.2 &#038; 5.5 &#038;  &#038;  &#038;  &#038;  \\\\ y &#038; 4.2 &#038; 4.1 &#038; 4.8 &#038; 4.2 &#038; 5.1 &#038; 5.1 &#038; 5.1 &#038; 4.8 &#038; 5.3 &#038;  &#038;  &#038;  &#038;  \\\\ \\hline \\end{array}<br \/>\n \\end{matrix} <\/p>\n<figure 
class=\"wp-caption aligncenter\" style=\"max-width: 300px;\" aria-label=\"Figure 2.6 Scatterplot of 19 base plus three unusual points, labeled A, B and C.\"><a href=\"http:\/\/www.ssc.wisc.edu\/~jfrees\/wp-content\/uploads\/2015\/04\/F2Outlier.png\"><img decoding=\"async\" loading=\"lazy\" src=\"http:\/\/www.ssc.wisc.edu\/~jfrees\/wp-content\/uploads\/2015\/04\/F2Outlier.png\" alt=\"F2Outlier\" width=\"432\" height=\"288\" class=\"aligncenter size-full wp-image-3261\" srcset=\"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-content\/uploads\/2015\/04\/F2Outlier.png 432w, https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-content\/uploads\/2015\/04\/F2Outlier-300x200.png 300w\" sizes=\"(max-width: 432px) 100vw, 432px\" \/><\/a><figcaption class=\"wp-caption-text\">Figure 2.6 Scatterplot of 19 base plus three unusual points, labeled A, B and C.<\/figcaption><\/figure>\n<h2 style=\"text-align: center;\"><a id=\"displayText2.6f\" href=\"javascript:togglecode('toggleText2.6f','displayText2.6f');\"><i><strong>R Code for Figure 2.6<\/strong><\/i><\/a> <\/h2>\n<div id=\"toggleText2.6f\" style=\"display: none\">\n<pre>\r\n<strong>R-Code<\/strong>\r\npar(mar=c(4.1,3.1,1.1,.1), cex=1.3)\r\nplot(OUTLR$X, OUTLR$Y, xlab=\"x\", ylab=\"\", xlim=c(0, 10), ylim=c(2, 9), las=1)\r\nmtext(\"y\", at=5.5,side=2,las=1,cex=1.3, line=2.3)\r\npoints(4.3, 8.0)\r\ntext(4.7, 8.0, \"A\", cex=1.3)\r\npoints(9.5, 8.0)\r\ntext(9.9, 8.0, \"B\", cex=1.3)\r\npoints(9.5, 2.5)\r\ntext(9.9, 2.5, \"C\", cex=1.3)\r\n<\/pre>\n<\/div>\n<p> To investigate the effect of each type of aberrant point, Table 2.6 summarizes the results of four separate regressions. The first regression is for the nineteen base points. The other three regressions use the nineteen base points plus each type of unusual observation. 
<\/p>\n<p> \\begin{matrix}<br \/>\n\\begin{array}{c}<br \/>\n\\text{Table 2.6 Results from Four Regressions}<br \/>\n\\end{array}\\\\\\scriptsize<br \/>\n\\begin{array}{l|rrrrr} \\hline \\text{Data} &#038; b_0 &#038; b_1 &#038; s &#038; R^2(\\%) &#038; t(b_1) \\\\ \\hline \\text{19 Base Points} &#038; 1.869 &#038; 0.611 &#038; 0.288 &#038; 89.0 &#038; 11.71 \\\\ \\text{19 Base Points} ~+~ A &#038; 1.750 &#038; 0.693 &#038; 0.846 &#038; 53.7 &#038; 4.57 \\\\ \\text{19 Base Points} ~+~ B &#038; 1.775 &#038; 0.640 &#038; 0.285 &#038; 94.7 &#038; 18.01 \\\\ \\text{19 Base Points} ~+~ C &#038; 3.356 &#038; 0.155 &#038; 0.865 &#038; 10.3 &#038; 1.44 \\\\ \\hline \\end{array}<br \/>\n \\end{matrix}<br \/>\n\r\n<h2 style=\"text-align: center;\"><a id=\"displayTextf8.3\" href=\"javascript:togglecode('toggleTextf8.3','displayTextf8.3');\"><i><strong>See R Code in Action<\/strong><\/i><\/a><\/h2><div class=\"sage-r\" id=\"toggleTextf8.3\" style=\"display: block\"><script type=\"text\/x-sage\">\r\nOUTLR <- read.csv(\"http:\/\/instruction.bus.wisc.edu\/jfrees\/jfreesbooks\/Regression%20Modeling\/BookWebDec2010\/CSVData\/OutlierExample.csv\",header=TRUE)\r\nstr(OUTLR)\r\nmodel.outlr0 <- lm(OUTLR$Y ~ OUTLR$X, subset=-c(20,21,22))\r\nsummary(model.outlr0)\r\nmodel.outlrA <- lm(OUTLR$Y ~ OUTLR$X, subset=-c(21,22))\r\nsummary(model.outlrA)\r\nmodel.outlrB <- lm(OUTLR$Y ~ OUTLR$X, subset=-c(20,22))\r\nsummary(model.outlrB)\r\nmodel.outlrC <- lm(OUTLR$Y ~ OUTLR$X, subset=-c(20,21))\r\nsummary(model.outlrC)\r\n<\/script><\/div>\r\n<\/p>\n<h2 style=\"text-align: center;\"><a id=\"displayText2.66t\" href=\"javascript:togglecode('toggleText2.66t','displayText2.66t');\"><i><strong>R Code and Output for Table 2.6<\/strong><\/i><\/a> <\/h2>\n<div id=\"toggleText2.66t\" style=\"display: none\">\n<pre>\r\n<strong>R-Code<\/strong>\r\nOUTLR &lt;- 
read.csv(\"http:\/\/instruction.bus.wisc.edu\/jfrees\/jfreesbooks\/Regression%20Modeling\/BookWebDec2010\/CSVData\/OutlierExample.csv\",header=TRUE)\r\nstr(OUTLR)\r\nmodel.outlr0 &lt;- lm(OUTLR$Y ~ OUTLR$X, subset=-c(20,21,22))\r\nsummary(model.outlr0)\r\nmodel.outlrA &lt;- lm(OUTLR$Y ~ OUTLR$X, subset=-c(21,22))\r\nsummary(model.outlrA)\r\nmodel.outlrB &lt;- lm(OUTLR$Y ~ OUTLR$X, subset=-c(20,22))\r\nsummary(model.outlrB)\r\nmodel.outlrC &lt;- lm(OUTLR$Y ~ OUTLR$X, subset=-c(20,21))\r\nsummary(model.outlrC)\r\n<\/pre>\n<pre>\r\n<strong>R-Code Output<\/strong>\r\n> OUTLR &lt;- read.csv(\"http:\/\/instruction.bus.wisc.edu\/jfrees\/jfreesbooks\/Regression%20Modeling\/BookWebDec2010\/CSVData\/OutlierExample.csv\",header=TRUE)\r\n> str(OUTLR)\r\n'data.frame':\t22 obs. of  3 variables:\r\n $ X    : num  1.5 1.7 2 2.2 2.5 ...\r\n $ Y    : num  3 2.5 3.5 3 3.1 ...\r\n $ CODES: int  0 0 0 0 0 0 0 0 0 0 ...\r\n> model.outlr0 <- lm(OUTLR$Y ~ OUTLR$X, subset=-c(20,21,22))\r\n> summary(model.outlr0)\r\n\r\nCall:\r\nlm(formula = OUTLR$Y ~ OUTLR$X, subset = -c(20, 21, 22))\r\n\r\nResiduals:\r\n    Min      1Q  Median      3Q     Max \r\n-0.4791 -0.2709  0.0711  0.2263  0.4094 \r\n\r\nCoefficients:\r\n            Estimate Std. Error t value Pr(>|t|)    \r\n(Intercept)   1.8687     0.1958    9.54  3.1e-08 ***\r\nOUTLR$X       0.6109     0.0522   11.71  1.5e-09 ***\r\n---\r\nSignif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\r\n\r\nResidual standard error: 0.288 on 17 degrees of freedom\r\nMultiple R-squared:  0.89,\tAdjusted R-squared:  0.883 \r\nF-statistic:  137 on 1 and 17 DF,  p-value: 1.47e-09\r\n\r\n> model.outlrA <- lm(OUTLR$Y ~ OUTLR$X, subset=-c(21,22))\r\n> summary(model.outlrA)\r\n\r\nCall:\r\nlm(formula = OUTLR$Y ~ OUTLR$X, subset = -c(21, 22))\r\n\r\nResiduals:\r\n   Min     1Q Median     3Q    Max \r\n-0.739 -0.393 -0.180  0.122  3.269 \r\n\r\nCoefficients:\r\n            Estimate Std. 
Error t value Pr(>|t|)    \r\n(Intercept)    1.750      0.574    3.05  0.00688 ** \r\nOUTLR$X        0.693      0.152    4.57  0.00024 ***\r\n---\r\nSignif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\r\n\r\nResidual standard error: 0.845 on 18 degrees of freedom\r\nMultiple R-squared:  0.537,\tAdjusted R-squared:  0.511 \r\nF-statistic: 20.9 on 1 and 18 DF,  p-value: 0.000237\r\n\r\n> model.outlrB <- lm(OUTLR$Y ~ OUTLR$X, subset=-c(20,22))\r\n> summary(model.outlrB)\r\n\r\nCall:\r\nlm(formula = OUTLR$Y ~ OUTLR$X, subset = -c(20, 22))\r\n\r\nResiduals:\r\n    Min      1Q  Median      3Q     Max \r\n-0.5176 -0.2809  0.0345  0.2359  0.4458 \r\n\r\nCoefficients:\r\n            Estimate Std. Error t value Pr(>|t|)    \r\n(Intercept)   1.7746     0.1502    11.8  6.5e-10 ***\r\nOUTLR$X       0.6398     0.0355    18.0  5.8e-13 ***\r\n---\r\nSignif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\r\n\r\nResidual standard error: 0.285 on 18 degrees of freedom\r\nMultiple R-squared:  0.947,\tAdjusted R-squared:  0.945 \r\nF-statistic:  325 on 1 and 18 DF,  p-value: 5.81e-13\r\n\r\n> model.outlrC <- lm(OUTLR$Y ~ OUTLR$X, subset=-c(20,21))\r\n> summary(model.outlrC)\r\n\r\nCall:\r\nlm(formula = OUTLR$Y ~ OUTLR$X, subset = -c(20, 21))\r\n\r\nResiduals:\r\n    Min      1Q  Median      3Q     Max \r\n-2.3295 -0.5782  0.0977  0.6724  1.0910 \r\n\r\nCoefficients:\r\n            Estimate Std. Error t value Pr(>|t|)    \r\n(Intercept)    3.356      0.456    7.36  7.9e-07 ***\r\nOUTLR$X        0.155      0.108    1.44     0.17    \r\n---\r\nSignif. 
codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\r\n\r\nResidual standard error: 0.865 on 18 degrees of freedom\r\nMultiple R-squared:  0.103,\tAdjusted R-squared:  0.0533 \r\nF-statistic: 2.07 on 1 and 18 DF,  p-value: 0.167\r\n\r\n<\/pre>\n<\/div>\n<p> Table 2.6 shows that a regression line provides a good fit for the nineteen base points. The coefficient of determination, \\(R^2\\), indicates that about 89% of the variability is explained by the line. The size of the typical error, <em>s<\/em>, is about 0.29, small compared to the scatter in the <em>y<\/em>-values. Further, the <em>t<\/em>-ratio for the slope coefficient, 11.71, is large. <\/p>\n<p> When the outlier point A is added to the nineteen base points, the situation deteriorates dramatically. The \\(R^2\\) drops from 89% to 53.7% and <em>s<\/em> increases from about 0.29 to about 0.85. The fitted regression line itself changes only modestly (the slope moves from 0.611 to 0.693), but our confidence in the estimates has decreased: the <em>t<\/em>-ratio for the slope falls from 11.71 to 4.57. <\/p>\n<p> An outlier is unusual in the <em>y<\/em>-value, but &#8220;unusual in the <em>y<\/em>-value&#8221; depends on the <em>x<\/em>-value. To see this, keep the <em>y<\/em>-value of Point A the same, but increase the <em>x<\/em>-value and call the point B. <\/p>\n<p> When the point B is added to the nineteen base points, the regression line provides a <em>better<\/em> fit. Point B is close to being on the line of the regression fit generated by the nineteen base points. Thus, the fitted regression line and the size of the typical error, <em>s<\/em>, do not change much. However, \\(R^2\\) increases from 89% to nearly 95%. If we think of \\( R^2\\) as \\(1-(Error~SS)\/(Total~SS)\\), by adding point B we have increased \\( Total~SS\\), the total squared deviations in the <em>y<\/em>&#8217;s, while leaving \\( Error~SS\\) relatively unchanged. Point B is not an outlier, but it is a high leverage point. 
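<\/p>\n<p> The arithmetic behind this increase in \\(R^2\\) can be sketched from the rounded regression output, assuming the usual identity \\(Error~SS = (n-2)s^2\\) for a regression with one explanatory variable. That identity and the derived figures are not part of the original tables, and they are only as accurate as the rounding of <em>s<\/em> and \\(R^2\\) allows. <\/p>\n<p> \\begin{matrix}<br \/>\n\\begin{array}{l|rrr} \\hline \\text{Data} &#038; Error~SS = (n-2)s^2 &#038; 1-R^2 &#038; Total~SS = Error~SS\/(1-R^2) \\\\ \\hline \\text{19 Base Points} &#038; 17(0.288)^2 \\approx 1.41 &#038; 0.110 &#038; \\approx 12.8 \\\\ \\text{19 Base Points} ~+~ B &#038; 18(0.285)^2 \\approx 1.46 &#038; 0.053 &#038; \\approx 27.6 \\\\ \\hline \\end{array}<br \/>\n \\end{matrix} <\/p>\n<p> On these approximate figures, \\(Error~SS\\) is nearly unchanged while \\(Total~SS\\) roughly doubles, so the ratio \\((Error~SS)\/(Total~SS)\\) is about halved and \\(R^2\\) rises accordingly. 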
<\/p>\n<p> To show how influential a high leverage point can be, keep the <em>x<\/em>-value of point B, drop the <em>y<\/em>-value considerably, and call this the new point C. When this point is added to the nineteen base points, the situation deteriorates dramatically. The \\(R^2\\) coefficient drops from 89% to about 10%, and <em>s<\/em> roughly triples, from about 0.29 to about 0.87. Further, the regression line coefficients change dramatically: the slope falls from 0.611 to 0.155 and the intercept rises from 1.869 to 3.356. <\/p>\n<p> Most users of regression at first do not believe that one point in twenty can have such a dramatic effect on the regression fit. The fit of a regression line can always be improved by removing an outlier. If the point is a high leverage point and not an outlier, it is not clear whether the fit will be improved when the point is removed. <\/p>\n<p> The fact that you can dramatically improve a regression fit by omitting an observation does not mean that you should always do so! The goal of data analysis is to understand the information in the data. Throughout the text, we will encounter many data sets where the unusual points provide some of the most interesting information about the data. The goal of this subsection is to recognize the effects of unusual points; Chapter 5 will provide <a href=\"http:\/\/www.ssc.wisc.edu\/~jfrees\/?p=3527\">options for handling unusual points<\/a> in your analysis. <\/p>\n<p> All quantitative disciplines, such as accounting, economics, linear programming, and so on, practice the art of <em>sensitivity analysis<\/em>. Sensitivity analysis is a description of the global changes in a system due to a small local change in an element of the system. Examining the effects of individual observations on the regression fit is a type of sensitivity analysis. 
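<\/p>\n<p> As a rough illustration of this kind of sensitivity analysis, the coefficient estimates in Table 2.6 can be compared with those from the nineteen base points, \\(b_0 = 1.869\\) and \\(b_1 = 0.611\\). The differences and percentage changes below are computed here from the rounded table entries, so they are approximate; they are an added summary and do not appear in the original tables. <\/p>\n<p> \\begin{matrix}<br \/>\n\\begin{array}{l|rrrr} \\hline \\text{Added point} &#038; b_0 - 1.869 &#038; \\text{Change in } b_0 (\\%) &#038; b_1 - 0.611 &#038; \\text{Change in } b_1 (\\%) \\\\ \\hline A &#038; -0.119 &#038; -6.4 &#038; 0.082 &#038; 13.4 \\\\ B &#038; -0.094 &#038; -5.0 &#038; 0.029 &#038; 4.7 \\\\ C &#038; 1.487 &#038; 79.6 &#038; -0.456 &#038; -74.6 \\\\ \\hline \\end{array}<br \/>\n \\end{matrix} <\/p>\n<p> Read this way, adding A or B moves each coefficient by less than 15%, whereas adding C, a single point in twenty, changes both the intercept and the slope by roughly three quarters of their base values. 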
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Consider the fictitious data set of 19 points plus three points, labeled A, B, and C, given in Figure 2.6 and Table 2.5. Think of the first 19 points as &#8220;good&#8221; observations that represent some &hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":3378,"menu_order":2,"comment_status":"closed","ping_status":"open","template":"","meta":{"jetpack_post_was_ever_published":false},"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/P8cLPd-1fi","acf":[],"_links":{"self":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/4792"}],"collection":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/comments?post=4792"}],"version-history":[{"count":4,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/4792\/revisions"}],"predecessor-version":[{"id":4907,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/4792\/revisions\/4907"}],"up":[{"embeddable":true,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/3378"}],"wp:attachment":[{"href":"http
s:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/media?parent=4792"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}