{"id":3245,"date":"2015-04-11T22:01:39","date_gmt":"2015-04-12T03:01:39","guid":{"rendered":"http:\/\/www.ssc.wisc.edu\/~jfrees\/?page_id=3245"},"modified":"2023-06-09T13:48:02","modified_gmt":"2023-06-09T18:48:02","slug":"2-1-correlations-and-least-squares","status":"publish","type":"page","link":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/regression\/basic-linear-regression\/2-1-correlations-and-least-squares\/","title":{"rendered":"2.1 Correlations and Least Squares"},"content":{"rendered":"<div class=\"scbb-content-box scbb-content-box-gray\">In this section, you learn how to: \n<ul>\n<li>Calculate and interpret a correlation coefficient<\/li>\n<li>Interpret correlation coefficients by visualizing related scatter plots<\/li>\n<li>Fit a line to data using the method of least squares<\/li>\n<li>Predict an observation using a least squares fitted line<\/li>\n<\/ul>\n<h2 style=\"text-align: center\"><a href=\"http:\/\/flash.bus.wisc.edu\/data\/act_sci\/Frees\/Regression2015\/Chapter2\/Part1\/CorrlnLeastSquares.html\" target=\"_blank\" rel=\"noopener\">Video Overview of the Section <\/a><a href=\"http:\/\/flash.bus.wisc.edu\/data\/act_sci\/Frees\/Regression2015\/Chapter2\/Part1\/CorrlnLeastSquares.mp4\" target=\"_blank\" rel=\"noopener\">(<em>Alternative .mp4 Version &#8211; 13:59 min<\/em>)<\/a><\/h2>\n<p><\/p><\/div>\n<p>Regression is about relationships. Specifically, we will study how two variables, an <em>x<\/em> and a <em>y<\/em>, are related. We want to be able to answer questions such as, if we change the level of <em>x<\/em>, what will happen to the level of <em>y<\/em>? If we compare two &#8220;subjects&#8221; that appear similar except for the <em>x<\/em> measurement, how will their <em>y<\/em> measurements differ? Understanding relationships among variables is critical for quantitative management, particularly in actuarial science where uncertainty is so prevalent. 
<\/p>\n<p> It is helpful to work with a specific example to become familiar with key concepts. Analysis of lottery sales has not been part of traditional actuarial practice, but it is a growth area in which actuaries could contribute. <\/p>\n<p> <strong>Example: Wisconsin Lottery Sales.<\/strong> State of Wisconsin lottery administrators are interested in assessing factors that affect lottery sales. Sales consist of online lottery tickets that are sold by selected retail establishments in Wisconsin. These tickets are generally priced at $1.00, so the number of tickets sold equals the lottery revenue. We analyze average lottery sales (<em>SALES<\/em>) over a forty-week period, April 1998 through January 1999, from fifty randomly selected areas identified by postal (ZIP) code within the state of Wisconsin. <\/p>\n<p> Although many economic and demographic variables might influence sales, our first analysis focuses on population (<em>POP<\/em>) as a key determinant. Chapter 3 will show how to consider additional explanatory variables. Intuitively, it seems clear that geographic areas with more people will have higher sales. So, other things being equal, a larger <em>x=POP<\/em> means a larger <em>y=SALES<\/em>. However, the lottery is an important source of revenue for the state, and we want to be as precise as possible. <\/p>\n<p>A little additional notation will be useful subsequently. In this sample, there are fifty geographic areas and we use subscripts to identify each area. For example, \\(y_1\\) = 1,285.4 represents sales for the first area in the sample, which has population \\(x_1\\) = 435. Call the ordered pair (\\(x_1\\), \\(y_1\\)) = (435, 1285.4) the first <em>observation<\/em>. Extending this notation, the entire sample containing fifty observations may be represented by (\\(x_1\\), \\(y_1\\)), &#8230;, (\\(x_{50}\\), \\(y_{50}\\)). The ellipsis ( &#8230; ) means that the pattern is continued until the final object is encountered. 
We will often speak of a generic member of the sample, referring to (\\(x_i\\), \\(y_i\\)) as the \\(i\\)th observation. <\/p>\n<p> Data sets can get complicated, so it will help if you begin by working with each variable separately. The two panels in Figure 2.1 show histograms that give a quick visual impression of the distribution of each variable in isolation from the other. Table 2.1 provides corresponding numerical summaries. To illustrate, for the population variable (<em>POP<\/em>), we see that the least populous area contained 280 people, whereas the most populous contained 39,098. The average, over 50 ZIP codes, was 9,311.04. For our second variable, sales were as low as $189 and as high as $33,181. <\/p>\n<figure id=\"attachment_269\" class=\"wp-caption aligncenter\" style=\"max-width: 300px;\" aria-label=\"Figure 2.1 Histograms of Population and Sales.     Each distribution is skewed to the right, indicating that there are many small areas compared to a few areas with larger sales and populations.\"><a href=\"http:\/\/www.ssc.wisc.edu\/~jfrees\/wp-content\/uploads\/2015\/04\/F2HistPopSales.png\"><img decoding=\"async\" loading=\"lazy\" src=\"http:\/\/www.ssc.wisc.edu\/~jfrees\/wp-content\/uploads\/2015\/04\/F2HistPopSales.png\" alt=\"F2HistPopSales\" width=\"432\" height=\"288\" class=\"aligncenter size-full wp-image-3258\" srcset=\"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-content\/uploads\/2015\/04\/F2HistPopSales.png 432w, https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-content\/uploads\/2015\/04\/F2HistPopSales-300x200.png 300w\" sizes=\"(max-width: 432px) 100vw, 432px\" \/><\/a><figcaption class=\"wp-caption-text\">Figure 2.1 Histograms of Population and Sales.     
Each distribution is skewed to the right, indicating that there are many small areas compared to a few areas with larger sales and populations.<\/figcaption><\/figure>\n<h2 style=\"text-align: center;\"><a id=\"displayText2.1f\" href=\"javascript:togglecode('toggleText2.1f','displayText2.1f');\"><i><strong>R Code for Figure 2.1<\/strong><\/i><\/a> <\/h2>\n<div id=\"toggleText2.1f\" style=\"display: none\">\n<pre>\r\n<strong>R-Code<\/strong>\r\nLot &lt;- read.csv(\"http:\/\/instruction.bus.wisc.edu\/jfrees\/jfreesbooks\/Regression%20Modeling\/BookWebDec2010\/CSVData\/WiscLottery.csv\",header=TRUE)\r\nattach(Lot)\r\nnames(Lot)\r\n\r\npar(mfrow=c(1, 2), cex=1.3, mar=c(4.1,3.1,1.2,1))\r\nhist(POP, main=\"\", ylab=\"\", las=1)\r\nmtext(\"Frequency\", side=2, at=30, las=1, cex=1.3, adj=.6)\r\nhist(SALES, main=\"\", ylab=\"\", las=1)\r\nmtext(\"Frequency\", side=2, at=34, las=1, cex=1.3, adj=.6)\r\n<\/pre>\n<\/div>\n<div class=\"scbb-content-box scbb-content-box-gray\">\\begin{matrix}<br \/>\n\\begin{array}{c}<br \/>\n\\text{Table 2.1 Summary Statistics of Each Variable}<br \/>\n\\end{array}\\\\\\small<br \/>\n \\begin{array}{lrrrrr} \\hline &#038;  &#038;  &#038; \\text{Standard} &#038;  &#038;  \\\\ \\text{Variable} &#038; \\text{Mean} &#038; \\text{Median} &#038; \\text{Deviation} &#038; \\text{Minimum} &#038; \\text{Maximum} \\\\ \\hline \\text{POP} &#038; 9,311 &#038; 4,406 &#038; 11,098 &#038; 280 &#038; 39,098 \\\\ \\text{SALES} &#038; 6,495 &#038; 2,426 &#038; 8,103 &#038; 189 &#038; 33,181 \\\\ \\hline \\end{array}\\\\\\scriptsize<br \/>\n\\begin{array}{c}<br \/>\n Source: Frees\\ and\\ Miller\\ (2003).<br \/>\n\\end{array}\\end{matrix} <\/div>\n\r\n<h2 style=\"text-align: center;\"><a id=\"displayTextf8.3\" href=\"javascript:togglecode('toggleTextf8.3','displayTextf8.3');\"><i><strong>See R Code in Action<\/strong><\/i><\/a><\/h2><div class=\"sage-r\" id=\"toggleTextf8.3\" style=\"display: block\"><script type=\"text\/x-sage\">\r\nLot <- 
read.csv(\"http:\/\/instruction.bus.wisc.edu\/jfrees\/jfreesbooks\/Regression%20Modeling\/BookWebDec2010\/CSVData\/WiscLottery.csv\",header=TRUE)\r\nattach(Lot)\r\noptions(digits=5)\r\nXymat <- data.frame(cbind(POP,SALES))   \r\nmeanSummary <- sapply(Xymat, mean,  na.rm=TRUE) \r\nsdSummary   <- sapply(Xymat, sd,    na.rm=TRUE) \r\nminSummary  <- sapply(Xymat, min,   na.rm=TRUE) \r\nmaxSummary  <- sapply(Xymat, max,   na.rm=TRUE) \r\nmedSummary  <- sapply(Xymat, median,na.rm=TRUE) \r\nsummvar <- cbind(meanSummary, medSummary, sdSummary, minSummary, maxSummary)\r\nsummvar\r\n<\/script><\/div>\r\n\n<h2 style=\"text-align: center;\"><a id=\"displayText2.11t\" href=\"javascript:togglecode('toggleText2.11t','displayText2.11t');\"><i><strong>R Code for Table 2.1<\/strong><\/i><\/a> <\/h2>\n<div id=\"toggleText2.11t\" style=\"display: none\">\n<pre>\r\n<strong>R-Code<\/strong>\r\noptions(digits=5)\r\nXymat &lt;- data.frame(cbind(POP,SALES))   \r\nmeanSummary &lt;- sapply(Xymat, mean,  na.rm=TRUE) \r\nsdSummary   &lt;- sapply(Xymat, sd,    na.rm=TRUE) \r\nminSummary  &lt;- sapply(Xymat, min,   na.rm=TRUE) \r\nmaxSummary  &lt;- sapply(Xymat, max,   na.rm=TRUE) \r\nmedSummary  &lt;- sapply(Xymat, median,na.rm=TRUE) \r\nsummvar &lt;- cbind(meanSummary, medSummary, sdSummary, minSummary, maxSummary)\r\nsummvar\r\n<\/pre>\n<pre>\r\n<strong>R-Code Output<\/strong>\r\n> Lot &lt;- read.csv(\"http:\/\/instruction.bus.wisc.edu\/jfrees\/jfreesbooks\/Regression%20Modeling\/BookWebDec2010\/CSVData\/WiscLottery.csv\",header=TRUE)\r\n> attach(Lot)\r\n> names(Lot)\r\n [1] \"ZIP\"      \"PERPERHH\" \"MEDSCHYR\" \"MEDHVL\"   \"PRCRENT\"  \"PRC55P\"   \"HHMEDAGE\"\r\n [8] \"MEDINC\"   \"SALES\"    \"POP\"     \r\n> options(digits=5)\r\n> Xymat &lt;- data.frame(cbind(POP,SALES))\r\n> meanSummary &lt;- sapply(Xymat, mean,  na.rm=TRUE)\r\n> sdSummary   &lt;- sapply(Xymat, sd,    na.rm=TRUE)\r\n> minSummary  &lt;- sapply(Xymat, min,   na.rm=TRUE)\r\n> maxSummary  &lt;- sapply(Xymat, max,   
na.rm=TRUE)\r\n> medSummary  &lt;- sapply(Xymat, median,na.rm=TRUE)\r\n> summvar &lt;- cbind(meanSummary, medSummary, sdSummary, minSummary, maxSummary)\r\n> summvar\r\n      meanSummary medSummary sdSummary minSummary maxSummary\r\nPOP        9311.0     4405.5     11098        280      39098\r\nSALES      6494.8     2426.4      8103        189      33181\r\n<\/pre>\n<\/div>\n<p> As Table 2.1 shows, basic summary statistics give a useful picture of key features of the data. Once we understand the information in each variable in isolation from the other, we can begin exploring the relationship between the two variables. <\/p>\n<p><div class=\"alignleft\"><a href=\"https:\/\/users.ssc.wisc.edu\/~ewfrees\/regression\/basic-linear-regression\/\" title=\"Chapter 2. Basic Linear Regression\">&#9668; Previous page<\/a><\/div><div class=\"alignright\"><a href=\"https:\/\/users.ssc.wisc.edu\/~ewfrees\/regression\/basic-linear-regression\/2-1-correlations-and-least-squares\/scatter-plot-and-correlation-coefficients-basic-summary-tools\/\" title=\"Scatter Plot and Correlation Coefficients &#8211; Basic Summary Tools\">Next page &#9658;<\/a><\/div><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this section, you learn how to: Calculate and interpret a correlation coefficient Interpret correlation coefficients by visualizing related scatter plots Fit a line to data using the method of least squares Predict an observation using a least squares fitted line Video Overview of the Section (Alternative .mp4 Version &#8211; 13:59 min) Regression is 
about&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":3243,"menu_order":1,"comment_status":"closed","ping_status":"open","template":"","meta":{"jetpack_post_was_ever_published":false},"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/P8cLPd-Ql","acf":[],"_links":{"self":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/3245"}],"collection":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/comments?post=3245"}],"version-history":[{"count":29,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/3245\/revisions"}],"predecessor-version":[{"id":6517,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/3245\/revisions\/6517"}],"up":[{"embeddable":true,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/3243"}],"wp:attachment":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/media?parent=3245"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}