{"id":3443,"date":"2015-04-12T02:05:31","date_gmt":"2015-04-12T07:05:31","guid":{"rendered":"http:\/\/www.ssc.wisc.edu\/~jfrees\/?page_id=3443"},"modified":"2015-08-21T13:45:15","modified_gmt":"2015-08-21T18:45:15","slug":"4-1-the-role-of-binary-variables","status":"publish","type":"page","link":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/regression\/chapter-4-multiple-linear-regression-ii\/4-1-the-role-of-binary-variables\/","title":{"rendered":"4.1 The Role of Binary Variables"},"content":{"rendered":"<div class=\"scbb-content-box scbb-content-box-gray\">In this section, you learn how to: \n<ul>\n<li>Represent categorical variables using a set of binary variables<\/li>\n<li>Interpret the regression coefficients associated with categorical variables<\/li>\n<li>Describe the effect of the reference level choice on the model fit<\/li>\n<\/ul>\n<h2 style=\"text-align: center\"><a href=\"http:\/\/flash.bus.wisc.edu\/data\/act_sci\/Frees\/Regression2015\/Chapter4\/Part1\/BinaryVariables.html\" target=\"_blank\">Video Overview of the Section <\/a><a href=\"http:\/\/flash.bus.wisc.edu\/data\/act_sci\/Frees\/Regression2015\/Chapter4\/Part1\/BinaryVariables.mp4\" target=\"_blank\">(<em>Alternative .mp4 Version &#8211; 11:50 min<\/em>)<\/a><\/h2>\n<p><\/p><\/div>\n<p><em>Categorical variables<\/em> provide labels for observations to denote membership in distinct groups, or categories. A binary variable is a special case of a categorical variable. To illustrate, a binary variable may tell us whether or not someone has health insurance. A categorical variable could tell us whether someone has <\/p>\n<ul>\n<li> private group insurance (offered by employers and associations), <\/li>\n<li> private individual health insurance (through insurance companies), <\/li>\n<li> public insurance (such as Medicare or Medicaid) or <\/li>\n<li> no health insurance. <\/li>\n<\/ul>\n<p>For categorical variables, there may or may not be an ordering of the groups. 
In health insurance, it is difficult to order these four categories and say which is &#8220;larger,&#8221; private group, private individual, public or no health insurance. In contrast, for education, we might group individuals into &#8220;low,&#8221; &#8220;intermediate&#8221; and &#8220;high&#8221; years of education. In this case, there is an ordering among groups based on level of educational achievement. As we will see, this ordering may or may not provide information about the dependent variable. <em>Factor<\/em> is another term used for an unordered categorical explanatory variable. <\/p>\n<p>  For ordered categorical variables, analysts typically assign a numerical score to each outcome and treat the variable as if it were continuous. For example, if we had three levels of education, we might employ ranks and use \\begin{equation*} EDUCATION = \\left\\{ \\begin{array}{cl}         1           &#038; \\textrm{for low education} \\\\         2           &#038; \\textrm{for intermediate education} \\\\         3           &#038; \\textrm{for high education.} \\\\ \\end{array} \\right. \\end{equation*} An alternative would be to use a numerical score that approximates an underlying value of the category. For example, we might use \\begin{equation*} EDUCATION = \\left\\{ \\begin{array}{cl}         6           &#038; \\textrm{for low education} \\\\         10           &#038; \\textrm{for intermediate education} \\\\         14           &#038; \\textrm{for high education.} \\\\ \\end{array} \\right. \\end{equation*} This gives the approximate number of years of schooling that individuals in each category completed. <\/p>\n<p> The assignment of numerical scores and treating the variable as continuous has important implications for the regression modeling interpretation. 
Recall that the regression coefficient is the marginal change in the expected response; in this case, the \\(\\beta\\) for education assesses the increase in E \\(y\\) per unit change in <em>EDUCATION<\/em>. If we record <em>EDUCATION<\/em> as a rank in a regression model, then the \\(\\beta\\) for education corresponds to the increase in E \\(y\\) moving from <em>EDUCATION<\/em>=1 to <em>EDUCATION<\/em>=2 (from low to intermediate); this increase is the same as moving from <em>EDUCATION<\/em>=2 to <em>EDUCATION<\/em>=3 (from intermediate to high). Do we want to model this increase as the same? This is an assumption that the analyst makes with this coding of <em>EDUCATION<\/em>; it may or may not be valid but certainly needs to be recognized. <\/p>\n<p>  Because of this interpretation of coefficients, analysts rarely use ranks or other numerical scores to summarize <em>unordered<\/em> categorical variables. The most direct way of handling factors in regression is through the use of binary variables. A categorical variable with <em>c<\/em> levels can be represented using <em>c<\/em> binary variables, one for each category. For example, suppose that we were uncertain about the direction of the education effect and so decide to treat it as a factor. Then, we could code <em>c<\/em>=3 binary variables: (1) a variable to indicate low education, (2) one to indicate intermediate education and (3) one to indicate high education. These binary variables are often known as <em>dummy variables<\/em>. In regression analysis with an intercept term, we use only <em>c<\/em>-1 of these binary variables; the remaining variable enters implicitly through the intercept term. By identifying a variable as a factor, most statistical software packages will automatically create binary variables for you. <\/p>\n<p>  Through the use of binary variables, we do not make use of the ordering of categories within a factor. 
Because no assumption is made regarding the ordering of the categories, the choice of which binary variable to drop does not affect the fit of the model. However, it does affect the interpretation of the regression coefficients: each coefficient measures the expected difference in the response between its category and the omitted reference level, holding other explanatory variables fixed. <\/p>\n<p> <div class=\"alignleft\"><a href=\"https:\/\/users.ssc.wisc.edu\/~ewfrees\/regression\/chapter-4-multiple-linear-regression-ii\/\" title=\"Chapter 4. Multiple Linear Regression &#8211; II\">&#9668; Previous page<\/a><\/div><div class=\"alignright\"><a href=\"https:\/\/users.ssc.wisc.edu\/~ewfrees\/regression\/chapter-4-multiple-linear-regression-ii\/4-1-the-role-of-binary-variables\/example-term-life-insurance-continued\/\" title=\"Example: Term Life Insurance &#8211; Continued\">Next page &#9658;<\/a><\/div><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Categorical variables provide labels for observations to denote membership in distinct groups, or categories. A binary variable is a special case of a categorical variable. To illustrate, a binary variable may tell us whether or 
&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":3441,"menu_order":1,"comment_status":"closed","ping_status":"open","template":"","meta":{"jetpack_post_was_ever_published":false},"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/P8cLPd-Tx","acf":[],"_links":{"self":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/3443"}],"collection":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/comments?post=3443"}],"version-history":[{"count":13,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/3443\/revisions"}],"predecessor-version":[{"id":4112,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/3443\/revisions\/4112"}],"up":[{"embeddable":true,"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/pages\/3441"}],"wp:attachment":[{"href":"https:\/\/users.ssc.wisc.edu\/~ewfrees\/wp-json\/wp\/v2\/media?parent=3443"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}