4.1 The Role of Binary Variables

In this section, you learn how to:
  • Represent categorical variables using a set of binary variables
  • Interpret the regression coefficients associated with categorical variables
  • Describe the effect of the reference level choice on the model fit


Categorical variables provide labels for observations to denote membership in distinct groups, or categories. A binary variable is a special case of a categorical variable. To illustrate, a binary variable may tell us whether or not someone has health insurance. A categorical variable could tell us whether someone has

  • private group insurance (offered by employers and associations),
  • private individual health insurance (through insurance companies),
  • public insurance (such as Medicare or Medicaid) or
  • no health insurance.

For categorical variables, there may or may not be an ordering of the groups. In health insurance, it is difficult to order these four categories and say which is “larger”: private group, private individual, public, or no health insurance. In contrast, for education, we might group individuals into “low,” “intermediate,” and “high” years of education. In this case, there is an ordering among groups based on level of educational achievement. As we will see, this ordering may or may not provide information about the dependent variable. Factor is another term used for an unordered categorical explanatory variable.

For ordered categorical variables, analysts typically assign a numerical score to each outcome and treat the variable as if it were continuous. For example, if we had three levels of education, we might employ ranks and use
\begin{equation*}
EDUCATION = \left\{ \begin{array}{cl}
1 & \textrm{for low education} \\
2 & \textrm{for intermediate education} \\
3 & \textrm{for high education.}
\end{array} \right.
\end{equation*}
An alternative would be to use a numerical score that approximates an underlying value of the category. For example, we might use
\begin{equation*}
EDUCATION = \left\{ \begin{array}{cl}
6 & \textrm{for low education} \\
10 & \textrm{for intermediate education} \\
14 & \textrm{for high education.}
\end{array} \right.
\end{equation*}
This gives the approximate number of years of schooling that individuals in each category completed.
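As an illustration of these two coding choices, here is a minimal sketch in Python with pandas; the column name education and the data values are invented for illustration, while the score assignments follow the example above.

```python
import pandas as pd

# Hypothetical data: education recorded as an ordered categorical variable.
df = pd.DataFrame({"education": ["low", "high", "intermediate", "low", "high"]})

# Option 1: rank scores 1, 2, 3.
rank_scores = {"low": 1, "intermediate": 2, "high": 3}

# Option 2: approximate years of schooling for each category.
year_scores = {"low": 6, "intermediate": 10, "high": 14}

df["educ_rank"] = df["education"].map(rank_scores)
df["educ_years"] = df["education"].map(year_scores)
print(df)
```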

Assigning numerical scores and treating the variable as continuous has important implications for interpreting the regression model. Recall that a regression coefficient is the marginal change in the expected response; in this case, the $\beta$ for education assesses the increase in $\mathrm{E}(y)$ per unit change in EDUCATION. If we record EDUCATION as a rank in a regression model, then the $\beta$ for education corresponds to the increase in $\mathrm{E}(y)$ when moving from EDUCATION=1 to EDUCATION=2 (from low to intermediate); this increase is the same as when moving from EDUCATION=2 to EDUCATION=3 (from intermediate to high). Do we want to model these increases as equal? This is an assumption that the analyst makes with this coding of EDUCATION; it may or may not be valid, but it certainly needs to be recognized.
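To state this assumption explicitly: with the rank coding, and holding any other explanatory variables fixed, the linear model implies equal increments between adjacent education levels,
\begin{equation*}
\mathrm{E}(y \mid EDUCATION=2) - \mathrm{E}(y \mid EDUCATION=1)
= \beta
= \mathrm{E}(y \mid EDUCATION=3) - \mathrm{E}(y \mid EDUCATION=2).
\end{equation*}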

Because of this interpretation of coefficients, analysts rarely use ranks or other numerical scores to summarize unordered categorical variables. The most direct way of handling factors in regression is through the use of binary variables. A categorical variable with $c$ levels can be represented using $c$ binary variables, one for each category. For example, suppose that we were uncertain about the direction of the education effect and so decided to treat it as a factor. Then, we could code $c=3$ binary variables: (1) a variable to indicate low education, (2) one to indicate intermediate education, and (3) one to indicate high education. These binary variables are often known as dummy variables. In a regression with an intercept term, we use only $c-1$ of these binary variables; the omitted category enters implicitly through the intercept term. If you identify a variable as a factor, most statistical software packages will automatically create the binary variables for you.
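To show what these dummy variables look like in practice, here is a minimal sketch using Python and pandas; the column name education and the toy data are invented for illustration, and other statistical packages offer analogous facilities.

```python
import pandas as pd

# Hypothetical data with a three-level factor for education.
df = pd.DataFrame({"education": ["low", "intermediate", "high", "low", "high"]})

# All c = 3 binary (dummy) variables, one per category.
all_dummies = pd.get_dummies(df["education"])

# The c - 1 dummies used in a regression with an intercept:
# drop_first=True drops the reference level (here "high", the first
# category alphabetically), which is absorbed into the intercept.
model_dummies = pd.get_dummies(df["education"], drop_first=True)

print(all_dummies)
print(model_dummies)
```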

Through the use of binary variables, we do not make use of the ordering of categories within a factor. Because no assumption is made about the ordering of the categories, the choice of which binary variable is dropped, that is, the choice of reference level, does not affect the fit of the model. However, it does matter for the interpretation of the regression coefficients.
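The following sketch illustrates this point with simulated data (the variable names, group means, and sample size are all invented). It fits the same regression with two different reference levels and checks that the fitted values agree even though the coefficients differ; it assumes the pandas and statsmodels packages are available.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical illustrative data: the response y depends on education category.
rng = np.random.default_rng(0)
df = pd.DataFrame({"education": rng.choice(["low", "intermediate", "high"], size=200)})
group_means = {"low": 20, "intermediate": 30, "high": 45}
df["y"] = df["education"].map(group_means) + rng.normal(scale=5, size=200)

# Fit the same model with two different reference (dropped) levels.
fit_low = smf.ols("y ~ C(education, Treatment(reference='low'))", data=df).fit()
fit_high = smf.ols("y ~ C(education, Treatment(reference='high'))", data=df).fit()

# The coefficients differ, but the fitted values are identical.
print(fit_low.params)
print(fit_high.params)
print(np.allclose(fit_low.fittedvalues, fit_high.fittedvalues))  # True
```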
