9 Categorical Data

We generally think of data as a collection of “measurements,” in a loose sense of the word “measurement”. In this loose sense, there are two basic types of “measurement”, measurements on continuous scales, and measurements on categorical scales. (In ordinary speech the word “measurement” often implies a continuous scale.)

Continuous measurements can be represented by a point on a number line, are well-ordered, and in principle can take on one value of an infinite set of choices. Think of a variable like age, which varies continuously from 0 to higher numbers, and where there is a unique order to the ages represented by 1, 5.5, and 55. In principle we can measure age with arbitrary precision - 5.5, or 5.5001 or 5.5000001. The scale here might be measured in days or years, but in any case it is continuous.

Categorical measurements can be represented by arbitrary labels (maybe numerals, maybe character strings), have no conceptual order, and take one value from a finite set of choices. Think of a variable like state of residence, which takes one of about 50 values (Washington, D.C.? Territories?), which have no inherent order.

(Interval and ordinal measurements may be thought of, and are often treated, as continuous measurements with limited precision.)

The distinction between continuous and categorical variables is fundamental to how we use them the analysis. In a regression for example, continuous variables give us slopes and curvature terms, where categorical variables give us intercepts.

In R, it is convenient to manage categorical data as factors. In software like Stata, SAS, and SPSS, we specify which variables are categorical when we call an analytical procedure like regression - no special distinction is made when we are managing or storing our data. In R, we specify which variables are factors when we create and store them - in an analytical procedure we need make no additional specification to distinguish levels of measurement.

In R, a factor refers to a class of data stored in numeric form, usually with some sort of values labels. The numbers (integers) merely represent distinct categories, with no meaningful order to the categories.

For example, we might have a data set where ‘1’ means Green Bay, ‘2’ means Madison, and ‘3’ means Milwaukee.

As with Date class data, we will seldom need to manipulate the underlying integers, we will mainly work with the “human-readable” value labels.

The basic constructor function for data with class ‘factor’ is factor(). For example, we can begin with a character vector of city names, and use factor() to construct a factor from this.

city <- c("Madison", "Milwaukee", "Green Bay")
[1] "Madison"   "Milwaukee" "Green Bay"
x <- factor(city)
[1] Madison   Milwaukee Green Bay
Levels: Green Bay Madison Milwaukee

Notice that factors print differently than character data - no quotes.

9.1 Factors in Generic Functions

In addition to printing slightly differently than character data, in generic functions that take numeric inputs, factors are treated differently as well. Three functions that give different output with factors (versus a numeric vector) are summary(), plot(), and lm().

We can look at the example data set chickwts, which includes both a numeric variable and a factor variable. We learn from help(chickwts) that this data set was created from an experiment testing the effect of different feeds on chicken weights.

Using the summary function, the factor feed produces a frequency table, rather than the six number summary produced by weight.

str(chickwts)   # "weight" is numeric, "feed" is categorical
'data.frame':   71 obs. of  2 variables:
 $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean

     weight             feed   
 Min.   :108.0   casein   :12  
 1st Qu.:204.5   horsebean:10  
 Median :258.0   linseed  :12  
 Mean   :261.3   meatmeal :11  
 3rd Qu.:323.5   soybean  :14  
 Max.   :423.0   sunflower:12  

In plots, a factor produces a categorical x-axis, and a boxplot rather than a scatter plot.

plot(weight ~ feed, data=chickwts)

In modeling, a factor is used as a categorical variables, generating a set of dummy variables and a set of parameters, rather than a single parameter.

summary(lm(weight ~ feed, data=chickwts))

lm(formula = weight ~ feed, data = chickwts)

     Min       1Q   Median       3Q      Max 
-123.909  -34.413    1.571   38.170  103.091 

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    323.583     15.834  20.436  < 2e-16 ***
feedhorsebean -163.383     23.485  -6.957 2.07e-09 ***
feedlinseed   -104.833     22.393  -4.682 1.49e-05 ***
feedmeatmeal   -46.674     22.896  -2.039 0.045567 *  
feedsoybean    -77.155     21.578  -3.576 0.000665 ***
feedsunflower    5.333     22.393   0.238 0.812495    
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.85 on 65 degrees of freedom
Multiple R-squared:  0.5417,    Adjusted R-squared:  0.5064 
F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10

Here, all the categorical parameters are named with the prefix “feed”.

9.2 Logical Comparisons and Math Operators

Logical comparisons are made with the value labels (which are character strings), not the underlying integer codes. Only some logical operators are allowed with factors, namely those based on equality.

rs <- sample(chickwts$feed, 7)
[1] sunflower horsebean soybean   sunflower sunflower soybean   horsebean
Levels: casein horsebean linseed meatmeal soybean sunflower
rs == "casein"
rs == 1  # no error message, but WRONG!
rs > "casein" # error
Warning in Ops.factor(rs, "casein"): '>' not meaningful for factors

Notice that if we try to check for a numeric value, the numeral is treated as if it were a label and not the underlying data! It would be nice if this at least gave us a warning!

In a similar manner, we will not be doing any math at all with categorical data.

rs + 1
Warning in Ops.factor(rs, 1): '+' not meaningful for factors
Warning in mean.default(rs): argument is not numeric or logical: returning NA
[1] NA

9.3 Levels and Labels

The factor() function has two key arguments: levels and labels.

There is a crucial difference between these two arguments. The levels parameter sets the order. The labels parameter assigns value labels to the “given” order. If no levels are specified, the order is sorted alphabetically (which is seldom what we want).

As an example, we can make a new copy of the feed variable with the levels reordered (“re-leveled”, this is also called “re-factoring”):

chick <- chickwts          # first, create a working copy
chick$feed2 <- factor(chick$feed, levels=
                           c("horsebean", "linseed", "soybean",
                             "meatmeal", "casein", "sunflower"))
plot(weight ~ feed2, data=chick, main="Reordered levels")

If we attempt to use the labels argument, we merely relabel the data (which in this case is WRONG!):

chick$feed3 <- factor(chick$feed, labels=
                           c("horsebean", "linseed", "soybean",
                             "meatmeal", "casein", "sunflower"))

opar <- par(mfcol=c(2,1))
plot(weight ~ feed, data=chick, main="Original")
plot(weight ~ feed3, data=chick, main="Bad labels")


Think of levels as an “input” function, specifying what numbers to use to encode categories. Then labels is an “output” function, specifying how the underlying data should be displayed.

One other useful function here, to slightly reorder the levels so that we have a new category in the first position, is the relevel() function.

chick$feed4 <- relevel(chick$feed, "sunflower")
plot(weight ~ feed4, data=chick, main="New first level")

In practice, there are many different ways we might want to reorder (recode) the levels of a factor. The package forcats from the tidyverse collection of packages has quite a few useful functions for this.

9.4 Collapsing and Dropping Categories

To combine several categories of an existing factor into one category, we relabel them.

While we use the labels to combine categories, notice that the underlying integers are also recoded.

chick$feed5 <- factor(chick$feed, labels=
                           c("A", "A", "A",
                             "B", "B", "B"))
plot(weight ~ feed5, data=chick)

 Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...

If you have a lot of categories, but only a few to combine, the forcats function fct_recode() is very useful.

To drop categories, just do not mention them among the input levels. The data is encoded <NA>, the missing value.

chick$feed6 <- factor(chick$feed, levels=
                           c("horsebean", "casein"))
plot(weight ~ feed6, data=chick)

table(chick$feed, chick$feed6, useNA="ifany")
            horsebean casein <NA>
  casein            0     12    0
  horsebean        10      0    0
  linseed           0      0   12
  meatmeal          0      0   11
  soybean           0      0   14
  sunflower         0      0   12

9.5 A confusing function name

Just to make things confusing, the levels() function returns and sets the level labels

chick$feed2 <- chick$feed
levels(chick$feed)  # what we have
[1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"
# sets WRONG! labels again
levels(chick$feed2) <-c("horsebean", "linseed", "soybean",
                          "meatmeal", "casein", "sunflower")
plot(weight ~ feed2, data=chick, main="Bad labels again")

9.6 Categorical Vector Exercises

  1. Convert Numeric Codes to Factors

    In the mtcars data, all the variables are numeric. Convert vs to a factor, where 0 has the label “V-shaped” and 1 has the label “Straight”. How are the levels encoded, numerically?

    Do the same for am, giving 0 the label “automatic” and 1 the label “manual”.

  2. Extract and Convert Row Names

    Again in mtcars, take the row.names and extract the make of each car type (the first word in the row name). Convert this into a factor.

    Produce a frequency count of each car maker.

    Plot car weight (wt) over car make.

    Re-level so that Maserati is the first category, then replot.

  3. Superfluous Levels

    If is quite possible to have a factor defined with extra levels, levels for which you have labels but no actual data. Here is an example:

    x <- sample(c("A", "B"), 20, replace=TRUE)
    xf <- factor(x, levels=c("A", "B", "C"))

    Produce a table of frequency counts for xf.

    Produce a bar chart of counts, plot(xf).

    For most analyses, we would rather drop the extra category. The simplest approach is to refactor, without specifying levels or labels.


    xf2 <- factor(xf)

    Then retabulate and replot.