When you start work with a data set, your first goals are to understand the data set and to clean it up. Logically these are two separate processes, but in practice they are intertwined: you can’t clean your data set until you understand it at some level, but you won’t fully understand it until it’s clean. Thus this section will cover both.
We’ll also introduce a lot of data science concepts in this section. This makes for lengthy discussions of tasks that in practice you can complete very quickly.
In this section, we’ll primarily use the file 2000_acs_sample.dta. This file started as the 1% unweighted sample of the 2000 American Community Survey available from IPUMS, but then we took a 1% random sample of the households in that dataset just to make it easier to work with. This data set uses the “source” variables directly from the Census Bureau rather than the “harmonized” variables created by IPUMS, which are much cleaner. For real work you can usually use the harmonized variables, but we’re here to learn how to do the kinds of things IPUMS does to create them.
By using this data set in the examples, you’ll also gain some basic understanding of the U.S. population (as of 2000) and some experience with a real and important data set, but you would not want to use this particular data file for research.
Setting Up
Start a do file that loads 2000_acs_sample.dta, and make it a proper do file (i.e. it keeps a log, starts with a blank slate, etc.). Call it first_steps_examples.do.
. capture log close
. log using first_steps_examples.log, replace
-------------------------------------------------------------------------------
name: <unnamed>
log: C:\Users\rdimond\dws\first_steps_examples.log
log type: text
opened on: 23 Dec 2022, 16:23:31
.
. clear all
. use 2000_acs_sample
.
Use this do file for the example code in this chapter and those exercises that involve working with the 2000_acs_sample.dta data set. Some exercises ask you to work with other data sets, so you’ll make a separate do file for those.
4.1 Read the Documentation
When you download a data set, you’ll be tempted to open it up and go to work right away. Resist! Time spent reading the data set’s documentation (assuming there is some) can save you much more time down the road. Data providers may give you files containing documentation along with the data itself, or it may be on their web site. Feel free to skim what’s not relevant to you–this section will give you a better sense of what information is most important.
Unfortunately, not all data sets have good documentation, or any documentation at all, so figuring out the nature of a data set by looking at the data set itself is a vital skill. You also can’t assume that the documentation is completely accurate, so you need to check what it says.
The ACS has lots of good documentation, but for practice we'll make minimal use of it (just the codebook) and figure out everything we can for ourselves. We'd still do all the same things if we were using the documentation; we'd just understand what we were looking at much more quickly.
4.2 Identify the Variables
The describe command will give you basic but useful information about your data set:
describe
Contains data from 2000_acs_sample.dta
Observations: 28,172
Variables: 16 8 Apr 2019 16:43
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
year int %8.0g year_lbl Census year
datanum byte %8.0g Data set number
serial double %8.0g Household serial number
hhwt double %10.2f Household weight
gq byte %43.0g gq_lbl Group quarters status
us2000c_seria~o str7 %9s Housing/Group Quarters (GQ) Unit
Serial Number
pernum int %8.0g Person number in sample unit
perwt double %10.2f Person weight
us2000c_pnum str2 %9s Person Sequence Number
us2000c_sex str1 %9s Sex
us2000c_age str2 %9s Age (original)
us2000c_hispan str2 %9s Hispanic or Latino Origin
us2000c_race1 str1 %9s Race Recode 1
us2000c_marstat str1 %9s Marital Status
us2000c_educ str2 %9s Educational Attainment
us2000c_inctot str7 %9s Person's Total Income in 1999
-------------------------------------------------------------------------------
Sorted by: serial pernum
The primary goal of running describe is to see what variables you have and what they’re called. The variable labels will help you understand what they mean. But it will frequently also let you start a “to do list” of issues you need to address before analyzing the data. Here are some issues brought out by running describe on this data set:
The data set seems to have an excess of identifiers: serial, pernum and us2000c_pnum.
Since you’re using a single data set from a single year, you don’t need year and datanum to tell you where each observation comes from.
For the same reason, you also don’t need us2000c_ (US 2000 Census) in your variable names to tell you where the variables come from.
You have both household weight (hhwt) and person weight (perwt) variables even though this is supposed to be an unweighted sample.
All of the us2000c_ variables are stored as strings of various lengths, even though some of them are clearly intended to be numeric variables.
Another issue to watch out for is the storage type of identifier variables. The default variable type, float, has seven digits of accuracy. Identifiers that are more than seven digits long must be stored as double (numbers with 16 digits of accuracy), as long (integers with up to 10 digits), or as strings (text with up to a billion characters). In this data set serial is stored as double and us2000c_pnum as string, so those are fine. The pernum variable, on the other hand, is an int (integer), so it can only have five digits. (This is our first hint that pernum by itself is not a unique identifier.)
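You can see float's precision limit for yourself with a quick test (the nine-digit number here is arbitrary):
display %12.0f float(123456789)
The number displayed is slightly off, because a float cannot store nine significant digits exactly.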
Note which variables have value labels: with those variables, what you see in output will be the labels and not the underlying values. You need to use the values in your code. You can see what all the underlying values are with label list. This data set makes the interesting choice of applying value labels to year which are identical to the values themselves. (Most likely there are some circumstances where they apply a different label.) You can also list a single label using the label names you get from describe:
label list gq_lbl
gq_lbl:
0 Vacant unit
1 Households under 1970 definition
2 Additional households under 1990 definition
3 Group quarters--Institutions
4 Other group quarters
5 Additional households under 2000 definition
6 Fragment
This is a more typical set of value labels. Looking at this definition tells you that if you wanted to limit your analysis to “Households under 1970 definition” the code for that restriction would be if gq==1.
Exercise 1
Carry out the same steps with the file 2000_acs_sample_harm.dta (create a new do file that loads the data, then runs describe and label list). This is a similar sample but with the IPUMS "harmonized" variables. What issues did IPUMS resolve? What issues remain?
When you’re done, run first_steps_examples.do again so 2000_acs_sample is the data set in memory.
4.3 Look at the Data
Unless your data set is very small, you can’t possibly read all of it. But just looking at a few rows may allow you to immediately spot patterns that would be difficult to detect using code. In this case, opening the data browser (type browse or click the button that looks like a magnifying glass over a spreadsheet) will show the following:
year and datanum seem to always have the same value, suggesting that we don’t need them.
hhwt and perwt (household and person weights) seem to always be 100.00, which makes sense given that this is supposed to be an unweighted sample.
pernum and us2000c_pnum appear to be identical other than us2000c_pnum being a string and having leading zeros.
pernum seems to count observations, starting over from 1 every time serial changes.
All of the string variables contain mostly numbers.
us2000c_sex, us2000c_hispan, us2000c_race1, and us2000c_marstat are clearly describing categories, but even though they are string variables they only contain numbers. We will have to refer to the codebook to find out what the numbers mean. (This also applies to us2000c_educ, it’s just not as obvious at this point.)
us2000c_inctot sometimes has “BBBBBBB”, which is certainly an odd value for income.
You can’t be sure that these patterns hold for the entire data set until you check, but now you know some things to check for. For example, you can run tab year and tab datanum to confirm they always contain the same value.
Exercise 2
Run your first_steps_exercises do file that loads 2000_acs_sample_harm.dta and then look at it in the data browser. What issues do you notice? Be sure to look closely at inctot.
When you’re done rerun first_steps_examples.do.
4.4 Find the Identifiers and Figure Out the Data Structure
Imagine you wanted to call someone’s attention to a particular value in this data set. How could you tell them exactly which row and column to look at?
4.4.1 Observation Numbers
One way to tell them which row to look at is to simply tell them the observation number. Observation numbers are tracked by the system variable _n, which you can use in commands as if it were any other variable. If you are using by, the variable _n will take that into account, meaning it will start over from 1 in each by group. We’ll use that heavily later.
The trouble with observation numbers is that many common tasks change them: sorting, for example, or dropping or adding observations. Thus it’s far better to use one or more identifier variables. If the data set does not contain identifier variables, you can create one based on the current observation numbers with gen id = _n. Then the variable id will not change even if the observation numbers do.
4.4.2 Identifier Variables
Identifier variables, also known as primary keys or indexes, are variables whose purpose is to identify observations. They normally do not contain information. If you had a data set where each row was a UW-Madison student, their UW-Madison ID number would be a unique identifier: specifying a UW-Madison ID number allows you to identify a single row.
Now imagine a data set describing UW-Madison students, but there is one row for each class the student is currently taking, with each class being identified by a number. In order to identify a specific row you now need to specify both a UW-Madison ID number and a class number. The combination of ID number and class number would be a compound identifier for the data set, and the combination of the two is a unique identifier.
Finding the identifier variables in a data set will help you understand the structure of the data. If student ID is a unique identifier, you know you have one row per student and an observation represents a student. If student ID and class number are a unique identifier, you know you have one row per student per class, and each observation represents a class/student combination. In practice you might refer to an observation as a class, as long as everyone understands that two students taking the same class will have two rows, not one.
The duplicates report command can easily identify whether a variable or set of variables is a unique identifier: if it is, there will be no duplicates. The ACS data has a variable called pernum (Person number in sample unit); is it a unique identifier?
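duplicates report pernum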
Clearly not: only one observation can be uniquely identified using pernum. Running tab on an identifier wouldn’t normally be useful because there will be too many values, but the large number of duplicates suggests it might be fruitful here:
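tab pernum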
Seeing the values, it’s clear what this variable is: the person’s identifier within their household. So pernum is not a unique identifier by itself, but it is part of a compound identifier that is. Try:
duplicates report serial pernum
Duplicates in terms of serial pernum
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 28172 0
--------------------------------------
There are no duplicates, so we now know that serial and pernum form a compound identifier for this data set. The serial variable identifies a household and pernum identifies a person within that household. We also know that each observation represents a person, and the people are grouped into households.
4.4.3 Column Identifiers
In Stata, columns are identified by variable names. Variable names are always unique (Stata won’t allow you to create two variables with the same name) but they often have multiple parts. In this data set, some of the variable names are in the form source_variable, like us2000c_sex and us2000c_age. It can be very useful to think of such variable names as a compound identifier with two separate parts (e.g. source and variable). It’s even possible to convert row identifiers to part of a compound column identifier and vice versa–we’ll learn how later.
4.4.4 Using Identifiers
Stata does not give identifiers any special status, so you use them like any other variable. If you want to see observation 5, run:
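list if _n==5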
Another way to identify the value of a specific variable for a specific observation number is to put the observation number in square brackets after the variable name.
display us2000c_age[5]
29
This square bracket syntax can be used in mathematical expressions, which is very useful.
Note: while the list command lists your data, the display command prints out a single thing. This can be the result of evaluating a mathematical expression (display 2+2) or a message (display "Completed discussion of identifiers"), both of which are quite useful.
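For example, you can pull specific observations' values into an arithmetic expression. Here's a trivial illustration using perwt, one of the numeric variables in this data set:
display perwt[1] + perwt[2]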
4.4.4.1 Exercise 3
Have first_steps_exercises.do open the data set atus.dta. This is a selection from the American Time Use Survey, which measures how much time people spend on various activities. Find the identifiers in this data set. What does an observation represent? What was the first activity recorded for person 20170101170012?
Be sure to go back to first_steps_examples.do when you’re done.
4.5 Get Rid of Data You Won’t Use
Understanding data takes time. Even skipping past data you don’t care about to get to what you do care about takes time. So if you won’t use parts of a data set, get rid of those parts sooner rather than later. Doing so will also reduce the amount of memory needed to analyze your data and the amount of disk space needed to store it, and make anything you do with it run that much faster. If you change your mind about what you need, you can always change your do file later.
The drop command can drop either variables or observations, depending on whether it is followed by a variable list or an if condition. To drop year and datanum, run:
drop year datanum
Let’s declare that you don’t want to include individuals living in “group quarters” (prisons, for example) in your analysis. Drop them with:
drop if gq==3 | gq==4
(762 observations deleted)
The keep command works in the same way, but in the opposite sense: running keep year datanum would drop all the variables except year and datanum. Running keep if gq==3 | gq==4 would drop everyone except those in group quarters. If you start with a data set that contains thousands of variables but only intend to use a dozen (very common), a keep command that cuts the data set down to just that dozen is an excellent starting point. On the other hand, if those variables have names that are codes rather than anything meaningful (also very typical in large data sets), consider renaming them first so you only have to use the codes once.
4.6 Change Variable Names
A good variable name tells you clearly what the variable contains. Good variable names make code easier to understand, easier to debug, and easier to write. If you have to choose between making a variable name short and making it clear, go with clear.
Variable names cannot contain spaces, but many variable names should contain multiple words. There are two competing conventions for making multi-word variable names readable. Camel case capitalizes the first letter of each word after the first: householdIncome, mothersEducation, etc. Snake case uses underscores instead of spaces: household_income, mothers_education, etc. We suggest snake case because it’s easier to read, but which one you use is less important than that you choose one and stick with it. Don’t force yourself to remember whether you called your variable householdIncome or household_income this time!
When you use abbreviations use the same abbreviation every time, even across projects. This data set abbreviates education as educ, and there’s nothing wrong with that, but if you use edu in your other projects that’s sufficient reason to change it.
The syntax to change a single variable name is just rename old_name new_name. Change pernum to person with:
rename pernum person
The rename command can rename groups of variables as well. We’re not planning to combine our example data with other data sets, so we don’t need the us2000c_ at the beginning of many of the variable names to tell us which data set the variable came from. You can remove it with:
rename us2000c_* *
This can be read as “rename all the variables that match the pattern us2000c_ followed by something to just the something.” Type help rename group to see the various ways you can rename groups of variables.
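For example, you can rename variables in pairs, or transform every name at once (the variable names here are hypothetical):
rename (oldname1 oldname2) (newname1 newname2)
rename *, lower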
Many of the resulting variable names are easy to remember once you know what they are but cryptic on a first reading. We’ll make them clearer in the exercise.
4.6.1 Exercise 4
Rename race1 to race, serial to household, educ to edu, hispan to hispanic, marstat to marital_status, and inctot to income.
This exercise is part of the workflow of cleaning up our ACS sample, so do it in first_steps_examples.do. We’ll also include the solution here since it needs to be part of the workflow of the Jupyter Notebook for this chapter, but don’t read it until you’ve tried to do it yourself.
rename race1 race
rename serial household
rename educ edu
rename hispan hispanic
rename marstat marital_status
rename inctot income
. rename race1 race
. rename serial household
. rename educ edu
. rename hispan hispanic
. rename marstat marital_status
. rename inctot income
.
The file first_steps1.dta contains the data set as it should be at this point. If you haven’t carried out all the steps, you should have your do file load that data set before proceeding.
4.7 Convert String Variables that Contain Numbers to Numeric Variables
Unfortunately, it’s very common for numeric variables to be imported into Stata as strings. Before you can do much work with them they need to be converted into numeric variables.
The destring command is the easy way to convert strings to numbers. Just give it a varlist to act on, and the replace option to tell it it can replace the existing string variables with the new numeric versions. You have some of these in the ACS sample; destring the ones you’re interested in with:
destring sex age hispanic race marital_status edu income, replace
sex: all characters numeric; replaced as byte
age: all characters numeric; replaced as byte
hispanic: all characters numeric; replaced as byte
race: all characters numeric; replaced as byte
marital_status: all characters numeric; replaced as byte
edu: all characters numeric; replaced as byte
income: contains nonnumeric characters; no replace
Before carrying out a conversion, Stata checks that all the string variable’s values can be successfully converted to numbers. In the case of income, Stata found that some values could not be. Technically this is not an error (your do file did not crash) but income was not converted. To explore why, use the gen() option to create a new variable containing the numeric values, and the force option to tell Stata to proceed with the conversion despite the values that cannot be converted.
destring income, gen(income2) force
income: contains nonnumeric characters; income2 generated as long
(6144 missing values generated)
The force option is so named because you’re forcing Stata to do something it doesn’t think is a good idea. This should make you nervous! Only use a force option if you’re very confident that you understand both why Stata thinks what you’re doing might be a bad idea and why it’s okay in your case.
With both income (the original string variable) and income2 (the new numeric version) available to you, you can compare them and figure out what’s going on. Observations with values of income that could not be converted have missing values for income2, so get a list of the values of income that could not be converted with tab:
tab income if income2==.
Person's |
Total |
Income in |
1999 | Freq. Percent Cum.
------------+-----------------------------------
BBBBBBB | 6,144 100.00 100.00
------------+-----------------------------------
Total | 6,144 100.00
There’s only one value that couldn’t be converted, “BBBBBBB”. This is a code for missing, so it should be converted to a missing value: destring did exactly what you want it to do. Now that you know that, you can go back and just convert income using the force option rather than creating a separate income2 variable. But you want to keep a record that you checked and know that that’s okay to do.
Go back in your do file and comment out (i.e. turn into comments) both the destring command that created income2 and the tab command that checked it against income. Add text that explains what you found and why that means you don’t have to worry about the values that cannot be converted. Then add a destring command that converts income to a numeric variable using the replace and force options. Rerun your do file so that the data set no longer contains income2 and income is numeric.
(This chapter can’t go back and change its workflow so it needs to drop income2, but your modified do file should not create it.)
destring income, force replace
drop income2
. destring income, force replace
income: contains nonnumeric characters; replaced as long
(6144 missing values generated)
. drop income2
.
There’s also a function that converts strings to numbers, called real() (as in real numbers). You could have created the income2 variable by running gen income2=real(income). The advantage of real() is that you can use it as part of an expression.
The assert command verifies that a condition you give it is true for all observations, making it very useful for checking all sorts of things. We’ve been reasonably confident that person and pnum are exactly the same (other than pnum being a string) for a long time, but now you can find out for sure:
assert real(pnum)==person
This command asserts that pnum, once converted to a number, is always the same as person. Stata does not complain, which means it agrees. If the assertion were not true, Stata would tell you how often the assertion is false–and then crash your do file. This is a good thing, because if you write a do file that assumes that some condition is true, it’s far better for that do file to crash and tell you your assumption is wrong than for it to keep running and give you incorrect results.
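assert is worth using liberally. For example, you could confirm what the data browser suggested about the weight variables (this should run silently, since the sample is unweighted):
assert hhwt==100 & perwt==100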
Now that you know you don’t need pnum, drop it:
drop pnum
4.8 Identify the Type of Each Variable
The most common variable types are continuous variables, categorical variables, string variables, and identifier variables. Categorical variables can be further divided into unordered categorical variables, ordered categorical variables, and indicator variables. (There are other variable types, such as date/time variables, but we’ll focus on these for now.) Often it’s obvious what type a variable is, but it’s worth taking a moment to consider each variable and make sure you know its type.
Continuous variables can, in principle, take on an infinite number of values. They can also be changed by arbitrarily small amounts. In practice, all continuous variables must be rounded, as part of the data collection process or just because computers do not have infinite precision. As long as the underlying quantity is continuous, it doesn’t matter how granular the available measurements of that quantity are. You may have a data set where the income variable is measured in thousands of dollars and all the values are integers, but it’s still a continuous variable.
Continuous variables are sometimes referred to as quantitative variables, emphasizing that the numbers they contain actually correspond to some quantity in the real world. Thus it makes sense to do math with them.
Categorical variables, also called factor variables, take on a finite set of values, often called levels. The levels are typically stored as numbers (1=white, 2=black, 3=hispanic, for example), but it’s important to remember that the numbers don’t actually represent quantities. Categorical variables can also be stored as strings.
With unordered categorical variables, the numbers assigned are completely arbitrary. Nothing would change if you assigned different numbers to each level (1=black, 2=hispanic, 3=white). Thus it makes no sense to do any math with them, like finding the mean.
With ordered categorical variables, the levels have some natural order. Likert scales are examples of ordered categorical variables (e.g. 1=Very Dissatisfied, 2=Dissatisfied, 3=Neither Satisfied nor Dissatisfied, 4=Satisfied, 5=Very Satisfied). The numbers assigned to the levels should reflect their ordering, but beyond that they are still arbitrary: you could add 5 to all of them, or multiply them all by 2, and nothing would change. You will see people report means for ordered categorical variables and do other math with them, but you should be aware that doing so imposes assumptions that may or may not be true. Moving one person from Satisfied to Very Satisfied and moving one person from Very Dissatisfied to Dissatisfied have exactly the same effect on the mean, but are you really willing to assume that they’re equivalent changes?
Indicator variables, also called binary variables, boolean variables or dummy variables, are just categorical variables with two levels. In principle they can be ordered or unordered but with only two levels it rarely matters. Often they answer the question “Is some condition true for this observation?” Occasionally indicator variables are referred to as flags, and more commonly flagging observations where a condition is true means to create an indicator variable for that condition.
String variables contain text. Sometimes the text is just labels for categories, and they can be treated like categorical variables. Other times they contain actual information.
Identifier variables help you find observations rather than containing information about them, though some compound identifiers blur the line between identifier variables and categorical variables. They may look like continuous variables because they have so many unique values, but you’ll probably find the code you use for categorical variables to be more useful with them.
A useful tool for identifying variable types is the codebook command. It produces a lot of output (there’s a reason we covered dropping unneeded variables first) and you can skim much of it.
codebook will try to guess the type of each variable. For continuous variables, it will give you the mean, standard deviation, and percentiles. For categorical variables (including strings it thinks are categorical), it will give frequencies. For string variables it thinks are not categorical, it will give you some examples.
Run codebook and look over the output. We won’t include it here, but some things to note:
It gave summary statistics for household (the household identifier formerly known as serial) and person (the person identifier formerly known as pernum) because it guessed they were continuous variables. This is of course nonsense and can be ignored. The codebook command gives useful output, but not all codebook output is useful.
hhwt and perwt really are always 100, so now you know for sure you can drop them.
gq is now down to two values, making it an indicator variable. But we don’t care about the distinction between the 1970 and 1990 definitions of household so we’ll just drop it.
sex is an indicator variable coded 1 and 2. We’ll have to look up what those numbers mean.
age has 92 unique values and a range that looks like actual years, so we can be confident it’s a continuous variable.
You might think hispanic would be an indicator variable, but with 22 unique values it must be a categorical variable.
race and marital_status are categorical. Again, we’ll have to look up what the numbers mean.
With 17 unique values and examples that are plausible numbers of years in school, edu could be a quantitative variable. But the mean and median are low. We’ll consider this variable more closely.
With 2,614 unique values and values that look like plausible incomes we can be confident income is a continuous variable.
Given what we’ve learned, drop hhwt, perwt, and gq:
drop hhwt perwt gq
4.8.1 Exercise 5
Run first_steps_exercises.do to load the ATUS and then run codebook. Identify the variable type of famincome, hispan, asian. How are hispan and asian different from each other?
Then rerun first_steps_examples.do.
4.9 Recode Indicator Variables
It is highly convenient to frame indicator variables as telling you if something is true or not, with “true” coded as 1 and “false” as zero. In this data set the variable sex has the levels 1 and 2. Which are the males and which are the females? We’ll have to refer to the codebook to find out. Now consider a variable called “female” coded with 1 and 0. To anyone familiar with the convention that 1 means true and 0 means false, no further documentation is required. This also allows you to write intuitive code like if female and if !female (but only if there are no missing values).
IPUMS provided a codebook file for this dataset, 2000_acs_codebook.txt (note that this is not the same as the output of Stata’s codebook command). On line 112, you’ll see the coding for sex: 1 is Male and 2 is Female. Create a variable called female instead with:
gen female = (sex==2)
Recall that if you set a variable equal to a condition, it will get a 1 if the condition is true and a 0 if the condition is false.
This command relies on the fact that sex is never missing in this data set (codebook told you that). If you had missing values for sex, you’d make sure female was also missing for those observations with:
gen female = (sex==2) if sex<.
It also relies on the fact that gender is binary in this data set: anyone who is not female is male. That is likely to become less common over time.
A good way to check your work for these kinds of tasks is to run a crosstab of the new and old variables. All the observations should be in table cells that make sense (e.g. 1 for sex and 0 for female) and none should be in the table cells that don’t make sense (e.g. 2 for sex and 0 for female). Be sure to include missing values:
tab sex female, miss
| female
Sex | 0 1 | Total
-----------+----------------------+----------
1 | 13,326 0 | 13,326
2 | 0 14,084 | 14,084
-----------+----------------------+----------
Total | 13,326 14,084 | 27,410
Now that you’re confident female is coded correctly, drop sex:
drop sex
Exercise 6
Create an indicator variable for “this person is hispanic.” Look in 2000_acs_codebook.txt to see which values of the existing hispanic variable identify someone as hispanic. Call the new variable hisp at first. Check your work by running a crosstab of hisp and hispanic, then drop hispanic and rename hisp to hispanic so in the end you have a variable called hispanic that is an indicator.
This exercise and all the remaining exercises in this chapter are part of the workflow of cleaning up the ACS, so do them in first_steps_examples.do and the solutions will be given here (but don’t read them until you try to do them yourself).
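Solution (a sketch: it assumes, following the codebook, that 1 means "Not Hispanic" and any higher value indicates a Hispanic origin):
gen hisp = (hispanic != 1)
tab hisp hispanic, miss
drop hispanic
rename hisp hispanic
Note that the gen command relies on hispanic never being missing in this data set.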
4.10 Set Labels
Labels make a data set easier to use and understand–you’ve seen the benefits of variable labels as you’ve worked to understand this data set. Value labels can be even more useful by telling you what categorical variables mean: right now variables like race are useless without referring to the codebook.
Labels are set using the label command. The label command has many subcommands, like the label list you’ve used already.
4.10.1 Set Variable Labels
You can set variables labels with the label variable command: just specify the variable to be labeled and the text to label it with.
Variables like female are arguably clear enough that they don’t need labels, but set one for it anyway just for practice:
label variable female "Person is female"
Exercise 7
Set a similar variable label for hispanic (do not switch do files).
Solution:
label variable hispanic "Person is hispanic"
4.11 Set Value Labels
Value labels are a mapping from a set of numbers, the levels of your categorical variables, to a set of text labels that tell you what each level means. The first step in using them is to define the mapping using label define:
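Here's the mapping for marital_status (reconstructed; the values match the codebook and the tab output below):
label define marital_status_label ///
    1 "Now married"   ///
    2 "Widowed"       ///
    3 "Divorced"      ///
    4 "Separated"     ///
    5 "Never married"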
These meanings come from the codebook file, line 246. Note how the use of multiple lines and indentation makes this command easier to read.
The next step is to label the values of one or more variables with the mapping defined. This is done with the label values command:
label values marital_status marital_status_label
If you wanted to apply this mapping to more than one variable you’d just list them all before ending with the name of the label.
See the results by running:
tab marital_status
Marital |
Status | Freq. Percent Cum.
--------------+-----------------------------------
Now married | 11,643 42.48 42.48
Widowed | 1,405 5.13 47.60
Divorced | 2,177 7.94 55.55
Separated | 435 1.59 57.13
Never married | 11,750 42.87 100.00
--------------+-----------------------------------
Total | 27,410 100.00
If a mapping will only apply to one variable, it may be convenient to give the mapping the same name as the variable. This makes for a label values command that will look confusing to new Stata users, which is why we don’t do it here (e.g. label values marital_status marital_status), but makes it very easy to remember the name of the label mapping for each variable.
You also need to remember the underlying values so you can use them in your code. In most cases this will come naturally as you work with the data. But if you don’t mind somewhat uglier output you can put the numbers directly in the labels (e.g. “1: Now married” or “Now married (1)”).
In the process of setting value labels you’ll discover something important about edu: it is a categorical variable, not a quantitative variable meaning “years of school.” Of course reading the documentation before getting started would have told you this long ago.
Exercise 8
Set value labels for race, using the codebook to find the meaning of each level. You may want to shorten the descriptions. Hint: 5 really means “They checked the box for American Indian or Alaska Native, but they didn’t specify a tribe so we don’t know if they’re an American Indian or an Alaska Native.”
edu also needs value labels badly: look in the codebook and you’ll see it is categorical, not quantitative. Since setting them is kind of tedious, you have one-time permission to copy and paste the code below into your do file rather than typing it yourself.
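Here is a reconstruction of that code; the category wording follows the codebook, but verify it against your copy of 2000_acs_codebook.txt:
label define edu_label ///
    0 "Not in universe" ///
    1 "None" ///
    2 "Nursery school-4th grade" ///
    3 "5th-6th grade" ///
    4 "7th-8th grade" ///
    5 "9th grade" ///
    6 "10th grade" ///
    7 "11th grade" ///
    8 "12th grade, no diploma" ///
    9 "High school graduate" ///
    10 "Some college, <1 year" ///
    11 "Some college, 1+ years, no degree" ///
    12 "Associate degree" ///
    13 "Bachelor's degree" ///
    14 "Master's degree" ///
    15 "Professional degree" ///
    16 "Doctorate degree"
label values edu edu_label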
Alternatively, first_steps2.dta contains the data as it should be at this point, and you can have your do file load it before proceeding.
4.12 Examine Variable Distributions
Understanding the distributions of your variables is important for both data cleaning and analysis.
4.12.1 Continuous Variables
For a continuous variable like income, summarize (sum) is the place to start for understanding its distribution:
sum income
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
income | 21,266 27724.1 39166.4 -10000 720000
Things to note:
These statistics were calculated across 21,266 observations, while the data set has 27,410. This reflects 6,144 missing values (the former “BBBBBBB” codes).
The mean is $27,724.1, which seems low, but keep in mind it includes children, retirees, people who are unemployed, etc.
The minimum value is -10000. We’ll come back to the question of negative incomes.
For analysis purposes, an important question is whether the variable is normally distributed, or even close to it. Add the detail option for some hints:
sum income, detail
Person's Total Income in 1999
-------------------------------------------------------------
Percentiles Smallest
1% 0 -10000
5% 0 -10000
10% 0 -10000 Obs 21,266
25% 6000 -10000 Sum of wgt. 21,266
50% 18000 Mean 27724.1
Largest Std. dev. 39166.4
75% 35800 536000
90% 60000 545000 Variance 1.53e+09
95% 82200 572000 Skewness 4.954469
99% 179000 720000 Kurtosis 41.3295
With a mean that’s much higher than the median (50th percentile) and a skewness of 4.95, income is clearly not normally distributed. You’ll find that almost all income distributions are strongly right-skewed. (As an interesting exception: the distribution of non-faculty staff salaries at UW-Madison is strongly left-skewed.)
A histogram is a great tool for understanding the distribution of continuous variables, and you can create one with the hist command. Use the freq option to have the y-axis labeled in terms of frequency rather than density:
hist income, freq
(bin=43, start=-10000, width=16976.744)
Unfortunately, the outliers drive the scale, making it hard to say much about the distribution of the bulk of the observations. Consider looking at a subset of the incomes:
hist income if income<100000, freq
(bin=43, start=-10000, width=2553.4884)
Exercise 9
Examine the distribution of age. How close is it to normally distributed? Do you see any anomalies? Since age is an integer, run another histogram with the discrete option, which gives you one bin for each value. Any concerns now?
Do this exercise in first_steps_examples.do, but since it doesn’t change the data it doesn’t have to be part of this chapter’s workflow. So we’ll make you look at the back of the book for the solution.
4.12.2 Categorical Variables
For a categorical variable like edu, tabulate (tab) is the place to start for understanding its distribution:
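tab edu
We won't include the full table here, but some things to note: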
There are no missing values, or at least none coded as such.
The category “Not in universe” needs some investigation.
There are more people in lower education categories than you might expect (41.7% did not graduate from high school).
There may be more categories here than are useful, so consider combining them.
A bar graph is a great tool for understanding the distribution of categorical variables. Unlike a frequency table, a reader can absorb the information in a bar graph instantly. It may take some work to make them presentable; Bar Graphs in Stata discusses some of the tricks needed. You can see the problem if you try to create the default bar graph for edu:
graph bar, over(edu)
There’s not enough space on the x-axis for all the labels. But this problem has a very simple solution: switch to a horizontal bar graph.
graph hbar, over(edu)
This makes horizontal the format of choice for bar graphs.
The syntax for bar graphs may seem confusing: why the over() option rather than just graph hbar edu? The reason is that bar graphs are also used to examine relationships between variables. A variable list is used to specify the variable that defines the lengths of the bars (by default the mean of that variable), while the over() option specifies the variable that defines the bars themselves, as in:
graph hbar income, over(edu)
We hope the relative lengths of the last two bars do not surprise or disappoint any PhD students reading this.
4.12.2.1 Exercise 10
Examine the distributions of race, female, and hispanic.
4.14 Investigate Anomalies
We’ve identified several oddities in this data set as we’ve explored it. These could have significant effects on your analysis, so it’s important to figure out what they mean.
edu has a level called “Not in universe,” coded with 0. The Census Bureau probably isn’t actually collecting data on extra-dimensional beings, so what does this mean? Begin by examining the distribution of age for people who have “Not in universe” for edu:
sum age if edu==0
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
age | 1,123 .9946572 .8076991 0 2
Everyone with “Not in universe” for education is under the age of three. Is the converse true?
tab edu if age<3
Educational Attainment | Freq. Percent Cum.
-------------------------+-----------------------------------
Not in universe | 1,123 100.00 100.00
-------------------------+-----------------------------------
Total | 1,123 100.00
People with “Not in universe” are all under the age of three, and people under the age of three are always “Not in universe.” The Census Bureau uses “Not in universe” to mean the variable does not apply; in this case it’s because if someone is under the age of three they don’t even ask about their education. “Legitimate skips” are similar: questions a respondent did not answer (skipped) because they did not apply.
This is why the different coding of hispan and asian in the ATUS is somewhat puzzling: hispan is treated as a categorical variable that applies to everyone, with one valid value being “Not Hispanic,” while asian is treated as a categorical variable that only applies to Asian people, with non-Asians being “Not in universe.” Either makes sense, but you’d expect them to be the same.
We also noted that the percentage of people with less than a high school education (41.7%) seemed high for the United States in 2000, but that includes children who simply aren’t old enough to have graduated from high school. If we limit the sample to adults, say those 18 and older, the distribution is much more plausible:
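tab edu if age>=18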
This is a good example of how you should check your work as you go: see if your data matches what you know about the population of interest, and if not, be sure there’s a valid reason why.
We also noted negative values for income. First off, check how many there are:
count if income<0
29
Very few, which lowers the stakes in dealing with them. So who are the people with negative incomes?
sum age if income<0
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
age | 29 49.89655 13.45454 24 77
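tab female if income<0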
Person is |
female | Freq. Percent Cum.
------------+-----------------------------------
0 | 16 55.17 55.17
1 | 13 44.83 100.00
------------+-----------------------------------
Total | 29 100.00
So they’re all adults, but there are no other obvious patterns. One plausible explanation is that these people had losses on investments. However, in order to lose money on investments you have to have investments. So if that’s the explanation, these people may have substantial wealth even though they have negative incomes in this particular year. This is problematic if you plan to use income as a proxy for broader socio-economic status. You might consider changing negative values of income to missing.
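If you decided to go that route (this chapter leaves them as-is), it would be a one-line change:
replace income = . if income<0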
4.14.1 Exercise 11
Who are the people who marked “Other” for race? (Look at the distributions of other variables for the people who have “Other” for race.)
4.15 Recode Values that Mean Missing to Missing Values
Stata uses special codes to indicate missing values. For numeric values there’s the generic missing value, ., plus the extended missing values .a, .b, .c, through .z. (Recall that as far as greater than and less than are concerned, .<.a, .a<.b, etc. so if x<. excludes all missing values of x.) For strings, the empty string, "" (i.e. quotes with absolutely nothing inside, not even a space) means missing. These values receive special treatment. For example, statistical commands exclude missing values from their calculations.
Data sets often use different codes to indicate missing values. We saw that with income: “BBBBBBB” meant missing. This was automatically converted to a missing value when you converted income from a string variable to a numeric variable, but it won’t always be that simple. It’s very common to use negative numbers to mean missing, especially -9. Stata will not recognize that -9 means missing and will include it in statistical calculations, giving incorrect results.
The solution is to identify values that really mean missing, and then change them to missing values. For edu, the value 0, “Not in universe” means missing. Change it with:
replace edu = . if edu==0
(1,123 real changes made, 1,123 to missing)
What about negative values for income? Is that really a code for missing? Take a look with:
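tab income if income<0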
The variety of values suggests these are actual quantities, not codes.
4.15.1 Exercise 12
The codebook says that for marital_status 5 means “Never married (includes under 15 years).” In other words, the Census Bureau didn’t ask about the marital status of children under the age of fifteen just like they didn’t ask about the education of children under the age of three. But for marital status they coded it as if it were known that the children were never married. Undo this choice by changing marital_status to missing for children under fifteen.
Solution (given here since it changes the data):
replace marital_status = . if age<15
(6,144 real changes made, 6,144 to missing)
4.16 Examine Missing Data
Once all missing values are coded in a way that Stata recognizes, the misstable sum command will give you a very useful summary of the missing data in your data set.
misstable sum
Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
marital_st~s | 6,144 21,266 | 5 1 5
edu | 1,123 26,287 | 16 1 16
income | 6,144 21,266 | >500 -10000 720000
-----------------------------------------------------------------------------
Any variable that is not listed does not have any missing values. Thus you now know you can write conditions like if age>=65 without worrying about missing values of age being included. You also see how many missing values you have for each variable, and thus how big an issue missing data is in your data set.
What misstable cannot tell you is why the data are missing. Often you can answer this question by examining the relationships between missing values and other variables. For example, who are the people with missing values of income?
sum age if income==.
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
age | 6,144 7.245117 4.315471 0 14
tab income if age<15
no observations
The Census Bureau did not ask about the income of children under the age of 15 just like they did not ask about their marital status.
You should also consider relationships between missing values. In this data set, people with a missing value of income also have a missing value for marital_status because both variables were not collected for children under the age of 15. But in many data sets there are direct relationships between missing values. For example, if a subject could not be located in a wave of a survey then they may have missing values for all the variables for that survey wave.
Data is missing completely at random if the probability of it being missing is unrelated to either the observed data or the unobserved data. Thus the unobserved data has the same distribution as the observed data. Complete cases analysis (analysis of just the observed data) will have less power due to the missing data, but will be unbiased.
Data is missing at random if the probability of it being missing is related only to the observed data (the missingness is random conditional on the observed data.) Thus the unobserved data can be distributed differently from the observed data. Complete cases analysis will be biased, but methods such as weighting or multiple imputation may be able to correct that bias–though they generally depend on being able to make inferences about the unobserved data using the observed data. For example, pollsters know people with lower levels of education are less likely to respond to polls, but they can use the responses of those people with low levels of education who do respond to make inferences about those who don’t (typically by weighting).
Data is missing not at random if the probability of it being missing is related to the unobserved data. You cannot correct for bias due to missing not at random data; worse, you can’t even detect it using the observed data. For example, UW-Madison once did a survey on the use of electronic calendars. The response rate was low and it seemed likely that people who use calendars heavily would be more motivated to respond, meaning estimates of usage would be biased high. But there was no way to know for sure if this was true since the probability of answering the survey would depend on the values of usage that were not observed.
Consider edu, which is missing for children under the age of three. If we could observe the values of edu for these children, they would probably be mostly “None” with perhaps a few “Nursery school-4th grade” (presumably all “Nursery School”). This is very different from the observed distribution of edu. This is an example of “missing at random” since the probability of edu being missing depends on age and age is always observed, but it would be difficult to correct for the bias introduced because we have no actual data (just assumptions) about the distribution of edu for children under three.
Exercise 13
What is the likely distribution of income for people with a missing value for income? How is that different from the distribution of income for those where income is known? How would the mean of income change if all its missing values became known? (In other words, how do the missing values of income bias estimates of the population mean of income?)
At this point the data set is reasonably clean, and what you do with it next will depend on how you plan to use it. For example, if you wanted to use education as a predictor in a regression model it would probably be wise to combine some of the categories, but if you were doing a descriptive study you might leave it as is. Wrap up this do file by saving the cleaned-up data set (never saving it over the original file) and closing the log:
save 2000_acs_clean, replace
log close
. save 2000_acs_clean, replace
file 2000_acs_clean.dta saved
. log close
name: <unnamed>
log: C:\Users\rdimond\dws\first_steps_examples.log
log type: text
closed on: 23 Dec 2022, 16:23:45
-------------------------------------------------------------------------------
.
// temp
save acs, replace
reshape wide age edu race female hispanic income marital_status, i(household) j(person)
save acs_wide, replace
. // temp
. save acs, replace
file acs.dta saved
. reshape wide age edu race female hispanic income marital_status, i(household)
> j(person)
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)
Data Long -> Wide
-----------------------------------------------------------------------------
Number of observations 27,410 -> 10,565
Number of variables 10 -> 114
j variable (16 values) person -> (dropped)
xij variables:
age -> age1 age2 ... age16
edu -> edu1 edu2 ... edu16
race -> race1 race2 ... race16
female -> female1 female2 ... female16
hispanic -> hispanic1 hispanic2 ... hispanic
> 16
income -> income1 income2 ... income16
marital_status -> marital_status1 marital_status2
> ... marital_status16
-----------------------------------------------------------------------------
. save acs_wide, replace
(file acs_wide.dta not found)
file acs_wide.dta saved
.
Review
As a review, here are the first steps you should take with your data. Many of them are very quick, and the ones that take more time are especially valuable. Do not skip these steps!
Read the Documentation
Identify the Variables
Look at the Data
Find the Identifiers and Figure Out the Data Structure
Get Rid of Data You Won’t Use
Change Variable Names
Convert String Variables that Contain Numbers to Numeric Variables
Identify the Type of Each Variable
Recode Indicator Variables
Set Variable and Value Labels
Examine Variable Distributions
Investigate Anomalies
Recode Values that Mean Missing to Missing Values
Examine Missing Data