set linesize 100
12 Exercise Solutions
12.1 Reading in Data
12.1.1 Exercise 1
The details will depend on where you want to put the example files, your username, and the kind of computer you’re using. The example code in the text and repeated here assumes you want to put the example files in a folder on your desktop and your username is bbadger
. Modify as needed.
Windows:
cd c:\users\bbadger\desktop
mkdir dws
cd dws
net get dws, from(https://ssc.wisc.edu/sscc/stata/)
MacOS or Linux:
cd /users/bbadger/desktop
mkdir dws
cd dws
net get dws, from(https://ssc.wisc.edu/sscc/stata/)
Exercse 2
clear
use 2000_acs_sample
. clear
. use 2000_acs_sample
.
Exercise 3
After editing to remove the absolute path:
clear import delimited qualtrics_survey.csv, varnames(1) rowrange(4) colrange(18:21)
(encoding automatically selected: ISO-8859-1)
(4 vars, 69 obs)
Take a look at the values of q1
with tab
:
tab q1
Q1 | Freq. Percent Cum.
-------------------------+-----------------------------------
18 | 7 10.14 10.14
19 | 20 28.99 39.13
20 | 23 33.33 72.46
21 | 10 14.49 86.96
22 | 1 1.45 88.41
24 | 3 4.35 92.75
25 | 1 1.45 94.20
27 | 1 1.45 95.65
31 | 1 1.45 97.10
40 | 1 1.45 98.55
Between 20-25 | 1 1.45 100.00
-------------------------+-----------------------------------
Total | 69 100.00
When asked their age, someone typed in “Between 20-25” instead of a number. This forced Stata to treat the entire column as text. You’ll learn how to deal with problems like this in First Steps With Your Data.
For real work it’s easier to have Qualtrics give you an SPSS data set. Stata can read SPSS data sets, and they’re already structured as data sets so they generally require less work to read.
12.2 First Steps With Your Data
Excercise 1
Be sure to create a new do file for this exercise because it uses a separate dataset. Call it first_steps_exercises.do
. Then be sure to switch back to first_steps.do
when you go back to following the examples in the book.
The output from label list
is quite long, so for the book we’ll comment it out and only list a few specific value labels. You should run the full version and skim the results though!
capture log close
log using first_steps_exercises.log, replace
clear all
use 2000_acs_sample_harm
describe
// label list
label list race_lbl
label list educ_lbl
. capture log close
. log using first_steps_exercises.log, replace
----------------------------------------------------------------------------------------------------
name: <unnamed>
log: C:\Users\rdimond\dws\first_steps_exercises.log
log type: text
opened on: 11 Jan 2023, 14:33:29
.
. clear all
. use 2000_acs_sample_harm
.
. describe
Contains data from 2000_acs_sample_harm.dta
Observations: 28,085
Variables: 18 19 Mar 2019 13:50
----------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
----------------------------------------------------------------------------------------------------
year int %8.0g year_lbl Census year
datanum byte %8.0g Data set number
serial double %8.0g Household serial number
hhwt double %10.2f Household weight
gq byte %43.0g gq_lbl Group quarters status
pernum int %8.0g Person number in sample unit
perwt double %10.2f Person weight
sex byte %8.0g sex_lbl Sex
age int %36.0g age_lbl Age
marst byte %23.0g marst_lbl
Marital status
race byte %32.0g race_lbl Race [general version]
raced int %182.0g raced_lbl
Race [detailed version]
hispan byte %12.0g hispan_lbl
Hispanic origin [general version]
hispand int %24.0g hispand_lbl
Hispanic origin [detailed version]
educ byte %25.0g educ_lbl Educational attainment [general version]
educd int %46.0g educd_lbl
Educational attainment [detailed version]
inctot long %12.0g Total personal income
ftotinc long %12.0g Total family income
----------------------------------------------------------------------------------------------------
Sorted by:
. // label list
. label list race_lbl
race_lbl:
1 White
2 Black/African American/Negro
3 American Indian or Alaska Native
4 Chinese
5 Japanese
6 Other Asian or Pacific Islander
7 Other race, nec
8 Two major races
9 Three or more major races
. label list educ_lbl
educ_lbl:
0 N/A or no schooling
1 Nursery school to grade 4
2 Grade 5, 6, 7, or 8
3 Grade 9
4 Grade 10
5 Grade 11
6 Grade 12
7 1 year of college
8 2 years of college
9 3 years of college
10 4 years of college
11 5+ years of college
.
Note how the variable names have been cleaned up and many more variables have value labels applied you can tell what they actually mean.
Since we can’t put a data browser in this book, here’s a list of the first 10 observations for a few key variables, including inctot
:
list serial pernum sex age inctot in 1/10
+------------------------------------------------------------+
| serial pernum sex age inctot |
|------------------------------------------------------------|
1. | 202721 3 Female 7 9999999 |
2. | 1.2e+06 7 Male 11 9999999 |
3. | 78909 3 Male 16 15000 |
4. | 570434 1 Male 32 18000 |
5. | 620890 1 Male 52 59130 |
|------------------------------------------------------------|
6. | 575621 2 Female 51 79350 |
7. | 1.1e+06 1 Male 34 15000 |
8. | 36467 2 Female 74 7000 |
9. | 869072 3 Female 3 9999999 |
10. | 201108 4 Female Less than 1 year old 9999999 |
+------------------------------------------------------------+
Four out of ten observations have an income of exactly $9,999,999 and they’re all children…not very plausible. In reality 9999999
is a code for “missing.” We’ll discuss dealing with missing values soon, but clearly something must be done: calculating a mean income with the values given here would give you a very wrong answer.
Exercise 3
First open the data and look at the variables:
clear
use atus
describe
. clear
. use atus
.
. describe
Contains data from atus.dta
Observations: 199,894
Variables: 21 11 Apr 2019 15:44
----------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
----------------------------------------------------------------------------------------------------
year long %12.0g Survey year
caseid double %20.0f ATUS Case ID
famincome int %20.0g famincome_lbl
Family income
pernum byte %8.0g pernum_lbl
Person number (general)
lineno int %21.0g lineno_lbl
Person line number
wt06 double %17.0g Person weight, 2006 methodology
age int %8.0g Age
sex byte %21.0g sex_lbl Sex
race int %42.0g race_lbl Race
hispan int %22.0g hispan_lbl
Hispanic origin
asian int %12.0g asian_lbl
Asian origin
marst byte %24.0g marst_lbl
Marital status
educ int %47.0g educ_lbl Highest level of school completed
educyrs int %36.0g educyrs_lbl
Years of education
empstat byte %22.0g empstat_lbl
Labor force status
fullpart byte %21.0g fullpart_lbl
Full time/part time employment status
uhrsworkt int %8.0g Hours usually worked per week
earnweek double %7.2f Weekly earnings
actline byte %8.0g Activity line number
activity long %73.0g activity_lbl
Activity
duration int %8.0g Duration of activity
----------------------------------------------------------------------------------------------------
Sorted by:
.
caseid
is certainly an identifier, but does it uniquely identify observations?
duplicates report caseid
Duplicates in terms of caseid
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
5 | 300 240
6 | 648 540
7 | 1162 996
8 | 1688 1477
9 | 2529 2248
10 | 3320 2988
11 | 4180 3800
12 | 5616 5148
13 | 6669 6156
14 | 7168 6656
15 | 7740 7224
16 | 8576 8040
17 | 10098 9504
18 | 9270 8755
19 | 10412 9864
20 | 9600 9120
21 | 9996 9520
22 | 9064 8652
23 | 8510 8140
24 | 8904 8533
25 | 7925 7608
26 | 7072 6800
27 | 6669 6422
28 | 5236 5049
29 | 5017 4844
30 | 4950 4785
31 | 4247 4110
32 | 3872 3751
33 | 3102 3008
34 | 3264 3168
35 | 2835 2754
36 | 2808 2730
37 | 2257 2196
38 | 1900 1850
39 | 1443 1406
40 | 1960 1911
41 | 1353 1320
42 | 966 943
43 | 903 882
44 | 836 817
45 | 720 704
46 | 460 450
47 | 752 736
48 | 528 517
49 | 588 576
50 | 100 98
51 | 408 400
52 | 208 204
53 | 159 156
54 | 108 106
55 | 385 378
56 | 112 110
57 | 114 112
58 | 116 114
59 | 59 58
60 | 120 118
61 | 183 180
63 | 126 124
64 | 128 126
68 | 68 67
70 | 70 69
71 | 71 70
77 | 77 76
79 | 79 78
90 | 90 89
--------------------------------------
Definitely not, so we need to add something to it. pernum
and lineno
seem plausible but they’re both just 1 for everyone. Try actline
instead:
duplicates report caseid actline
Duplicates in terms of caseid actline
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 199894 0
--------------------------------------
Now we know that a combination of caseid
and actline
uniquely identify observations, but also that an observation in the data set is an activity, with the activities grouped by the “case” (person) who did them.
To find the first activity for person 20170101170012 we use those variables as indexes:
list activity if caseid==20170101170012 & actline==1
+----------+
| activity |
|----------|
20. | Sleeping |
+----------+
Exercise 4
To save space we’ll only run codebook
on selected variables:
codebook famincome hispan asian
----------------------------------------------------------------------------------------------------
famincome Family income
----------------------------------------------------------------------------------------------------
Type: Numeric (int)
Label: famincome_lbl
Range: [1,16] Units: 1
Unique values: 16 Missing .: 0/199,894
Examples: 7 $20,000 to $24,999
11 $40,000 to $49,999
13 $60,000 to $74,999
15 $100,000 to $149,999
----------------------------------------------------------------------------------------------------
hispan Hispanic origin
----------------------------------------------------------------------------------------------------
Type: Numeric (int)
Label: hispan_lbl
Range: [100,250] Units: 1
Unique values: 9 Missing .: 0/199,894
Tabulation: Freq. Numeric Label
170,986 100 Not Hispanic
16,754 210 Mexican
2,675 220 Puerto Rican
1,294 230 Cuban
960 241 Dominican
881 242 Salvadoran
1,940 243 Other Central American
2,502 244 South American
1,902 250 Other Spanish
----------------------------------------------------------------------------------------------------
asian Asian origin
----------------------------------------------------------------------------------------------------
Type: Numeric (int)
Label: asian_lbl
Range: [10,999] Units: 1
Unique values: 8 Missing .: 0/199,894
Tabulation: Freq. Numeric Label
2,348 10 Asian Indian
1,771 20 Chinese
1,286 30 Filipino
566 40 Japanese
529 50 Korean
459 60 Vietnamese
1,365 70 Other Asian
191,570 999 NIU
famincome
is categorical, with the categories representing income ranges. Sometimes you’ll see people convert such variables to quantitative variables by setting each person’s income to the midpoint of the range. That would be their expected value of income if incomes within a range were uniformly (or even symmetrically) distributed. They are not.
hispan
and asian
are again categorical where you might have expected indicator. But note how hispan
has a straightforward “Not Hispanic” category, while asian
has “NIU” with a numeric value that looks like a code for missing. “NIU” stands for a charming bit of Census Bureau terminology we’ll learn about shortly.
Exercise 8
clear
use first_steps2
. clear
. use first_steps2
.
hist age
(bin=44, start=0, width=2.1136364)
hist age, discrete
(start=0, width=1)
It turns out age
is top-coded, and 93 really means “93 or older.” Presumably this is to preserve privacy. This could have a big impact on some forms of analysis. Of course you already know this if you read all of the codebook–always read the data documentation!
Exercise 9
“Other” is coded as 8. Hopefully you looked at several variables, but the one of interest is hispanic
:
tab hispanic if race==8
Person is |
hispanic | Freq. Percent Cum.
------------+-----------------------------------
0 | 58 3.65 3.65
1 | 1,529 96.35 100.00
------------+-----------------------------------
Total | 1,587 100.00
In other words, almost all the people who marked “Other” for race are Hispanic, though not all Hispanics chose “Other”:
tab race if hispanic
Race Recode 1 | Freq. Percent Cum.
--------------------------+-----------------------------------
White | 1,707 48.13 48.13
Black | 57 1.61 49.73
American Indian | 11 0.31 50.04
Alaska Native | 1 0.03 50.07
Indigenous, tribe unknown | 22 0.62 50.69
Asian | 3 0.08 50.78
Pacific Islander | 1 0.03 50.80
Other | 1,529 43.11 93.91
Multiracial | 216 6.09 100.00
--------------------------+-----------------------------------
Total | 3,547 100.00
Social scientists think of “Hispanic” as an ethnicity, not a race. People who are Hispanic may be White, Black, etc. But laypeople often think of Hispanic as a race. Thus almost half of Hispanics looked at the list of available races and chose “Other” because Hispanic was not on the list. The 2010 Census tried to avoid this problem by adding a note to the instructions: “For this census, Hispanic Origins are not races.”
Excercise 11
Most children under the age of 15 probably have zero income, though some may have a small amount of income. Certainly their distribution of income
is very different and generally lower than the observed distribution. If these values were observed, the mean of income
would be much lower, so using the sample mean as an estimator of the population mean is biased upward by the missing data. Fortunately we’re usually interested in the income of adults.
12.3 Working With Hierarchical Data
clear
use acs
. clear
. use acs
.
Exercise 1
In this data set a level two unit is a person, a level one unit is a person-month combination (you can call it a month as long as it’s understood two people observed in the same month is two observations, not one), years_edu
is a level two variable because it is always the same for a given person (it does not change over time), and employed
is a level one variable because it is not always the same for a given person.
Exercise 2
by household: egen mean_age = mean(age)
list household person age mean_age in 1/5
. by household: egen mean_age = mean(age)
. list household person age mean_age in 1/5
+------------------------------------+
| househ~d person age mean_age |
|------------------------------------|
1. | 37 1 20 19.33333 |
2. | 37 2 19 19.33333 |
3. | 37 3 19 19.33333 |
4. | 241 1 50 50 |
5. | 242 1 29 29 |
+------------------------------------+
.
Exercise 3
Now the subset of interest is the adults:
gen age_if_adult = age if age>=18
by household: egen mean_adult_age = mean(age_if_adult)
list household person age mean_adult_age if household==8787, ab(30)
. gen age_if_adult = age if age>=18
(7,292 missing values generated)
. by household: egen mean_adult_age = mean(age_if_adult)
(3 missing values generated)
. list household person age mean_adult_age if household==8787, ab(30)
+-------------------------------------------+
| household person age mean_adult_age |
|-------------------------------------------|
179. | 8787 1 45 45 |
180. | 8787 2 16 45 |
181. | 8787 3 15 45 |
+-------------------------------------------+
.
Exercise 4
First remind yourself of the coding scheme for edu
:
label list edu_label
edu_label:
0 Not in universe
1 None
2 Nursery school-4th grade
3 5th-6th grade
4 7th-8th grade
5 9th grade
6 10th grade
7 11th grade
8 12th grade, no diploma
9 High School graduate
10 Some college, <1 year
11 Some college, >=1 year
12 Associate degree
13 Bachelor's degree
14 Master's degree
15 Professional degree
16 Doctorate degree
Thus someone is a college graduate if edu
is greater than 11 but not missing:
gen grad = (edu>11 & edu<.)
Now you’re ready to create your household-level variables:
by household: egen num_grads = total(grad)
by household: egen prop_grads = mean(grad)
by household: egen has_grad = max(grad)
list household person age edu num_grads prop_grads has_grad if household==4398, ab(30)
. by household: egen num_grads = total(grad)
. by household: egen prop_grads = mean(grad)
. by household: egen has_grad = max(grad)
. list household person age edu num_grads prop_grads has_grad if household==4398, ab(30)
+------------------------------------------------------------------------------------+
| household person age edu num_grads prop_grads has_grad |
|------------------------------------------------------------------------------------|
93. | 4398 1 33 Professional degree 2 .4 1 |
94. | 4398 2 34 Master's degree 2 .4 1 |
95. | 4398 3 5 None 2 .4 1 |
96. | 4398 4 3 None 2 .4 1 |
97. | 4398 5 0 . 2 .4 1 |
+------------------------------------------------------------------------------------+
.
“All the adults in the household are college graduates” is a little trickier. One approach would be to treat this as a subsetting problem and create edu_if_adult
and apply min()
to it. An alternative is to focus on combining conditions. An individual makes “all the adults in the household are college graduates” false if they are an adult and not a college graduate. They do their part to make it true if they are a child or a college graduate:
by household: egen all_adults_grad = min(age<18 | edu>11)
list household person age edu all_adults_grad if household==4398, ab(30)
. by household: egen all_adults_grad = min(age<18 | edu>11)
. list household person age edu all_adults_grad if household==4398, ab(30)
+------------------------------------------------------------------+
| household person age edu all_adults_grad |
|------------------------------------------------------------------|
93. | 4398 1 33 Professional degree 1 |
94. | 4398 2 34 Master's degree 1 |
95. | 4398 3 5 None 1 |
96. | 4398 4 3 None 1 |
97. | 4398 5 0 . 1 |
+------------------------------------------------------------------+
.
Exercise 5
If you find this exercise difficult, you may want to read the longer introduction to loops in Stata Programming Essentials. It takes the time to explain things better.
First, load acs_wide
if you haven’t already, and remind yourself how race is coded:
clear
use acs_wide
label list race_label
. clear
. use acs_wide
.
. label list race_label
race_label:
1 White
2 Black
3 American Indian
4 Alaska Native
5 Indigenous, tribe unknown
6 Asian
7 Pacific Islander
8 Other
9 Multiracial
.
forvalues i = 1/16 {
gen black`i' = (race`i'==2) if race`i' < .
}
egen has_black_member = rowmax(black*)
egen all_black = rowmin(black*)
. forvalues i = 1/16 {
2. gen black`i' = (race`i'==2) if race`i' < .
3. }
(2,779 missing values generated)
(6,191 missing values generated)
(7,876 missing values generated)
(9,355 missing values generated)
(10,115 missing values generated)
(10,388 missing values generated)
(10,484 missing values generated)
(10,531 missing values generated)
(10,544 missing values generated)
(10,554 missing values generated)
(10,557 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)
(10,564 missing values generated)
.
. egen has_black_member = rowmax(black*)
. egen all_black = rowmin(black*)
.
Just so you can be sure your variables were created properly:
tab has_black_member
tab all_black
. tab has_black_member
has_black_m |
ember | Freq. Percent Cum.
------------+-----------------------------------
0 | 9,299 88.02 88.02
1 | 1,266 11.98 100.00
------------+-----------------------------------
Total | 10,565 100.00
. tab all_black
all_black | Freq. Percent Cum.
------------+-----------------------------------
0 | 9,417 89.13 89.13
1 | 1,148 10.87 100.00
------------+-----------------------------------
Total | 10,565 100.00
.
12.3.1 Exercise 6
clear
use nlsy
describe
duplicates report id year
. clear
. use nlsy
.
. describe
Contains data from nlsy.dta
Observations: 241,034
Variables: 6 27 Dec 2022 13:11
----------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
----------------------------------------------------------------------------------------------------
id float %9.0g ID# (1-12686) 79
year int %9.0g
year_of_birth float %16.0g
edu float %24.0g edulabel
income float %9.0g
age float %9.0g
----------------------------------------------------------------------------------------------------
Sorted by: id year
. duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 241034 0
--------------------------------------
.
Since id
and year
uniquely identify observations, we have one row per person per year. A level 2 unit is a person, a level 1 unit is a person-year combination, and the data are in long form.
To learn more we’ll list the data for person 2 (person 1 is missing a lot) but you should look at the data browser:
list if id==2, ab(30)
+-------------------------------------------------------+
| id year year_of_birth edu income age |
|-------------------------------------------------------|
20. | 2 1979 59 9TH GRADE 4000 20 |
21. | 2 1980 59 9TH GRADE 5000 21 |
22. | 2 1981 59 9TH GRADE 6000 22 |
23. | 2 1982 59 9TH GRADE 10000 23 |
24. | 2 1983 59 9TH GRADE 11000 24 |
|-------------------------------------------------------|
25. | 2 1984 59 9TH GRADE 11500 25 |
26. | 2 1985 59 12TH GRADE 11000 26 |
27. | 2 1986 59 12TH GRADE 14000 27 |
28. | 2 1987 59 12TH GRADE 16000 28 |
29. | 2 1988 59 12TH GRADE 20000 29 |
|-------------------------------------------------------|
30. | 2 1989 59 12TH GRADE 19000 30 |
31. | 2 1990 59 12TH GRADE 20000 31 |
32. | 2 1991 59 12TH GRADE 20000 32 |
33. | 2 1992 59 12TH GRADE 22000 33 |
34. | 2 1993 59 12TH GRADE 25000 34 |
|-------------------------------------------------------|
35. | 2 1994 59 12TH GRADE 0 35 |
36. | 2 1996 59 12TH GRADE 0 37 |
37. | 2 1998 59 12TH GRADE 0 39 |
38. | 2 2000 59 12TH GRADE 0 41 |
+-------------------------------------------------------+
year_of_birth
is a level 2 or person-level variable (it does not change), while edu
, income
, and age
are level 1 or year-level variables.
tab edu
edu | Freq. Percent Cum.
-------------------------+-----------------------------------
NONE | 77 0.04 0.04
1ST GRADE | 55 0.03 0.07
2ND GRADE | 2 0.00 0.07
3RD GRADE | 248 0.13 0.19
4TH GRADE | 245 0.12 0.32
5TH GRADE | 263 0.13 0.45
6TH GRADE | 970 0.49 0.94
7TH GRADE | 2,032 1.03 1.97
8TH GRADE | 6,239 3.16 5.13
9TH GRADE | 10,748 5.44 10.57
10TH GRADE | 12,667 6.41 16.98
11TH GRADE | 13,610 6.89 23.86
12TH GRADE | 83,641 42.32 66.19
1ST YEAR COLLEGE | 17,291 8.75 74.94
2ND YEAR COLLEGE | 16,001 8.10 83.03
3RD YEAR COLLEGE | 8,465 4.28 87.32
4TH YEAR COLLEGE | 17,106 8.66 95.97
5TH YEAR COLLEGE | 3,292 1.67 97.64
6TH YEAR COLLEGE | 2,546 1.29 98.93
7TH YEAR COLLEGE | 1,173 0.59 99.52
8TH YEAR COLLEGE OR MORE | 946 0.48 100.00
-------------------------+-----------------------------------
Total | 197,617 100.00
edu
is labeled as a categorical variable, but the underlying values are also number of years of school (except for the last one) so you can treat it as quantitative:
tab edu, nolabel
edu | Freq. Percent Cum.
------------+-----------------------------------
0 | 77 0.04 0.04
1 | 55 0.03 0.07
2 | 2 0.00 0.07
3 | 248 0.13 0.19
4 | 245 0.12 0.32
5 | 263 0.13 0.45
6 | 970 0.49 0.94
7 | 2,032 1.03 1.97
8 | 6,239 3.16 5.13
9 | 10,748 5.44 10.57
10 | 12,667 6.41 16.98
11 | 13,610 6.89 23.86
12 | 83,641 42.32 66.19
13 | 17,291 8.75 74.94
14 | 16,001 8.10 83.03
15 | 8,465 4.28 87.32
16 | 17,106 8.66 95.97
17 | 3,292 1.67 97.64
18 | 2,546 1.29 98.93
19 | 1,173 0.59 99.52
20 | 946 0.48 100.00
------------+-----------------------------------
Total | 197,617 100.00
Now let’s look at person 1:
list if id==1, ab(30)
+-------------------------------------------------------+
| id year year_of_birth edu income age |
|-------------------------------------------------------|
1. | 1 1979 58 12TH GRADE 4620 21 |
2. | 1 1980 58 . . 22 |
3. | 1 1981 58 12TH GRADE 5000 23 |
4. | 1 1982 58 . . 24 |
5. | 1 1983 58 . . 25 |
|-------------------------------------------------------|
6. | 1 1984 58 . . 26 |
7. | 1 1985 58 . . 27 |
8. | 1 1986 58 . . 28 |
9. | 1 1987 58 . . 29 |
10. | 1 1988 58 . . 30 |
|-------------------------------------------------------|
11. | 1 1989 58 . . 31 |
12. | 1 1990 58 . . 32 |
13. | 1 1991 58 . . 33 |
14. | 1 1992 58 . . 34 |
15. | 1 1993 58 . . 35 |
|-------------------------------------------------------|
16. | 1 1994 58 . . 36 |
17. | 1 1996 58 . . 38 |
18. | 1 1998 58 . . 40 |
19. | 1 2000 58 . . 42 |
+-------------------------------------------------------+
Collecting longitudinal data requires tracking people down repeatedly over a period of years. When the NLSY couldn’t find a subject in a given year, they could not collect data about their education and income in that year and entered missing values. The fact that age
is never missing tells us it is calculated rather than collected.
Note that many longitudinal data sets will simply not have a row for a person for a time period they could not collect data about them. Also, the NLSY is very persistent about trying to contact people even if they failed to find them in the past. Other data sets give up on people more quickly so you get a lot of attrition.
Exercise 7 {.unnumbered}
by id: gen ending_income = income[_N]
(97,128 missing values generated)
Exercise 8
Now we’re looking for a year in which the subject’s education is 12 and their education the following year is greater than 12. But remember missing is essentially infinity and thus definitely greater than 12, so this time you have to deal with missing values explicitly:
by id: gen start_college = (edu==12 & (edu[_n+1]>12 & edu[_n+1]<.))
by id: egen times_started = total(start_college)
tab times_started
. by id: gen start_college = (edu==12 & (edu[_n+1]>12 & edu[_n+1]<.))
. by id: egen times_started = total(start_college)
. tab times_started
times_start |
ed | Freq. Percent Cum.
------------+-----------------------------------
0 | 172,349 71.50 71.50
1 | 68,628 28.47 99.98
2 | 57 0.02 100.00
------------+-----------------------------------
Total | 241,034 100.00
.
Now we have people who started college multiple times. Take a look to see why:
list id year edu start_college times_started if times_started>1, ab(30)
+----------------------------------------------------------------+
| id year edu start_college times_started |
|----------------------------------------------------------------|
64601. | 3401 1979 11TH GRADE 0 2 |
64602. | 3401 1980 12TH GRADE 0 2 |
64603. | 3401 1981 12TH GRADE 0 2 |
64604. | 3401 1982 12TH GRADE 0 2 |
64605. | 3401 1983 12TH GRADE 0 2 |
|----------------------------------------------------------------|
64606. | 3401 1984 12TH GRADE 0 2 |
64607. | 3401 1985 12TH GRADE 1 2 |
64608. | 3401 1986 1ST YEAR COLLEGE 0 2 |
64609. | 3401 1987 1ST YEAR COLLEGE 0 2 |
64610. | 3401 1988 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
64611. | 3401 1989 1ST YEAR COLLEGE 0 2 |
64612. | 3401 1990 1ST YEAR COLLEGE 0 2 |
64613. | 3401 1991 1ST YEAR COLLEGE 0 2 |
64614. | 3401 1992 1ST YEAR COLLEGE 0 2 |
64615. | 3401 1993 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
64616. | 3401 1994 12TH GRADE 0 2 |
64617. | 3401 1996 12TH GRADE 0 2 |
64618. | 3401 1998 12TH GRADE 1 2 |
64619. | 3401 2000 1ST YEAR COLLEGE 0 2 |
66881. | 3521 1979 9TH GRADE 0 2 |
|----------------------------------------------------------------|
66882. | 3521 1980 10TH GRADE 0 2 |
66883. | 3521 1981 11TH GRADE 0 2 |
66884. | 3521 1982 12TH GRADE 1 2 |
66885. | 3521 1983 1ST YEAR COLLEGE 0 2 |
66886. | 3521 1984 12TH GRADE 1 2 |
|----------------------------------------------------------------|
66887. | 3521 1985 1ST YEAR COLLEGE 0 2 |
66888. | 3521 1986 1ST YEAR COLLEGE 0 2 |
66889. | 3521 1987 2ND YEAR COLLEGE 0 2 |
66890. | 3521 1988 2ND YEAR COLLEGE 0 2 |
66891. | 3521 1989 2ND YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
66892. | 3521 1990 2ND YEAR COLLEGE 0 2 |
66893. | 3521 1991 2ND YEAR COLLEGE 0 2 |
66894. | 3521 1992 2ND YEAR COLLEGE 0 2 |
66895. | 3521 1993 2ND YEAR COLLEGE 0 2 |
66896. | 3521 1994 2ND YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
66897. | 3521 1996 2ND YEAR COLLEGE 0 2 |
66898. | 3521 1998 2ND YEAR COLLEGE 0 2 |
66899. | 3521 2000 1ST YEAR COLLEGE 0 2 |
152077. | 8005 1979 12TH GRADE 1 2 |
152078. | 8005 1980 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
152079. | 8005 1981 12TH GRADE 0 2 |
152080. | 8005 1982 12TH GRADE 0 2 |
152081. | 8005 1983 12TH GRADE 1 2 |
152082. | 8005 1984 1ST YEAR COLLEGE 0 2 |
152083. | 8005 1985 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
152084. | 8005 1986 1ST YEAR COLLEGE 0 2 |
152085. | 8005 1987 1ST YEAR COLLEGE 0 2 |
152086. | 8005 1988 1ST YEAR COLLEGE 0 2 |
152087. | 8005 1989 1ST YEAR COLLEGE 0 2 |
152088. | 8005 1990 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
152089. | 8005 1991 1ST YEAR COLLEGE 0 2 |
152090. | 8005 1992 1ST YEAR COLLEGE 0 2 |
152091. | 8005 1993 1ST YEAR COLLEGE 0 2 |
152092. | 8005 1994 1ST YEAR COLLEGE 0 2 |
152093. | 8005 1996 1ST YEAR COLLEGE 0 2 |
|----------------------------------------------------------------|
152094. | 8005 1998 1ST YEAR COLLEGE 0 2 |
152095. | 8005 2000 1ST YEAR COLLEGE 0 2 |
+----------------------------------------------------------------+
In both cases the subject reported “1ST YEAR COLLEGE” for their education for a while, then when back to reporting “12TH GRADE” for a while before reporting college again. That’s a problem with the data, not your code, but you may need to change your code to deal with it.
Exercise 9
Two options:
gen age25 = (age==25)
sort id age25
by id: gen income25a = income[_N] if age25[_N]
sort id year
gen income_if_25 = income if age==25
by id: egen income25b = mean(income_if_25)
assert income25a==income25b
. gen age25 = (age==25)
. sort id age25
. by id: gen income25a = income[_N] if age25[_N]
(22,686 missing values generated)
. sort id year
.
. gen income_if_25 = income if age==25
(229,542 missing values generated)
. by id: egen income25b = mean(income_if_25)
(22,686 missing values generated)
.
. assert income25a==income25b
.
The assert
command tells us they give exactly the same results.
12.4 Restructuring Data
Exercise 1
clear
use nlsy
duplicates report id year
list in 1/6, ab(30)
. clear
. use nlsy
.
. duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 241034 0
--------------------------------------
. list in 1/6, ab(30)
+-------------------------------------------------------+
| id year year_of_birth edu income age |
|-------------------------------------------------------|
1. | 1 1979 58 12TH GRADE 4620 21 |
2. | 1 1980 58 . . 22 |
3. | 1 1981 58 12TH GRADE 5000 23 |
4. | 1 1982 58 . . 24 |
5. | 1 1983 58 . . 25 |
|-------------------------------------------------------|
6. | 1 1984 58 . . 26 |
+-------------------------------------------------------+
.
The unique identifiers are id
and year
, telling us we have one observation per person (id)
per year. A person-year combination (or just year) is a level one unit, and a person is a level two unit. Looking at a few observations we see that year_of_birth
is a level two variable (it does not change over time) and edu
, income
, and age
are level one variables.
duplicates report id
reshape wide edu income age, i(id) j(year)
reshape long edu income age, i(id) j(year)
duplicates report id
. duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
19 | 241034 228348
--------------------------------------
. reshape wide edu income age, i(id) j(year)
(j = 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1996 1998 2000)
Data Long -> Wide
-----------------------------------------------------------------------------
Number of observations 241,034 -> 12,686
Number of variables 6 -> 59
j variable (19 values) year -> (dropped)
xij variables:
edu -> edu1979 edu1980 ... edu2000
income -> income1979 income1980 ... income2000
age -> age1979 age1980 ... age2000
-----------------------------------------------------------------------------
. reshape long edu income age, i(id) j(year)
(j = 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1996 1998 2000)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 12,686 -> 241,034
Number of variables 59 -> 6
j variable (19 values) -> year
xij variables:
edu1979 edu1980 ... edu2000 -> edu
income1979 income1980 ... income2000 -> income
age1979 age1980 ... age2000 -> age
-----------------------------------------------------------------------------
. duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
19 | 241034 228348
--------------------------------------
.
This data set starts with exactly 19 observations for each person. Even if the NLSY researchers could not find a person in a given year, they created an observation for them with missing data. Thus, unlike the ACS, no new observations are created when the data set is reshaped from wide to long.
Exercise 2
clear
use nlsy
collapse ///
///
(first) year_of_birth mean) income ///
(max) edu, ///
(by(id)
l in 1/5, ab(30)
. clear
. use nlsy
.
. collapse ///
> (first) year_of_birth ///
> (mean) income ///
> (max) edu, ///
> by(id)
.
. l in 1/5, ab(30)
+-------------------------------------+
| id year_of_birth income edu |
|-------------------------------------|
1. | 1 58 4810 12 |
2. | 2 59 11289.47 12 |
3. | 3 61 4735.294 12 |
4. | 4 62 12874 14 |
5. | 5 59 20631 18 |
+-------------------------------------+
.
Aggregating income across multiple decades without correcting for inflation is a very bad idea, but you’ll learn how to correct for inflation as an example in the next chapter.
12.5 Combining Data Sets
Exercise 1
acs_race.dta
and acs_education.dta
contain different variables for the same people (open them and look at the data browser), so they need to be merged.
clear
use acs_race
merge 1:1 household person using acs_education
. clear
. use acs_race
. merge 1:1 household person using acs_education
Result Number of obs
-----------------------------------------
Not matched 0
Matched 27,410 (_merge==3)
-----------------------------------------
.
acs_adults.dta
and acs_children.dta
contain the same variables for different people, so they need to be appended.
clear
use acs_adults
append using acs_children
. clear
. use acs_adults
. append using acs_children
(label edu_label already defined)
(label race_label already defined)
(label marital_status_label already defined)
.
Exercise 2
First take a look at the data sets in the data browser and note the data structure. For the book we’ll skip to confirming the identifiers.
clear
use nlsy7980
duplicates report id year
. clear
. use nlsy7980
. duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 25372 0
--------------------------------------
.
nlsy7980
has one observation per person (id
) per year, so it’s in long form.
clear
use nlsy8182
duplicates report id
describe
. clear
. use nlsy8182
. duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 12686 0
--------------------------------------
. describe
Contains data from nlsy8182.dta
Observations: 12,686
Variables: 8 27 Dec 2022 13:13
----------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
----------------------------------------------------------------------------------------------------
id float %9.0g ID# (1-12686) 79
edu1981 float %24.0g edulabel 1981 edu
income1981 float %9.0g 1981 income
age1981 float %9.0g 1981 age
edu1982 float %24.0g edulabel 1982 edu
income1982 float %9.0g 1982 income
age1982 float %9.0g 1982 age
year_of_birth float %16.0g
----------------------------------------------------------------------------------------------------
Sorted by: id
.
nlsy8182
has one observation per person and one set of variables per year, so it’s in wide form. Before we can combine these data sets we must put them in the same form.
If you want the result to be in long form, them you need to reshape nlsy8182
to long form and then use append
:
reshape long edu income age, i(id) j(year)
append using nlsy7980
describe
. reshape long edu income age, i(id) j(year)
(j = 1981 1982)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 12,686 -> 25,372
Number of variables 8 -> 6
j variable (2 values) -> year
xij variables:
edu1981 edu1982 -> edu
income1981 income1982 -> income
age1981 age1982 -> age
-----------------------------------------------------------------------------
. append using nlsy7980
(label edulabel already defined)
. describe
Contains data
Observations: 50,744
Variables: 6
----------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
----------------------------------------------------------------------------------------------------
id float %9.0g ID# (1-12686) 79
year int %10.0g
edu float %24.0g edulabel
income float %9.0g
age float %9.0g
year_of_birth float %16.0g
----------------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
.
If you want the result to be in wide form, you must reshape nlsy7980 to wide form and then use merge
:
clear
use nlsy7980
reshape wide edu income age, i(id) j(year)
merge 1:1 id using nlsy8182
describe
. clear
. use nlsy7980
. reshape wide edu income age, i(id) j(year)
(j = 1979 1980)
Data Long -> Wide
-----------------------------------------------------------------------------
Number of observations 25,372 -> 12,686
Number of variables 6 -> 8
j variable (2 values) year -> (dropped)
xij variables:
edu -> edu1979 edu1980
income -> income1979 income1980
age -> age1979 age1980
-----------------------------------------------------------------------------
. merge 1:1 id using nlsy8182
(label edulabel already defined)
Result Number of obs
-----------------------------------------
Not matched 0
Matched 12,686 (_merge==3)
-----------------------------------------
. describe
Contains data
Observations: 12,686
Variables: 15
----------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
----------------------------------------------------------------------------------------------------
id float %9.0g ID# (1-12686) 79
edu1979 float %24.0g edulabel 1979 edu
income1979 float %9.0g 1979 income
age1979 float %9.0g 1979 age
edu1980 float %24.0g edulabel 1980 edu
income1980 float %9.0g 1980 income
age1980 float %9.0g 1980 age
year_of_birth float %16.0g
edu1981 float %24.0g edulabel 1981 edu
income1981 float %9.0g 1981 income
age1981 float %9.0g 1981 age
edu1982 float %24.0g edulabel 1982 edu
income1982 float %9.0g 1982 income
age1982 float %9.0g 1982 age
_merge byte %23.0g _merge Matching result from merge
----------------------------------------------------------------------------------------------------
Sorted by: id
Note: Dataset has changed since last saved.
.
Exercise 3
clear
use nlsy_person
merge 1:m id using nlsy_person_year
. clear
. use nlsy_person
. merge 1:m id using nlsy_person_year
Result Number of obs
-----------------------------------------
Not matched 0
Matched 241,034 (_merge==3)
-----------------------------------------
.
Exercise 4
First take a look at the problem:
clear
use nlsy_error
duplicates report id year
bysort id year: gen copies = _N
list if copies>1, ab(30)
. clear
. use nlsy_error
. duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 240974 0
2 | 80 40
--------------------------------------
. bysort id year: gen copies = _N
. list if copies>1, ab(30)
+------------------------------------------------------------------------+
| id year year_of_birth edu income age copies |
|------------------------------------------------------------------------|
9507. | 501 1985 63 3RD YEAR COLLEGE 3000 22 2 |
9508. | 501 1985 63 3RD YEAR COLLEGE 3000 22 2 |
26450. | 1393 1979 58 12TH GRADE 7000 21 2 |
26451. | 1393 1979 58 12TH GRADE 7000 21 2 |
26452. | 1393 1980 58 12TH GRADE 7000 22 2 |
|------------------------------------------------------------------------|
26453. | 1393 1980 58 12TH GRADE 7000 22 2 |
26454. | 1393 1981 58 12TH GRADE 7000 23 2 |
26455. | 1393 1981 58 12TH GRADE 7000 23 2 |
26456. | 1393 1982 58 12TH GRADE 6000 24 2 |
26457. | 1393 1982 58 12TH GRADE 6000 24 2 |
|------------------------------------------------------------------------|
26458. | 1393 1983 58 12TH GRADE 31519 25 2 |
26459. | 1393 1983 58 12TH GRADE 31519 25 2 |
26460. | 1393 1984 58 12TH GRADE 22000 26 2 |
26461. | 1393 1984 58 12TH GRADE 22000 26 2 |
26462. | 1393 1985 58 12TH GRADE 39389 27 2 |
|------------------------------------------------------------------------|
26463. | 1393 1985 58 12TH GRADE 39389 27 2 |
26464. | 1393 1986 58 12TH GRADE 43119 28 2 |
26465. | 1393 1986 58 12TH GRADE 43119 28 2 |
26466. | 1393 1987 58 12TH GRADE 10000 29 2 |
26467. | 1393 1987 58 12TH GRADE 10000 29 2 |
|------------------------------------------------------------------------|
26468. | 1393 1988 58 12TH GRADE 20000 30 2 |
26469. | 1393 1988 58 12TH GRADE 20000 30 2 |
26470. | 1393 1989 58 12TH GRADE 25000 31 2 |
26471. | 1393 1989 58 12TH GRADE 25000 31 2 |
26472. | 1393 1990 58 12TH GRADE 31000 32 2 |
|------------------------------------------------------------------------|
26473. | 1393 1990 58 12TH GRADE 31000 32 2 |
26474. | 1393 1991 58 12TH GRADE 40000 33 2 |
26475. | 1393 1991 58 12TH GRADE 40000 33 2 |
26476. | 1393 1992 58 12TH GRADE 32000 34 2 |
26477. | 1393 1992 58 12TH GRADE 32000 34 2 |
|------------------------------------------------------------------------|
26478. | 1393 1993 58 12TH GRADE 49000 35 2 |
26479. | 1393 1993 58 12TH GRADE 49000 35 2 |
26480. | 1393 1994 58 12TH GRADE 58000 36 2 |
26481. | 1393 1994 58 12TH GRADE 58000 36 2 |
26482. | 1393 1996 58 1ST YEAR COLLEGE 43000 38 2 |
|------------------------------------------------------------------------|
26483. | 1393 1996 58 1ST YEAR COLLEGE 43000 38 2 |
26484. | 1393 1998 58 1ST YEAR COLLEGE 58000 40 2 |
26485. | 1393 1998 58 1ST YEAR COLLEGE 58000 40 2 |
26486. | 1393 2000 58 3RD YEAR COLLEGE 40000 42 2 |
26487. | 1393 2000 58 3RD YEAR COLLEGE 40000 42 2 |
|------------------------------------------------------------------------|
38971. | 2051 1979 64 7TH GRADE . 15 2 |
38972. | 2051 1979 64 8TH GRADE . 15 2 |
38973. | 2051 1980 64 . . 16 2 |
38974. | 2051 1980 64 8TH GRADE . 16 2 |
38975. | 2051 1981 64 . . 17 2 |
|------------------------------------------------------------------------|
38976. | 2051 1981 64 9TH GRADE . 17 2 |
38977. | 2051 1982 64 10TH GRADE . 18 2 |
38978. | 2051 1982 64 . . 18 2 |
38979. | 2051 1983 64 11TH GRADE 2180 19 2 |
38980. | 2051 1983 64 12TH GRADE 1500 19 2 |
|------------------------------------------------------------------------|
38981. | 2051 1984 64 1ST YEAR COLLEGE 2000 20 2 |
38982. | 2051 1984 64 11TH GRADE 4500 20 2 |
38983. | 2051 1985 64 2ND YEAR COLLEGE 2000 21 2 |
38984. | 2051 1985 64 11TH GRADE 5546 21 2 |
38985. | 2051 1986 64 11TH GRADE 10000 22 2 |
|------------------------------------------------------------------------|
38986. | 2051 1986 64 2ND YEAR COLLEGE 3000 22 2 |
38987. | 2051 1987 64 3RD YEAR COLLEGE 6400 23 2 |
38988. | 2051 1987 64 11TH GRADE 11000 23 2 |
38989. | 2051 1988 64 4TH YEAR COLLEGE 12000 24 2 |
38990. | 2051 1988 64 11TH GRADE 7000 24 2 |
|------------------------------------------------------------------------|
38991. | 2051 1989 64 11TH GRADE 2500 25 2 |
38992. | 2051 1989 64 4TH YEAR COLLEGE 22000 25 2 |
38993. | 2051 1990 64 11TH GRADE 9000 26 2 |
38994. | 2051 1990 64 . . 26 2 |
38995. | 2051 1991 64 4TH YEAR COLLEGE 28000 27 2 |
|------------------------------------------------------------------------|
38996. | 2051 1991 64 11TH GRADE 8556 27 2 |
38997. | 2051 1992 64 4TH YEAR COLLEGE 28000 28 2 |
38998. | 2051 1992 64 11TH GRADE 10000 28 2 |
38999. | 2051 1993 64 4TH YEAR COLLEGE 30000 29 2 |
39000. | 2051 1993 64 11TH GRADE 14000 29 2 |
|------------------------------------------------------------------------|
39001. | 2051 1994 64 4TH YEAR COLLEGE 39000 30 2 |
39002. | 2051 1994 64 11TH GRADE 13000 30 2 |
39003. | 2051 1996 64 4TH YEAR COLLEGE 55000 32 2 |
39004. | 2051 1996 64 11TH GRADE 15000 32 2 |
39005. | 2051 1998 64 11TH GRADE 17751 34 2 |
|------------------------------------------------------------------------|
39006. | 2051 1998 64 4TH YEAR COLLEGE 60000 34 2 |
39007. | 2051 2000 64 . . 36 2 |
39008. | 2051 2000 64 11TH GRADE 30000 36 2 |
123731. | 6512 1980 57 12TH GRADE 0 23 2 |
123732. | 6512 1980 57 12TH GRADE 2900 24 2 |
+------------------------------------------------------------------------+
.
Some of these are completely duplicate observations which can be remedied with duplicates drop
:
duplicates drop
Duplicates in terms of all variables
(20 observations deleted)
Now examine the remaining problems:
by id year: replace copies = _N
list if copies>1, ab(30)
. by id year: replace copies = _N
(20 real changes made)
. list if copies>1, ab(30)
+------------------------------------------------------------------------+
| id year year_of_birth edu income age copies |
|------------------------------------------------------------------------|
38951. | 2051 1979 64 7TH GRADE . 15 2 |
38952. | 2051 1979 64 8TH GRADE . 15 2 |
38953. | 2051 1980 64 . . 16 2 |
38954. | 2051 1980 64 8TH GRADE . 16 2 |
38955. | 2051 1981 64 . . 17 2 |
|------------------------------------------------------------------------|
38956. | 2051 1981 64 9TH GRADE . 17 2 |
38957. | 2051 1982 64 10TH GRADE . 18 2 |
38958. | 2051 1982 64 . . 18 2 |
38959. | 2051 1983 64 11TH GRADE 2180 19 2 |
38960. | 2051 1983 64 12TH GRADE 1500 19 2 |
|------------------------------------------------------------------------|
38961. | 2051 1984 64 1ST YEAR COLLEGE 2000 20 2 |
38962. | 2051 1984 64 11TH GRADE 4500 20 2 |
38963. | 2051 1985 64 2ND YEAR COLLEGE 2000 21 2 |
38964. | 2051 1985 64 11TH GRADE 5546 21 2 |
38965. | 2051 1986 64 11TH GRADE 10000 22 2 |
|------------------------------------------------------------------------|
38966. | 2051 1986 64 2ND YEAR COLLEGE 3000 22 2 |
38967. | 2051 1987 64 3RD YEAR COLLEGE 6400 23 2 |
38968. | 2051 1987 64 11TH GRADE 11000 23 2 |
38969. | 2051 1988 64 4TH YEAR COLLEGE 12000 24 2 |
38970. | 2051 1988 64 11TH GRADE 7000 24 2 |
|------------------------------------------------------------------------|
38971. | 2051 1989 64 11TH GRADE 2500 25 2 |
38972. | 2051 1989 64 4TH YEAR COLLEGE 22000 25 2 |
38973. | 2051 1990 64 11TH GRADE 9000 26 2 |
38974. | 2051 1990 64 . . 26 2 |
38975. | 2051 1991 64 4TH YEAR COLLEGE 28000 27 2 |
|------------------------------------------------------------------------|
38976. | 2051 1991 64 11TH GRADE 8556 27 2 |
38977. | 2051 1992 64 4TH YEAR COLLEGE 28000 28 2 |
38978. | 2051 1992 64 11TH GRADE 10000 28 2 |
38979. | 2051 1993 64 4TH YEAR COLLEGE 30000 29 2 |
38980. | 2051 1993 64 11TH GRADE 14000 29 2 |
|------------------------------------------------------------------------|
38981. | 2051 1994 64 4TH YEAR COLLEGE 39000 30 2 |
38982. | 2051 1994 64 11TH GRADE 13000 30 2 |
38983. | 2051 1996 64 4TH YEAR COLLEGE 55000 32 2 |
38984. | 2051 1996 64 11TH GRADE 15000 32 2 |
38985. | 2051 1998 64 11TH GRADE 17751 34 2 |
|------------------------------------------------------------------------|
38986. | 2051 1998 64 4TH YEAR COLLEGE 60000 34 2 |
38987. | 2051 2000 64 . . 36 2 |
38988. | 2051 2000 64 11TH GRADE 30000 36 2 |
123711. | 6512 1980 57 12TH GRADE 0 23 2 |
123712. | 6512 1980 57 12TH GRADE 2900 24 2 |
+------------------------------------------------------------------------+
.
The remaining observations are not duplicates and suggest different people were assigned the same id
by accident. For real work you might investigate further, including contacting the data provider, but for now you can eliminate all the problem observations with:
drop if copies>1
(40 observations deleted)
Now you could run a one-to-one merge using id
and year
:
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 240994 0
--------------------------------------
12.6 Learning More
clear
sysuse auto
gen company = word(make, 1)
gen model = subinstr(make, company + " ", "", 1)
list make company model
. clear
. sysuse auto
(1978 automobile data)
. gen company = word(make, 1)
. gen model = subinstr(make, company + " ", "", 1)
. list make company model
+-------------------------------------------+
| make company model |
|-------------------------------------------|
1. | AMC Concord AMC Concord |
2. | AMC Pacer AMC Pacer |
3. | AMC Spirit AMC Spirit |
4. | Buick Century Buick Century |
5. | Buick Electra Buick Electra |
|-------------------------------------------|
6. | Buick LeSabre Buick LeSabre |
7. | Buick Opel Buick Opel |
8. | Buick Regal Buick Regal |
9. | Buick Riviera Buick Riviera |
10. | Buick Skylark Buick Skylark |
|-------------------------------------------|
11. | Cad. Deville Cad. Deville |
12. | Cad. Eldorado Cad. Eldorado |
13. | Cad. Seville Cad. Seville |
14. | Chev. Chevette Chev. Chevette |
15. | Chev. Impala Chev. Impala |
|-------------------------------------------|
16. | Chev. Malibu Chev. Malibu |
17. | Chev. Monte Carlo Chev. Monte Carlo |
18. | Chev. Monza Chev. Monza |
19. | Chev. Nova Chev. Nova |
20. | Dodge Colt Dodge Colt |
|-------------------------------------------|
21. | Dodge Diplomat Dodge Diplomat |
22. | Dodge Magnum Dodge Magnum |
23. | Dodge St. Regis Dodge St. Regis |
24. | Ford Fiesta Ford Fiesta |
25. | Ford Mustang Ford Mustang |
|-------------------------------------------|
26. | Linc. Continental Linc. Continental |
27. | Linc. Mark V Linc. Mark V |
28. | Linc. Versailles Linc. Versailles |
29. | Merc. Bobcat Merc. Bobcat |
30. | Merc. Cougar Merc. Cougar |
|-------------------------------------------|
31. | Merc. Marquis Merc. Marquis |
32. | Merc. Monarch Merc. Monarch |
33. | Merc. XR-7 Merc. XR-7 |
34. | Merc. Zephyr Merc. Zephyr |
35. | Olds 98 Olds 98 |
|-------------------------------------------|
36. | Olds Cutl Supr Olds Cutl Supr |
37. | Olds Cutlass Olds Cutlass |
38. | Olds Delta 88 Olds Delta 88 |
39. | Olds Omega Olds Omega |
40. | Olds Starfire Olds Starfire |
|-------------------------------------------|
41. | Olds Toronado Olds Toronado |
42. | Plym. Arrow Plym. Arrow |
43. | Plym. Champ Plym. Champ |
44. | Plym. Horizon Plym. Horizon |
45. | Plym. Sapporo Plym. Sapporo |
|-------------------------------------------|
46. | Plym. Volare Plym. Volare |
47. | Pont. Catalina Pont. Catalina |
48. | Pont. Firebird Pont. Firebird |
49. | Pont. Grand Prix Pont. Grand Prix |
50. | Pont. Le Mans Pont. Le Mans |
|-------------------------------------------|
51. | Pont. Phoenix Pont. Phoenix |
52. | Pont. Sunbird Pont. Sunbird |
53. | Audi 5000 Audi 5000 |
54. | Audi Fox Audi Fox |
55. | BMW 320i BMW 320i |
|-------------------------------------------|
56. | Datsun 200 Datsun 200 |
57. | Datsun 210 Datsun 210 |
58. | Datsun 510 Datsun 510 |
59. | Datsun 810 Datsun 810 |
60. | Fiat Strada Fiat Strada |
|-------------------------------------------|
61. | Honda Accord Honda Accord |
62. | Honda Civic Honda Civic |
63. | Mazda GLC Mazda GLC |
64. | Peugeot 604 Peugeot 604 |
65. | Renault Le Car Renault Le Car |
|-------------------------------------------|
66. | Subaru Subaru Subaru |
67. | Toyota Celica Toyota Celica |
68. | Toyota Corolla Toyota Corolla |
69. | Toyota Corona Toyota Corona |
70. | VW Dasher VW Dasher |
|-------------------------------------------|
71. | VW Diesel VW Diesel |
72. | VW Rabbit VW Rabbit |
73. | VW Scirocco VW Scirocco |
74. | Volvo 260 Volvo 260 |
+-------------------------------------------+
.