clear
use https://sscc.wisc.edu/~rdimond/pa871/card
6 Instrumental Variables
Instrumental variables regression tries to fix the problem of endogenous variables by predicting them using variables that are related to the endogenous variable but do not directly affect the outcome of interest. Good luck finding them!
Card wants to predict wages using education, but suspects both education and wages are affected by ability. So he uses an indicator for having a four-year college in the county as an instrument for education. The assumption is that living near a college makes people more likely to attend college, but does not affect wages. What do you think?
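As a sketch of the setup (the notation here is mine, not from Card's paper), the two equations are

\[
\begin{aligned}
\text{lwage}_i &= \beta_0 + \beta_1\,\text{educ}_i + \mathbf{x}_i'\boldsymbol{\gamma} + u_i && \text{(structural equation; ability hides in } u_i) \\
\text{educ}_i &= \pi_0 + \pi_1\,\text{nearc4}_i + \mathbf{x}_i'\boldsymbol{\delta} + v_i && \text{(first stage)}
\end{aligned}
\]

where x_i holds the other controls (exper, black, south, married, smsa). IV needs relevance, \(\pi_1 \neq 0\), and exclusion, \(\operatorname{Cov}(\text{nearc4}_i, u_i) = 0\): living near a college may affect wages only through educ.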
First, ignore the issue and run OLS.
reg lwage educ exper black south married smsa
Source | SS df MS Number of obs = 3,003
-------------+---------------------------------- F(6, 2996) = 219.15
Model | 180.255137 6 30.0425229 Prob > F = 0.0000
Residual | 410.705979 2,996 .137084773 R-squared = 0.3050
-------------+---------------------------------- Adj R-squared = 0.3036
Total | 590.961117 3,002 .196855802 Root MSE = .37025
------------------------------------------------------------------------------
lwage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .0711729 .0034824 20.44 0.000 .0643447 .078001
exper | .0341518 .0022144 15.42 0.000 .0298098 .0384938
black | -.1660274 .0176137 -9.43 0.000 -.2005636 -.1314913
south | -.1315518 .0149691 -8.79 0.000 -.1609024 -.1022011
married | -.0358707 .0034012 -10.55 0.000 -.0425396 -.0292019
smsa | .1757871 .0154578 11.37 0.000 .1454782 .2060961
_cons | 5.063317 .0637402 79.44 0.000 4.938338 5.188296
------------------------------------------------------------------------------
Now let's do IV regression "by hand." First, regress educ on nearc4 and the other predictors from the model. This is the first-stage regression.
reg educ nearc4 exper black south married smsa
Source | SS df MS Number of obs = 3,003
-------------+---------------------------------- F(6, 2996) = 456.14
Model | 10272.0963 6 1712.01605 Prob > F = 0.0000
Residual | 11244.7835 2,996 3.75326552 R-squared = 0.4774
-------------+---------------------------------- Adj R-squared = 0.4764
Total | 21516.8798 3,002 7.16751492 Root MSE = 1.9373
------------------------------------------------------------------------------
educ | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
nearc4 | .3272826 .0824239 3.97 0.000 .1656695 .4888957
exper | -.404434 .0089402 -45.24 0.000 -.4219636 -.3869044
black | -.9475281 .0905256 -10.47 0.000 -1.125027 -.7700295
south | -.2973528 .0790643 -3.76 0.000 -.4523787 -.1423269
married | -.0726936 .0177473 -4.10 0.000 -.1074918 -.0378954
smsa | .4208945 .084868 4.96 0.000 .2544891 .5873
_cons | 16.8307 .1307475 128.73 0.000 16.57433 17.08706
------------------------------------------------------------------------------
You need to be sure that the instrument is relevant. Since you only have one, you can just look at its p-value, but with more than one you'd do a joint test.
test nearc4
( 1) nearc4 = 0
F( 1, 2996) = 15.77
Prob > F = 0.0001
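With more than one instrument, the joint test would look something like the following (a sketch only; fatheduc and motheduc aren't added to the first stage until later in this section). The test command reports an F statistic for the joint null that all the listed coefficients are zero.
reg educ nearc4 fatheduc motheduc exper black south married smsa
test nearc4 fatheduc motheduc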
You’ve also got a decent R-squared, so this is promising. Next, get the predicted values.
predict educ_hat
(option xb assumed; fitted values)
(7 missing values generated)
Now use educ_hat instead of educ in the original regression.
reg lwage educ_hat exper black south married smsa
Source | SS df MS Number of obs = 3,003
-------------+---------------------------------- F(6, 2996) = 132.47
Model | 123.906397 6 20.6510662 Prob > F = 0.0000
Residual | 467.05472 2,996 .155892764 R-squared = 0.2097
-------------+---------------------------------- Adj R-squared = 0.2081
Total | 590.961117 3,002 .196855802 Root MSE = .39483
------------------------------------------------------------------------------
lwage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
educ_hat | .1241645 .0513261 2.42 0.016 .0235265 .2248026
exper | .0555883 .0208426 2.67 0.008 .0147212 .0964555
black | -.1156853 .0521334 -2.22 0.027 -.2179061 -.0134644
south | -.1131646 .0238815 -4.74 0.000 -.1599903 -.0663388
married | -.0319754 .0052264 -6.12 0.000 -.0422231 -.0217276
smsa | .1477063 .0317426 4.65 0.000 .0854668 .2099459
_cons | 4.162471 .8728954 4.77 0.000 2.450936 5.874006
------------------------------------------------------------------------------
The estimated effect of education is now bigger. That supports using IV regression, but it doesn't match the story that the problem is ability: if ability inflated both education and wages, OLS would be biased upward and the IV estimate would come out smaller than the OLS estimate, not larger.
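If you want the OLS and by-hand estimates side by side, estimates store and estimates table will do it (a sketch; run it right after the two regressions, and note that educ and educ_hat appear on separate rows because the names differ).
estimates store byhand
quietly reg lwage educ exper black south married smsa
estimates store ols
estimates table ols byhand, b se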
The standard errors we have here are wrong, because they don't take into account the uncertainty in educ_hat. To get the right values, use ivregress.
ivregress 2sls lwage (educ=nearc4) exper black south married smsa, first
First-stage regressions
-----------------------
Number of obs = 3,003
F(6, 2996) = 456.14
Prob > F = 0.0000
R-squared = 0.4774
Adj R-squared = 0.4764
Root MSE = 1.9373
------------------------------------------------------------------------------
educ | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
exper | -.404434 .0089402 -45.24 0.000 -.4219636 -.3869044
black | -.9475281 .0905256 -10.47 0.000 -1.125027 -.7700295
south | -.2973528 .0790643 -3.76 0.000 -.4523787 -.1423269
married | -.0726936 .0177473 -4.10 0.000 -.1074918 -.0378954
smsa | .4208945 .084868 4.96 0.000 .2544891 .5873
nearc4 | .3272826 .0824239 3.97 0.000 .1656695 .4888957
_cons | 16.8307 .1307475 128.73 0.000 16.57433 17.08706
------------------------------------------------------------------------------
Instrumental-variables 2SLS regression Number of obs = 3,003
Wald chi2(6) = 840.98
Prob > chi2 = 0.0000
R-squared = 0.2513
Root MSE = .38384
------------------------------------------------------------------------------
lwage | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .1241642 .0498975 2.49 0.013 .0263668 .2219616
exper | .0555882 .0202624 2.74 0.006 .0158746 .0953019
black | -.1156855 .0506823 -2.28 0.022 -.2150211 -.01635
south | -.1131647 .0232168 -4.87 0.000 -.1586687 -.0676607
married | -.0319754 .005081 -6.29 0.000 -.0419339 -.0220169
smsa | .1477065 .0308591 4.79 0.000 .0872237 .2081893
_cons | 4.162476 .8485997 4.91 0.000 2.499251 5.825701
------------------------------------------------------------------------------
Endogenous: educ
Exogenous: exper black south married smsa nearc4
Now that we've run this with Stata's official command for IV regression, Stata can also give us diagnostics. Start with tests of our first-stage regression, the one that predicts educ.
estat firststage
First-stage regression summary statistics
--------------------------------------------------------------------------
| Adjusted Partial
Variable | R-sq. R-sq. R-sq. F(1,2996) Prob > F
-------------+------------------------------------------------------------
educ | 0.4774 0.4764 0.0052 15.7667 0.0001
--------------------------------------------------------------------------
Minimum eigenvalue statistic = 15.7667
Critical Values # of endogenous regressors: 1
H0: Instruments are weak # of excluded instruments: 1
---------------------------------------------------------------------
| 5% 10% 20% 30%
2SLS relative bias | (not available)
-----------------------------------+---------------------------------
| 10% 15% 20% 25%
2SLS size of nominal 5% Wald test | 16.38 8.96 6.66 5.53
LIML size of nominal 5% Wald test | 16.38 8.96 6.66 5.53
---------------------------------------------------------------------
The Partial R-sq. tells us how much of the variation in educ is explained by the instrument nearc4 after partialling out the other predictors. Not a lot! The critical values tell us that tests with a nominal 5% rejection rate could actually reject more than 10% of the time, because the minimum eigenvalue statistic is less than 16.38.
So let's add fatheduc and motheduc as instruments, on the theory that parents' education predicts the child's education but not their wages. (Maybe…)
ivregress 2sls lwage (educ=nearc4 fatheduc motheduc) exper black south married smsa, first
First-stage regressions
-----------------------
Number of obs = 2,215
F(8, 2206) = 258.94
Prob > F = 0.0000
R-squared = 0.4843
Adj R-squared = 0.4824
Root MSE = 1.8611
------------------------------------------------------------------------------
educ | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
exper | -.3419814 .0111778 -30.59 0.000 -.3639016 -.3200613
black | -.3098986 .1196026 -2.59 0.010 -.5444441 -.0753532
south | -.0988176 .0883931 -1.12 0.264 -.2721599 .0745247
married | -.0687513 .019922 -3.45 0.001 -.1078192 -.0296834
smsa | .2950191 .0961899 3.07 0.002 .1063868 .4836514
nearc4 | .1995035 .0926045 2.15 0.031 .0179023 .3811047
fatheduc | .1104177 .0145384 7.59 0.000 .0819072 .1389282
motheduc | .1314276 .0170012 7.73 0.000 .0980876 .1647676
_cons | 13.84753 .2409811 57.46 0.000 13.37496 14.3201
------------------------------------------------------------------------------
Instrumental-variables 2SLS regression Number of obs = 2,215
Wald chi2(6) = 579.08
Prob > chi2 = 0.0000
R-squared = 0.2604
Root MSE = .3778
------------------------------------------------------------------------------
lwage | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .1031677 .0125667 8.21 0.000 .0785375 .1277979
exper | .049754 .0054989 9.05 0.000 .0389763 .0605317
black | -.1285532 .0256141 -5.02 0.000 -.1787559 -.0783506
south | -.1135096 .0179675 -6.32 0.000 -.1487253 -.0782939
married | -.034256 .004119 -8.32 0.000 -.0423291 -.0261829
smsa | .1597245 .0195485 8.17 0.000 .1214102 .1980387
_cons | 4.490855 .2151892 20.87 0.000 4.069092 4.912618
------------------------------------------------------------------------------
Endogenous: educ
Exogenous: exper black south married smsa nearc4 fatheduc motheduc
estat firststage
First-stage regression summary statistics
--------------------------------------------------------------------------
| Adjusted Partial
Variable | R-sq. R-sq. R-sq. F(3,2206) Prob > F
-------------+------------------------------------------------------------
educ | 0.4843 0.4824 0.1058 86.9839 0.0000
--------------------------------------------------------------------------
Minimum eigenvalue statistic = 86.9839
Critical Values # of endogenous regressors: 1
H0: Instruments are weak # of excluded instruments: 3
---------------------------------------------------------------------
| 5% 10% 20% 30%
2SLS relative bias | 13.91 9.08 6.46 5.39
-----------------------------------+---------------------------------
| 10% 15% 20% 25%
2SLS size of nominal 5% Wald test | 22.30 12.83 9.54 7.80
LIML size of nominal 5% Wald test | 6.46 4.36 3.69 3.32
---------------------------------------------------------------------
Note how the overall R-squared barely budged, but the partial R-squared increased by a lot. Parental education explains some of the same variation in educ as the other predictors, but the partial R-squared sets that shared part aside and counts only what the instruments explain on their own, and that is the variation that actually identifies the effect of educ in the second stage. The result is a much higher minimum eigenvalue statistic that is well above the critical values. Note too that we now have critical values for relative bias, which tell us how the bias introduced by using IV compares with the bias we get by ignoring the endogeneity in OLS.
If our instruments were weak, we'd need to use tests that account for that. The test below checks whether the coefficient on educ is zero using a method that remains valid even when instruments are weak, so rejecting here is good news.
estat weakrobust
Test robust to weak instruments
Model VCE: Unadjusted
( 1) educ = 0
Cond. likelihood-ratio (CLR) test = 61.65
Prob > CLR = 0.0000
Note: CLR test reported by default because
model is overidentified.
The test rejects, so the effect of education holds up even under weak-instrument-robust inference. We can get the adjusted confidence interval anyway, but it isn't noticeably different from the 2SLS interval.
estat weakrobust, ci
Confidence interval robust to weak instruments
Model VCE: Unadjusted
--------------------------------------------------
| CLR
| Coefficient [95% conf. interval]
-------------+------------------------------------
educ | .1031677 .0789756 .1290319
--------------------------------------------------
Note: CLR CI reported by default because model is
overidentified.
We can also test whether the residuals are correlated with the instruments. That would be bad (it would mean the instruments are invalid or the second-stage model is misspecified), so we do not want to reject here.
estat overid
Tests of overidentifying restrictions:
Sargan (score) chi2(2) = 2.52743 (p = 0.2826)
Basmann chi2(2) = 2.52004 (p = 0.2836)
Finally, we can test whether educ is endogenous after all. Rejecting means that yes, we need to do IV.
estat endog
Tests of endogeneity
H0: Variables are exogenous
Durbin (score) chi2(1) = 6.4691 (p = 0.0110)
Wu-Hausman F(1,2207) = 6.46461 (p = 0.0111)
Keep in mind these are the tests that make Nick Huntington-Klein grumpy, for good reasons (which he explains in The Effect).
There are alternatives to 2SLS, and they’re easy to use. The differences are usually small.
ivregress liml lwage (educ=nearc4 fatheduc motheduc) exper black south married smsa
Instrumental-variables LIML regression Number of obs = 2,215
Wald chi2(6) = 578.54
Prob > chi2 = 0.0000
R-squared = 0.2601
Root MSE = .3779
------------------------------------------------------------------------------
lwage | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .1034932 .0126311 8.19 0.000 .0787367 .1282498
exper | .0498853 .0055231 9.03 0.000 .0390602 .0607104
black | -.1282741 .0256429 -5.00 0.000 -.1785333 -.0780149
south | -.1134166 .0179758 -6.31 0.000 -.1486485 -.0781847
married | -.0342356 .0041208 -8.31 0.000 -.0423123 -.0261589
smsa | .1595524 .0195646 8.16 0.000 .1212064 .1978983
_cons | 4.485329 .2162747 20.74 0.000 4.061439 4.90922
------------------------------------------------------------------------------
Endogenous: educ
Exogenous: exper black south married smsa nearc4 fatheduc motheduc
ivregress gmm lwage (educ=nearc4 fatheduc motheduc) exper black south married smsa
Instrumental-variables GMM regression Number of obs = 2,215
Wald chi2(6) = 610.27
Prob > chi2 = 0.0000
R-squared = 0.2604
GMM weight matrix: Robust Root MSE = .37781
------------------------------------------------------------------------------
| Robust
lwage | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .1031753 .0129463 7.97 0.000 .0778009 .1285496
exper | .0497186 .0055644 8.94 0.000 .0388125 .0606246
black | -.1308671 .0258903 -5.05 0.000 -.1816111 -.0801231
south | -.1127678 .0179677 -6.28 0.000 -.1479839 -.0775517
married | -.0343101 .0042511 -8.07 0.000 -.0426421 -.0259781
smsa | .1620557 .0191305 8.47 0.000 .1245606 .1995507
_cons | 4.48988 .2208802 20.33 0.000 4.056963 4.922798
------------------------------------------------------------------------------
Endogenous: educ
Exogenous: exper black south married smsa nearc4 fatheduc motheduc