6  Instrumental Variables

Instrumental variables (IV) estimation tries to fix the problem of endogenous predictors by predicting them with instruments: variables that affect the endogenous predictor but have no effect on the outcome of interest except through it. Good luck finding them!

Card wants to predict wages using education, but suspects both education and wages are affected by ability. So he uses an indicator for having a four-year college in the county as an instrument for education. The assumption is that living near a college makes people more likely to attend college, but does not otherwise affect wages. What do you think?
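In equation form, the setup can be sketched as follows. The notation is an assumption introduced here for illustration and does not come from the Stata output: x stands for the control variables used in the regressions below, and u and v are error terms.

```latex
\begin{aligned}
\text{lwage} &= \beta_0 + \beta_1\,\text{educ} + \mathbf{x}'\boldsymbol{\gamma} + u
  && \text{(wage equation; educ may be correlated with } u\text{)} \\
\text{educ}  &= \pi_0 + \pi_1\,\text{nearc4} + \mathbf{x}'\boldsymbol{\delta} + v
  && \text{(first stage)} \\
\pi_1 &\neq 0 && \text{(relevance)} \\
\operatorname{Cov}(\text{nearc4},\, u) &= 0 && \text{(exogeneity)}
\end{aligned}
```

Relevance can be checked in the data. With a single instrument, exogeneity cannot be tested, which is why it has to be argued.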

clear
use https://sscc.wisc.edu/~rdimond/pa871/card

First, ignore the issue and run OLS.

reg lwage educ exper black south married smsa

      Source |       SS           df       MS      Number of obs   =     3,003
-------------+----------------------------------   F(6, 2996)      =    219.15
       Model |  180.255137         6  30.0425229   Prob > F        =    0.0000
    Residual |  410.705979     2,996  .137084773   R-squared       =    0.3050
-------------+----------------------------------   Adj R-squared   =    0.3036
       Total |  590.961117     3,002  .196855802   Root MSE        =    .37025

------------------------------------------------------------------------------
       lwage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .0711729   .0034824    20.44   0.000     .0643447     .078001
       exper |   .0341518   .0022144    15.42   0.000     .0298098    .0384938
       black |  -.1660274   .0176137    -9.43   0.000    -.2005636   -.1314913
       south |  -.1315518   .0149691    -8.79   0.000    -.1609024   -.1022011
     married |  -.0358707   .0034012   -10.55   0.000    -.0425396   -.0292019
        smsa |   .1757871   .0154578    11.37   0.000     .1454782    .2060961
       _cons |   5.063317   .0637402    79.44   0.000     4.938338    5.188296
------------------------------------------------------------------------------

Now let’s do IV regression “by hand.” First, regress educ on the instrument nearc4 and the other predictors from the model. This is the first-stage regression.

reg educ nearc4 exper black south married smsa

      Source |       SS           df       MS      Number of obs   =     3,003
-------------+----------------------------------   F(6, 2996)      =    456.14
       Model |  10272.0963         6  1712.01605   Prob > F        =    0.0000
    Residual |  11244.7835     2,996  3.75326552   R-squared       =    0.4774
-------------+----------------------------------   Adj R-squared   =    0.4764
       Total |  21516.8798     3,002  7.16751492   Root MSE        =    1.9373

------------------------------------------------------------------------------
        educ | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      nearc4 |   .3272826   .0824239     3.97   0.000     .1656695    .4888957
       exper |   -.404434   .0089402   -45.24   0.000    -.4219636   -.3869044
       black |  -.9475281   .0905256   -10.47   0.000    -1.125027   -.7700295
       south |  -.2973528   .0790643    -3.76   0.000    -.4523787   -.1423269
     married |  -.0726936   .0177473    -4.10   0.000    -.1074918   -.0378954
        smsa |   .4208945    .084868     4.96   0.000     .2544891       .5873
       _cons |    16.8307   .1307475   128.73   0.000     16.57433    17.08706
------------------------------------------------------------------------------

You need to be sure that the instrument is relevant. Since you only have one instrument, you could just look at the p-value on its coefficient, but with more than one you’d do a joint test.

test nearc4

 ( 1)  nearc4 = 0

       F(  1,  2996) =   15.77
            Prob > F =    0.0001
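With several instruments, the joint test could look like the sketch below. It borrows fatheduc and motheduc (parents’ education, used later in these notes) as extra instruments purely for illustration, and the name first_stage is made up. The single-instrument first stage is stored and restored so that the predict command that follows still uses it.

```stata
* Keep the single-instrument first stage so later commands can still use it
estimates store first_stage

* Hypothetical first stage with three instruments instead of one
quietly regress educ nearc4 fatheduc motheduc exper black south married smsa

* Joint F test that all three excluded instruments have zero coefficients
test nearc4 fatheduc motheduc

* Make the original first stage the active estimation results again
estimates restore first_stage
```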

You’ve also got a decent R-squared, so this is promising. Next, get the predicted values.

predict educ_hat
(option xb assumed; fitted values)
(7 missing values generated)

Now use educ_hat instead of educ in the original regression.

reg lwage educ_hat exper black south married smsa

      Source |       SS           df       MS      Number of obs   =     3,003
-------------+----------------------------------   F(6, 2996)      =    132.47
       Model |  123.906397         6  20.6510662   Prob > F        =    0.0000
    Residual |   467.05472     2,996  .155892764   R-squared       =    0.2097
-------------+----------------------------------   Adj R-squared   =    0.2081
       Total |  590.961117     3,002  .196855802   Root MSE        =    .39483

------------------------------------------------------------------------------
       lwage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    educ_hat |   .1241645   .0513261     2.42   0.016     .0235265    .2248026
       exper |   .0555883   .0208426     2.67   0.008     .0147212    .0964555
       black |  -.1156853   .0521334    -2.22   0.027    -.2179061   -.0134644
       south |  -.1131646   .0238815    -4.74   0.000    -.1599903   -.0663388
     married |  -.0319754   .0052264    -6.12   0.000    -.0422231   -.0217276
        smsa |   .1477063   .0317426     4.65   0.000     .0854668    .2099459
       _cons |   4.162471   .8728954     4.77   0.000     2.450936    5.874006
------------------------------------------------------------------------------

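The estimated effect of education is now bigger, which supports using IV regression but doesn’t match the story that the problem is ability: if unobserved ability raised both education and wages, OLS would overstate the effect of education, and the IV estimate should be smaller.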
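With a single instrument there is another way to get the same point estimate: divide the instrument’s coefficient in the reduced form (lwage on nearc4 and the controls) by its coefficient in the first stage. A minimal sketch, assuming both regressions use the same observations; the scalar names rf and fs are made up for this example.

```stata
* Reduced form: regress the outcome directly on the instrument and controls
quietly regress lwage nearc4 exper black south married smsa
scalar rf = _b[nearc4]

* First stage: regress the endogenous variable on the instrument and controls
quietly regress educ nearc4 exper black south married smsa
scalar fs = _b[nearc4]

* Their ratio is the just-identified IV estimate of the return to education
display "Reduced form / first stage = " scalar(rf) / scalar(fs)
```

If the samples match, the ratio should reproduce the coefficient on educ_hat above.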

The standard errors we have here are wrong, because they don’t take into account the uncertainty in educ_hat. To get the right values, use ivregress.

ivregress 2sls lwage (educ=nearc4) exper black south married smsa, first 

First-stage regressions
-----------------------

                                                        Number of obs =  3,003
                                                        F(6, 2996)    = 456.14
                                                        Prob > F      = 0.0000
                                                        R-squared     = 0.4774
                                                        Adj R-squared = 0.4764
                                                        Root MSE      = 1.9373

------------------------------------------------------------------------------
        educ | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       exper |   -.404434   .0089402   -45.24   0.000    -.4219636   -.3869044
       black |  -.9475281   .0905256   -10.47   0.000    -1.125027   -.7700295
       south |  -.2973528   .0790643    -3.76   0.000    -.4523787   -.1423269
     married |  -.0726936   .0177473    -4.10   0.000    -.1074918   -.0378954
        smsa |   .4208945    .084868     4.96   0.000     .2544891       .5873
      nearc4 |   .3272826   .0824239     3.97   0.000     .1656695    .4888957
       _cons |    16.8307   .1307475   128.73   0.000     16.57433    17.08706
------------------------------------------------------------------------------


Instrumental-variables 2SLS regression            Number of obs   =      3,003
                                                  Wald chi2(6)    =     840.98
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.2513
                                                  Root MSE        =     .38384

------------------------------------------------------------------------------
       lwage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .1241642   .0498975     2.49   0.013     .0263668    .2219616
       exper |   .0555882   .0202624     2.74   0.006     .0158746    .0953019
       black |  -.1156855   .0506823    -2.28   0.022    -.2150211     -.01635
       south |  -.1131647   .0232168    -4.87   0.000    -.1586687   -.0676607
     married |  -.0319754    .005081    -6.29   0.000    -.0419339   -.0220169
        smsa |   .1477065   .0308591     4.79   0.000     .0872237    .2081893
       _cons |   4.162476   .8485997     4.91   0.000     2.499251    5.825701
------------------------------------------------------------------------------
Endogenous: educ
Exogenous:  exper black south married smsa nearc4
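The coefficients match the by-hand version (up to rounding); only the standard errors change. For this model the change amounts to a rescaling: the by-hand second stage estimates the residual standard deviation from residuals built with educ_hat, while 2SLS builds them with the actual educ. A rough check, using only numbers printed in the two outputs above:

```stata
* By-hand standard error for educ_hat, rescaled by the ratio of the two Root MSEs
* (.38384 from ivregress, .39483 from the by-hand second stage)
display .0513261 * (.38384 / .39483)
```

The result should be about .0499, in line with the standard error ivregress reports for educ.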

Because we ran the model with ivregress, Stata’s official command for IV regression, we can also get diagnostics. Start with tests of the first-stage regression, the one that predicts educ.

estat firststage

  First-stage regression summary statistics
  --------------------------------------------------------------------------
               |            Adjusted      Partial
      Variable |   R-sq.       R-sq.        R-sq.     F(1,2996)   Prob > F
  -------------+------------------------------------------------------------
          educ |  0.4774      0.4764       0.0052       15.7667    0.0001
  --------------------------------------------------------------------------


  Minimum eigenvalue statistic = 15.7667     

  Critical Values                      # of endogenous regressors:    1
  H0: Instruments are weak             # of excluded instruments:     1
  ---------------------------------------------------------------------
                                     |    5%     10%     20%     30%
  2SLS relative bias                 |         (not available)
  -----------------------------------+---------------------------------
                                     |   10%     15%     20%     25%
  2SLS size of nominal 5% Wald test  |  16.38    8.96    6.66    5.53
  LIML size of nominal 5% Wald test  |  16.38    8.96    6.66    5.53
  ---------------------------------------------------------------------

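The Partial R-sq tells us how much of the variation in educ is explained by the instrument nearc4 after accounting for the other predictors. Not a lot! The critical values say that a test with a nominal 5% rejection rate could actually reject more than 10% of the time: the minimum eigenvalue statistic (15.77) is less than 16.38, so we cannot rule that out. It does exceed 8.96, so we can reject the hypothesis that the true rejection rate is above 15%.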
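Two quick arithmetic checks tie these numbers together. With a single instrument, the first-stage F statistic (which equals the minimum eigenvalue statistic here) is the square of the t statistic on nearc4, and the partial R-squared can be recovered from that F and its residual degrees of freedom. The inputs are copied from the output above.

```stata
* Square of the first-stage t statistic on nearc4 (coefficient / std. err.)
display (.3272826 / .0824239)^2

* Partial R-squared implied by F(1, 2996) = 15.7667: F / (F + residual df)
display 15.7667 / (15.7667 + 2996)
```

These should come out to roughly 15.77 and 0.0052.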

So let’s add fatheduc and motheduc as instruments, on the theory that parents’ education predicts the child’s education but not wages. (Maybe…)

ivregress 2sls lwage (educ=nearc4 fatheduc motheduc) exper black south married smsa, first 

First-stage regressions
-----------------------

                                                        Number of obs =  2,215
                                                        F(8, 2206)    = 258.94
                                                        Prob > F      = 0.0000
                                                        R-squared     = 0.4843
                                                        Adj R-squared = 0.4824
                                                        Root MSE      = 1.8611

------------------------------------------------------------------------------
        educ | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       exper |  -.3419814   .0111778   -30.59   0.000    -.3639016   -.3200613
       black |  -.3098986   .1196026    -2.59   0.010    -.5444441   -.0753532
       south |  -.0988176   .0883931    -1.12   0.264    -.2721599    .0745247
     married |  -.0687513    .019922    -3.45   0.001    -.1078192   -.0296834
        smsa |   .2950191   .0961899     3.07   0.002     .1063868    .4836514
      nearc4 |   .1995035   .0926045     2.15   0.031     .0179023    .3811047
    fatheduc |   .1104177   .0145384     7.59   0.000     .0819072    .1389282
    motheduc |   .1314276   .0170012     7.73   0.000     .0980876    .1647676
       _cons |   13.84753   .2409811    57.46   0.000     13.37496     14.3201
------------------------------------------------------------------------------


Instrumental-variables 2SLS regression            Number of obs   =      2,215
                                                  Wald chi2(6)    =     579.08
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.2604
                                                  Root MSE        =      .3778

------------------------------------------------------------------------------
       lwage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .1031677   .0125667     8.21   0.000     .0785375    .1277979
       exper |    .049754   .0054989     9.05   0.000     .0389763    .0605317
       black |  -.1285532   .0256141    -5.02   0.000    -.1787559   -.0783506
       south |  -.1135096   .0179675    -6.32   0.000    -.1487253   -.0782939
     married |   -.034256    .004119    -8.32   0.000    -.0423291   -.0261829
        smsa |   .1597245   .0195485     8.17   0.000     .1214102    .1980387
       _cons |   4.490855   .2151892    20.87   0.000     4.069092    4.912618
------------------------------------------------------------------------------
Endogenous: educ
Exogenous:  exper black south married smsa nearc4 fatheduc motheduc

estat firststage

  First-stage regression summary statistics
  --------------------------------------------------------------------------
               |            Adjusted      Partial
      Variable |   R-sq.       R-sq.        R-sq.     F(3,2206)   Prob > F
  -------------+------------------------------------------------------------
          educ |  0.4843      0.4824       0.1058       86.9839    0.0000
  --------------------------------------------------------------------------


  Minimum eigenvalue statistic = 86.9839     

  Critical Values                      # of endogenous regressors:    1
  H0: Instruments are weak             # of excluded instruments:     3
  ---------------------------------------------------------------------
                                     |    5%     10%     20%     30%
  2SLS relative bias                 |  13.91    9.08    6.46    5.39
  -----------------------------------+---------------------------------
                                     |   10%     15%     20%     25%
  2SLS size of nominal 5% Wald test  |  22.30   12.83    9.54    7.80
  LIML size of nominal 5% Wald test  |   6.46    4.36    3.69    3.32
  ---------------------------------------------------------------------

Note how the overall R-squared barely budged, but the partial R-squared increased by a lot. Parental education explains some of the same variation in educ as the other variables, but unlike them it is excluded from the second stage. The result is a much higher minimum eigenvalue statistic, well above the critical values. Note too that we now have critical values for relative bias, which tell us how the bias of 2SLS compares with the bias of OLS that ignores the endogeneity.
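One more thing changed between the two models: Number of obs fell from 3,003 to 2,215. The likely reason is that ivregress drops any observation that is missing one of the new instruments. A quick way to check, assuming the ivregress results above are still the active estimates:

```stata
* Missing-value counts for the two new instruments
misstable summarize fatheduc motheduc

* Observations missing at least one parent's education
count if missing(fatheduc, motheduc)

* Observations actually used by the most recent estimation command
count if e(sample)
```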

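If our instruments were weak, the usual tests and confidence intervals would be unreliable, and we’d need versions that are robust to weak instruments. estat weakrobust provides them. Note what it tests: the null hypothesis is that the coefficient on educ is zero, not that the instruments are weak.

estat weakrobust

Test robust to weak instruments
Model VCE: Unadjusted

 ( 1)  educ = 0

Cond. likelihood-ratio (CLR) test =  61.65
                       Prob > CLR = 0.0000

Note: CLR test reported by default because
      model is overidentified.

The test rejects, so the effect of educ is significantly different from zero even without assuming strong instruments. We can get the weak-instrument-robust confidence interval too, but it isn’t noticeably different from the usual one.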

estat weakrobust, ci

Confidence interval robust to weak instruments
Model VCE: Unadjusted

--------------------------------------------------
             |                        CLR
             | Coefficient    [95% conf. interval]
-------------+------------------------------------
        educ |   .1031677     .0789756    .1290319
--------------------------------------------------
Note: CLR CI reported by default because model is
      overidentified.

We can also test whether the residuals are correlated with the instruments, which is possible only because we have more instruments than endogenous variables. Correlation would be bad (the instruments are invalid or the second-stage model is misspecified), so we do not want to reject here.

estat overid

  Tests of overidentifying restrictions:

  Sargan (score) chi2(2) =  2.52743  (p = 0.2826)
  Basmann chi2(2)        =  2.52004  (p = 0.2836)
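To see what the Sargan statistic measures, it can be rebuilt by hand: regress the 2SLS residuals on all of the exogenous variables (instruments and controls) and multiply the R-squared by the number of observations. This is a sketch of the textbook construction, not a description of Stata’s internal code. The names iv_main and uhat are made up, and the estimates are stored and restored so that later estat commands still refer to the ivregress model.

```stata
* Save the ivregress results, then get the 2SLS residuals for the estimation sample
estimates store iv_main
predict double uhat if e(sample), residuals

* Regress the residuals on every exogenous variable (instruments and controls)
quietly regress uhat nearc4 fatheduc motheduc exper black south married smsa

* N * R-squared, compared with a chi-squared with 2 degrees of freedom
* (3 instruments minus 1 endogenous variable)
display "N*R2    = " e(N)*e(r2)
display "p-value = " chi2tail(2, e(N)*e(r2))

* Restore the ivregress results as the active estimates
estimates restore iv_main
```

If the construction is right, the statistic should be close to the Sargan value reported above. A small value means the instruments have little power to explain the residuals.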

Finally, we can test whether educ is endogenous after all. Rejecting means that yes, we need to do IV.

estat endog

  Tests of endogeneity
  H0: Variables are exogenous

  Durbin (score) chi2(1)          =   6.4691  (p = 0.0110)
  Wu-Hausman F(1,2207)            =  6.46461  (p = 0.0111)

Keep in mind these are the tests that make Nick Huntington-Klein grumpy, for good reasons (which he explains in The Effect).
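The idea behind the endogeneity test can also be sketched by hand as a control-function regression: add the first-stage residuals to the OLS wage equation and test their coefficient. If educ were exogenous, that coefficient should be zero. This is an illustrative sketch, not necessarily how estat endog computes its statistics. The variable name vhat is made up, and the code assumes the 2SLS results with three instruments are still the active estimates, so that e(sample) picks out the same 2,215 observations.

```stata
* First stage on the ivregress estimation sample, keeping the residuals
quietly regress educ nearc4 fatheduc motheduc exper black south married smsa if e(sample)
predict double vhat if e(sample), residuals

* OLS wage equation augmented with the first-stage residuals
regress lwage educ exper black south married smsa vhat

* A significant coefficient on vhat is evidence that educ is endogenous
test vhat
```

The F statistic from test vhat should be close to the Wu-Hausman statistic above.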

There are alternatives to 2SLS, and they’re easy to use. The differences are usually small.

ivregress liml lwage (educ=nearc4 fatheduc motheduc) exper black south married smsa

Instrumental-variables LIML regression            Number of obs   =      2,215
                                                  Wald chi2(6)    =     578.54
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.2601
                                                  Root MSE        =      .3779

------------------------------------------------------------------------------
       lwage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .1034932   .0126311     8.19   0.000     .0787367    .1282498
       exper |   .0498853   .0055231     9.03   0.000     .0390602    .0607104
       black |  -.1282741   .0256429    -5.00   0.000    -.1785333   -.0780149
       south |  -.1134166   .0179758    -6.31   0.000    -.1486485   -.0781847
     married |  -.0342356   .0041208    -8.31   0.000    -.0423123   -.0261589
        smsa |   .1595524   .0195646     8.16   0.000     .1212064    .1978983
       _cons |   4.485329   .2162747    20.74   0.000     4.061439     4.90922
------------------------------------------------------------------------------
Endogenous: educ
Exogenous:  exper black south married smsa nearc4 fatheduc motheduc

ivregress gmm lwage (educ=nearc4 fatheduc motheduc) exper black south married smsa

Instrumental-variables GMM regression             Number of obs   =      2,215
                                                  Wald chi2(6)    =     610.27
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.2604
GMM weight matrix: Robust                         Root MSE        =     .37781

------------------------------------------------------------------------------
             |               Robust
       lwage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        educ |   .1031753   .0129463     7.97   0.000     .0778009    .1285496
       exper |   .0497186   .0055644     8.94   0.000     .0388125    .0606246
       black |  -.1308671   .0258903    -5.05   0.000    -.1816111   -.0801231
       south |  -.1127678   .0179677    -6.28   0.000    -.1479839   -.0775517
     married |  -.0343101   .0042511    -8.07   0.000    -.0426421   -.0259781
        smsa |   .1620557   .0191305     8.47   0.000     .1245606    .1995507
       _cons |    4.48988   .2208802    20.33   0.000     4.056963    4.922798
------------------------------------------------------------------------------
Endogenous: educ
Exogenous:  exper black south married smsa nearc4 fatheduc motheduc
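To compare the three estimators side by side, one option is to store each set of results and tabulate them. A minimal sketch; the names m_2sls, m_liml, and m_gmm are arbitrary labels chosen here.

```stata
* Fit each estimator quietly and store its results under a label
quietly ivregress 2sls lwage (educ=nearc4 fatheduc motheduc) exper black south married smsa
estimates store m_2sls

quietly ivregress liml lwage (educ=nearc4 fatheduc motheduc) exper black south married smsa
estimates store m_liml

quietly ivregress gmm lwage (educ=nearc4 fatheduc motheduc) exper black south married smsa
estimates store m_gmm

* Coefficient and standard error on educ from each estimator
estimates table m_2sls m_liml m_gmm, keep(educ) b(%9.4f) se(%9.4f)
```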