Stat209/Ed260 D Rogosa 2/17/16 Solutions Assignment 6. Instrumental variable methods, simultaneous equations -------------------------- see also Lab3 IV exercises ------------------------ Problem 1. Recreate the IV artificial "measurement error" data demo from class #annotations, details etc are on the class handout Here we do the data generation in a more serious/formal way using mvrnorm to get exact sample moments The observables are: outcome y, predictor x, potential instrument z Also x and z are "paralell forms", errorful measures of the same latent variable w [ x = w + e, z = w +e'] w has mean 10, var 4. cor(y, w) = sqrt[2/3]. var(w)/var(x) = var(w)/var(z) = .9. cor(x,z) = .9, cor(x,w) = cor(w,z) = sqrt[.9] All this indicates the means and covariance matrix shown in the data generation Demonstration is a simple application of a valid instrument z to correct for the effects of measurement error in x in the estimation of the y on w slope. > library(MASS) > covm= matrix(c(6,4,4,4,4,4.5,4,4,4,4,4.5,4,4,4,4,4),4,4) #there are the parameter values that were implemented in the example #variables "y","x", "z","w" > covm [,1] [,2] [,3] [,4] [1,] 6 4.0 4.0 4 [2,] 4 4.5 4.0 4 [3,] 4 4.0 4.5 4 [4,] 4 4.0 4.0 4 > ivdat = mvrnorm(n=1000, c(10,10,10,10), covm, empirical = TRUE) #make the data with exact match to moments > ivdatf = as.data.frame(ivdat) #make this a R-data frame so I can name variables > head(ivdatf) V1 V2 V3 V4 1 17.560684 18.208328 14.928280 16.176263 2 8.663935 9.600583 9.342655 9.050532 3 4.564421 5.968721 6.320625 5.823634 4 5.730969 5.595494 6.206201 6.277184 5 7.911436 8.011063 7.589679 8.780076 6 4.605428 5.359033 5.208565 5.338294 > names(ivdatf) = c("y","x", "z","w") > head(ivdatf) y x z w 1 17.560684 18.208328 14.928280 16.176263 2 8.663935 9.600583 9.342655 9.050532 3 4.564421 5.968721 6.320625 5.823634 4 5.730969 5.595494 6.206201 6.277184 5 7.911436 8.011063 7.589679 8.780076 6 4.605428 5.359033 5.208565 5.338294 > attach(ivdatf) # some gratitious checks on values > var(w) [1] 4 > cor(w,y) [1] 0.8164966 > var(w)/var(x) [1] 0.8888889 > var(w)/var(z) [1] 0.8888889 > cor(x,z) [1] 0.8888889 > truereg = lm(y~w) > summary(truereg) Call: lm(formula = y ~ w) Residuals: Min 1Q Median 3Q Max -3.7763 -0.9661 -0.0685 0.9429 4.0620 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.595e-14 2.283e-01 0.00 1 w 1.000e+00 2.238e-02 44.68 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.415 on 998 degrees of freedom Multiple R-squared: 0.6667, Adjusted R-squared: 0.6663 F-statistic: 1996 on 1 and 998 DF, p-value: < 2.2e-16 > obsreg = lm(y~x) > summary(obsreg) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -4.1028 -1.0556 -0.0587 1.0730 4.1880 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.11111 0.23849 4.659 3.61e-06 *** x 0.88889 0.02333 38.100 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.564 on 998 degrees of freedom Multiple R-squared: 0.5926, Adjusted R-squared: 0.5922 F-statistic: 1452 on 1 and 998 DF, p-value: < 2.2e-16 ## y on x slope biased downward by measurement error (attenuation week 1) > cov(y,z)/cov(x,z) # IV estimate using z as instrument for x [1] 1 > xonzreg = lm(x~z) #2SLS by hand > tslsreg = lm(y ~ fitted(xonzreg)) > summary(tslsreg) Call: lm(formula = y ~ fitted(xonzreg)) Residuals: Min 1Q Median 3Q Max -4.5267 -1.0635 -0.0614 0.9779 5.1209 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.977e-14 2.671e-01 0.0 1 fitted(xonzreg) 1.000e+00 2.625e-02 38.1 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.564 on 998 degrees of freedom Multiple R-squared: 0.5926, Adjusted R-squared: 0.5922 F-statistic: 1452 on 1 and 998 DF, p-value: < 2.2e-16 ##IV estimate gets us back to slope 1.0 > summary(tsls(y~x, ~z)) # use sem package, or use ivreg from AER 2SLS Estimates Model Formula: y ~ x Instruments: ~z Residuals: Min. 1st Qu. Median Mean 3rd Qu. Max. -4.340 -1.080 -0.052 0.000 1.070 4.460 Estimate Std. Error t value Pr(>|t|) (Intercept) -6.253e-13 0.27010 -2.315e-12 1 x 1.000e+00 0.02654 3.767e+01 0 Residual standard error: 1.5819 on 998 degrees of freedom > # details, interpretation on class handout ------------------------ Problem 2 Duncan Haller Portes occupational aspiration [note: IV for simultineaty bias is not a point-of-emphasis for our tour] > dunccor = matrix(nrow = 6, ncol = 6, c(1,.222,.1861,.3355,.4105,.2598,.222,1,.2707,.2302,.3240,.2786, + .1861,.2707,1,.2950,.293,.3607,.3355,.2302,.2950,1,.2995,.5007,.4105,.3240,.2930,.2995,1,.4216, + .2598,.2786,.3607,.5007,.4216,1)) > dunccor [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1.0000 0.2220 0.1861 0.3355 0.4105 0.2598 [2,] 0.2220 1.0000 0.2707 0.2302 0.3240 0.2786 [3,] 0.1861 0.2707 1.0000 0.2950 0.2930 0.3607 [4,] 0.3355 0.2302 0.2950 1.0000 0.2995 0.5007 [5,] 0.4105 0.3240 0.2930 0.2995 1.0000 0.4216 [6,] 0.2598 0.2786 0.3607 0.5007 0.4216 1.0000 make artificial data matching empirical report in study > duncdatemp = mvrnorm(329, c(0,0,0,0,0,0),dunccor, empirical = TRUE) > cor(duncdatemp) [,1] [,2] [,3] [,4] [,5] [,6] [1,] 1.0000 0.2220 0.1861 0.3355 0.4105 0.2598 [2,] 0.2220 1.0000 0.2707 0.2302 0.3240 0.2786 [3,] 0.1861 0.2707 1.0000 0.2950 0.2930 0.3607 [4,] 0.3355 0.2302 0.2950 1.0000 0.2995 0.5007 [5,] 0.4105 0.3240 0.2930 0.2995 1.0000 0.4216 [6,] 0.2598 0.2786 0.3607 0.5007 0.4216 1.0000 > library(sem) recreate class handout 2SLS for path diagram > hndout = tsls(duncdatemp[,6] ~ duncdatemp[,3] + duncdatemp[,4] + duncdatemp[,5], ~ duncdatemp[,1] + + duncdatemp[,2] + duncdatemp[,3] + duncdatemp[,4]) > summary(hndout) 2SLS Estimates Model Formula: duncdatemp[, 6] ~ duncdatemp[, 3] + duncdatemp[, 4] + duncdatemp[, 5] Instruments: ~duncdatemp[, 1] + duncdatemp[, 2] + duncdatemp[, 3] + duncdatemp[, 4] Residuals: Min. 1st Qu. Median Mean 3rd Qu. Max. -2.59e+00 -5.33e-01 3.24e-02 -7.26e-18 5.78e-01 2.37e+00 Estimate Std. Error t value Pr(>|t|) (Intercept) 1.879e-17 0.04457 4.216e-16 1.000e+00 duncdatemp[, 3] 1.567e-01 0.05445 2.877e+00 4.278e-03 duncdatemp[, 4] 3.521e-01 0.05505 6.396e+00 5.554e-10 duncdatemp[, 5] 3.419e-01 0.12478 2.740e+00 6.484e-03 Residual standard error: 0.8084 on 325 degrees of freedom > hndout2 = tsls(duncdatemp[,5] ~ duncdatemp[,1] + duncdatemp[,2] + duncdatemp[,6], ~ duncdatemp[,1] + + duncdatemp[,2] + duncdatemp[,3] + duncdatemp[,4]) > summary(hndout2) 2SLS Estimates Model Formula: duncdatemp[, 5] ~ duncdatemp[, 1] + duncdatemp[, 2] + duncdatemp[, 6] Instruments: ~duncdatemp[, 1] + duncdatemp[, 2] + duncdatemp[, 3] + duncdatemp[, 4] Residuals: Min. 1st Qu. Median Mean 3rd Qu. Max. -2.43e+00 -5.71e-01 -8.53e-02 -6.07e-18 5.89e-01 2.13e+00 Estimate Std. Error t value Pr(>|t|) (Intercept) -2.043e-17 0.04658 -4.386e-16 1.000e+00 duncdatemp[, 1] 2.721e-01 0.05255 5.179e+00 3.923e-07 duncdatemp[, 2] 1.512e-01 0.05364 2.819e+00 5.113e-03 duncdatemp[, 6] 4.034e-01 0.10431 3.867e+00 1.330e-04 Residual standard error: 0.8449 on 325 degrees of freedom > # most variance in aspiration outside the model ---------------------- ASIDE former additional exercise (left here for possible interest) > # problem 3 asks you to add a path in the variable 5 prediction from var3 > hndout2mod = tsls(duncdatemp[,5] ~ duncdatemp[,1] + duncdatemp[,2] + duncdatemp[,6] + duncdatemp[,3], ~ duncdatemp[,1] + + duncdatemp[,2] + duncdatemp[,3] + duncdatemp[,4]) > summary(hndout2mod) 2SLS Estimates Model Formula: duncdatemp[, 5] ~ duncdatemp[, 1] + duncdatemp[, 2] + duncdatemp[, 6] + duncdatemp[, 3] Instruments: ~duncdatemp[, 1] + duncdatemp[, 2] + duncdatemp[, 3] + duncdatemp[, 4] Residuals: Min. 1st Qu. Median Mean 3rd Qu. Max. -2.31e+00 -5.93e-01 -9.05e-02 -1.69e-19 6.08e-01 2.16e+00 Estimate Std. Error t value Pr(>|t|) (Intercept) -1.639e-17 0.04604 -3.560e-16 1.000e+00 duncdatemp[, 1] 2.855e-01 0.05260 5.427e+00 1.124e-07 duncdatemp[, 2] 1.570e-01 0.05314 2.955e+00 3.359e-03 duncdatemp[, 6] 2.773e-01 0.12975 2.137e+00 3.331e-02 duncdatemp[, 3] 9.733e-02 0.06083 1.600e+00 1.106e-01 Residual standard error: 0.8351 on 324 degrees of freedom > # raises standard error on the endogenous var[6] prediction and lowers coeff > # endogenous var representing the causal loop still significant but less so > # fit barely improved > END additional exercise ------------------------------ Problem 3. simulation described in Freedman text (note: page 199 revised version) bottom p.191-top p.192 scan of pages 191-192 at http://www-stat.stanford.edu/~rag/stat209/dafp191.pdf > library(MASS) > ?mvrnorm # shows you the help file; easiest way to create artificial data > # Freedman pp191-2, simulate 100 obs make .3 correlation delta epsilon > zeddelteps = mvrnorm(n=100, mu = c(0,0,0), Sigma= + matrix(nrow = 3, ncol = 3, data = c(3,0,0,0,1,.3,0,.3,1), byrow = T), empirical = TRUE) > cor(zeddelteps) [,1] [,2] [,3] [1,] 1.000000e+00 2.661070e-16 -3.463923e-16 [2,] 2.661070e-16 1.000000e+00 3.000000e-01 [3,] -3.463923e-16 3.000000e-01 1.000000e+00 > # could also just take what I get from mvrnorm to get close go up to 400 at least > xsim = .75*zeddelteps[,1] + zeddelteps[,2] # C = .75 set > cor(xsim, zeddelteps[,1]) [1] 0.7924058 > ysim = .5*xsim + zeddelteps[,3] # beta = .5 set > cor(ysim,xsim) [1] 0.7140388 > ols = lm(ysim ~ xsim) > summary(ols) Call: lm(formula = ysim ~ xsim) Residuals: Min 1Q Median 3Q Max -3.8382 -0.4933 0.0961 0.6177 1.8380 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.542e-16 9.881e-02 -2.57e-15 1 xsim 6.116e-01 6.058e-02 10.10 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.9881 on 98 degrees of freedom Multiple R-Squared: 0.5099, Adjusted R-squared: 0.5048 F-statistic: 101.9 on 1 and 98 DF, p-value: < 2.2e-16 > .1116/sqrt(.06058) # so we have a bit of bias from this correl [1] 0.4534188 > cor(xsim,zeddelteps[,3]) [1] 0.1829983 > cov(ysim, zeddelteps[,1])/cov(xsim, zeddelteps[,1]) #IV [1] 0.5 > # so the IV estimate gets beta > # do 2SLS version of IV > xpred = lm(xsim ~ zeddelteps[,1]) > stage2 = lm(ysim ~ fitted(xpred)) > summary(stage2) Call: lm(formula = ysim ~ fitted(xpred)) Residuals: Min 1Q Median 3Q Max -3.2864 -0.9196 0.1151 0.8456 3.3366 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -4.441e-17 1.251e-01 -3.55e-16 1 fitted(xpred) 5.000e-01 9.681e-02 5.165 1.27e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1.251 on 98 degrees of freedom Multiple R-Squared: 0.2139, Adjusted R-squared: 0.2059 F-statistic: 26.67 on 1 and 98 DF, p-value: 1.266e-06 matches IV > # even in this benign case IV via 2SLS has 50% larger standard error, > # enough to offset the (slight) bias of OLS??? ============================== ============================== END solutions HW6 2016