Stat209/Ed260 D Rogosa   2/17/16

Solutions Assignment 6.  
Instrumental variable methods, simultaneous equations
--------------------------
  see also Lab3  IV exercises
------------------------


Problem 1.

Recreate the IV artificial "measurement error" data demo from class 
#annotations, details etc are on the class handout 
Here we do the data generation in a more serious/formal way using mvrnorm to get exact sample moments

The observables are: outcome y, predictor x, potential instrument z
Also x and z are "paralell forms", errorful measures of the same latent variable w [ x = w + e, z = w +e']
w has mean 10, var 4. cor(y, w) = sqrt[2/3]. var(w)/var(x) = var(w)/var(z) = .9. cor(x,z) = .9, cor(x,w) = cor(w,z) = sqrt[.9]
All this indicates the means and covariance matrix shown in the data generation

Demonstration is a simple application of a valid instrument z to correct for the effects of measurement error in x in the estimation of the y on w slope.

> library(MASS)

> covm= matrix(c(6,4,4,4,4,4.5,4,4,4,4,4.5,4,4,4,4,4),4,4)  #there are the parameter values that were implemented in the example
#variables "y","x", "z","w"
> covm
     [,1] [,2] [,3] [,4]
[1,]    6  4.0  4.0    4
[2,]    4  4.5  4.0    4
[3,]    4  4.0  4.5    4
[4,]    4  4.0  4.0    4

> ivdat = mvrnorm(n=1000, c(10,10,10,10), covm, empirical = TRUE) #make the data with exact match to moments
> ivdatf = as.data.frame(ivdat) #make this a R-data frame so I can name variables
> head(ivdatf)
         V1        V2        V3        V4
1 17.560684 18.208328 14.928280 16.176263
2  8.663935  9.600583  9.342655  9.050532
3  4.564421  5.968721  6.320625  5.823634
4  5.730969  5.595494  6.206201  6.277184
5  7.911436  8.011063  7.589679  8.780076
6  4.605428  5.359033  5.208565  5.338294
> names(ivdatf) = c("y","x", "z","w")
> head(ivdatf)
          y         x         z         w
1 17.560684 18.208328 14.928280 16.176263
2  8.663935  9.600583  9.342655  9.050532
3  4.564421  5.968721  6.320625  5.823634
4  5.730969  5.595494  6.206201  6.277184
5  7.911436  8.011063  7.589679  8.780076
6  4.605428  5.359033  5.208565  5.338294
> attach(ivdatf)
# some gratitious checks on values
> var(w)
[1] 4
> cor(w,y)
[1] 0.8164966
> var(w)/var(x)
[1] 0.8888889
> var(w)/var(z)
[1] 0.8888889
> cor(x,z)
[1] 0.8888889


> truereg = lm(y~w)
> summary(truereg)

Call:
lm(formula = y ~ w)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7763 -0.9661 -0.0685  0.9429  4.0620 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.595e-14  2.283e-01    0.00        1    
w           1.000e+00  2.238e-02   44.68   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.415 on 998 degrees of freedom
Multiple R-squared: 0.6667,     Adjusted R-squared: 0.6663 
F-statistic:  1996 on 1 and 998 DF,  p-value: < 2.2e-16 

> obsreg = lm(y~x)
> summary(obsreg)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1028 -1.0556 -0.0587  1.0730  4.1880 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.11111    0.23849   4.659 3.61e-06 ***
x            0.88889    0.02333  38.100  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.564 on 998 degrees of freedom
Multiple R-squared: 0.5926,     Adjusted R-squared: 0.5922 
F-statistic:  1452 on 1 and 998 DF,  p-value: < 2.2e-16 

## y on x slope biased downward by measurement error (attenuation week 1)

> cov(y,z)/cov(x,z)  # IV estimate using z as instrument for x
[1] 1

> xonzreg = lm(x~z)           #2SLS by hand
> tslsreg = lm(y ~ fitted(xonzreg))
> summary(tslsreg)

Call:
lm(formula = y ~ fitted(xonzreg))

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5267 -1.0635 -0.0614  0.9779  5.1209 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -1.977e-14  2.671e-01     0.0        1    
fitted(xonzreg)  1.000e+00  2.625e-02    38.1   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.564 on 998 degrees of freedom
Multiple R-squared: 0.5926,     Adjusted R-squared: 0.5922 
F-statistic:  1452 on 1 and 998 DF,  p-value: < 2.2e-16 

##IV estimate gets us back to slope 1.0

> summary(tsls(y~x, ~z))  # use sem package, or use ivreg from AER

 2SLS Estimates

Model Formula: y ~ x

Instruments: ~z

Residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -4.340  -1.080  -0.052   0.000   1.070   4.460 

              Estimate Std. Error    t value Pr(>|t|)
(Intercept) -6.253e-13    0.27010 -2.315e-12        1
x            1.000e+00    0.02654  3.767e+01        0

Residual standard error: 1.5819 on 998 degrees of freedom

> 
# details, interpretation on class handout



------------------------

Problem 2  Duncan Haller Portes occupational aspiration
[note: IV for simultineaty bias is not a point-of-emphasis for our tour]

> dunccor = matrix(nrow = 6, ncol = 6, c(1,.222,.1861,.3355,.4105,.2598,.222,1,.2707,.2302,.3240,.2786,
+ .1861,.2707,1,.2950,.293,.3607,.3355,.2302,.2950,1,.2995,.5007,.4105,.3240,.2930,.2995,1,.4216,
+ .2598,.2786,.3607,.5007,.4216,1))

> dunccor
       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]
[1,] 1.0000 0.2220 0.1861 0.3355 0.4105 0.2598
[2,] 0.2220 1.0000 0.2707 0.2302 0.3240 0.2786
[3,] 0.1861 0.2707 1.0000 0.2950 0.2930 0.3607
[4,] 0.3355 0.2302 0.2950 1.0000 0.2995 0.5007
[5,] 0.4105 0.3240 0.2930 0.2995 1.0000 0.4216
[6,] 0.2598 0.2786 0.3607 0.5007 0.4216 1.0000


make artificial data matching empirical report in study

> duncdatemp = mvrnorm(329, c(0,0,0,0,0,0),dunccor, empirical = TRUE)
> cor(duncdatemp)
       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]
[1,] 1.0000 0.2220 0.1861 0.3355 0.4105 0.2598
[2,] 0.2220 1.0000 0.2707 0.2302 0.3240 0.2786
[3,] 0.1861 0.2707 1.0000 0.2950 0.2930 0.3607
[4,] 0.3355 0.2302 0.2950 1.0000 0.2995 0.5007
[5,] 0.4105 0.3240 0.2930 0.2995 1.0000 0.4216
[6,] 0.2598 0.2786 0.3607 0.5007 0.4216 1.0000
> library(sem)

recreate class handout  2SLS for path diagram

> hndout = tsls(duncdatemp[,6] ~ duncdatemp[,3] + duncdatemp[,4] + duncdatemp[,5], ~ duncdatemp[,1] +
+  duncdatemp[,2] + duncdatemp[,3] + duncdatemp[,4])
> summary(hndout)

 2SLS Estimates

Model Formula: duncdatemp[, 6] ~ duncdatemp[, 3] + duncdatemp[, 4] + duncdatemp[, 
    5]

Instruments: ~duncdatemp[, 1] + duncdatemp[, 2] + duncdatemp[, 3] + duncdatemp[, 
    4]

Residuals:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-2.59e+00 -5.33e-01  3.24e-02 -7.26e-18  5.78e-01  2.37e+00 

                 Estimate Std. Error   t value  Pr(>|t|)
(Intercept)     1.879e-17    0.04457 4.216e-16 1.000e+00
duncdatemp[, 3] 1.567e-01    0.05445 2.877e+00 4.278e-03
duncdatemp[, 4] 3.521e-01    0.05505 6.396e+00 5.554e-10
duncdatemp[, 5] 3.419e-01    0.12478 2.740e+00 6.484e-03

Residual standard error: 0.8084 on 325 degrees of freedom

> hndout2 = tsls(duncdatemp[,5] ~ duncdatemp[,1] + duncdatemp[,2] + duncdatemp[,6], ~ duncdatemp[,1] + 
+ duncdatemp[,2] + duncdatemp[,3] + duncdatemp[,4])                                          
> summary(hndout2)

 2SLS Estimates

Model Formula: duncdatemp[, 5] ~ duncdatemp[, 1] + duncdatemp[, 2] + duncdatemp[, 
    6]

Instruments: ~duncdatemp[, 1] + duncdatemp[, 2] + duncdatemp[, 3] + duncdatemp[, 
    4]

Residuals:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-2.43e+00 -5.71e-01 -8.53e-02 -6.07e-18  5.89e-01  2.13e+00 

                  Estimate Std. Error    t value  Pr(>|t|)
(Intercept)     -2.043e-17    0.04658 -4.386e-16 1.000e+00
duncdatemp[, 1]  2.721e-01    0.05255  5.179e+00 3.923e-07
duncdatemp[, 2]  1.512e-01    0.05364  2.819e+00 5.113e-03
duncdatemp[, 6]  4.034e-01    0.10431  3.867e+00 1.330e-04

Residual standard error: 0.8449 on 325 degrees of freedom

> # most variance in aspiration outside the model
----------------------
ASIDE former additional exercise (left here for possible interest)
> # problem 3 asks you to add a path in the variable 5 prediction from var3
> hndout2mod = tsls(duncdatemp[,5] ~ duncdatemp[,1] + duncdatemp[,2] + duncdatemp[,6] + duncdatemp[,3], ~ duncdatemp[,1] + 
+ duncdatemp[,2] + duncdatemp[,3] + duncdatemp[,4])                                          
> summary(hndout2mod)

 2SLS Estimates

Model Formula: duncdatemp[, 5] ~ duncdatemp[, 1] + duncdatemp[, 2] + duncdatemp[, 
    6] + duncdatemp[, 3]

Instruments: ~duncdatemp[, 1] + duncdatemp[, 2] + duncdatemp[, 3] + duncdatemp[, 
    4]

Residuals:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-2.31e+00 -5.93e-01 -9.05e-02 -1.69e-19  6.08e-01  2.16e+00 

                  Estimate Std. Error    t value  Pr(>|t|)
(Intercept)     -1.639e-17    0.04604 -3.560e-16 1.000e+00
duncdatemp[, 1]  2.855e-01    0.05260  5.427e+00 1.124e-07
duncdatemp[, 2]  1.570e-01    0.05314  2.955e+00 3.359e-03
duncdatemp[, 6]  2.773e-01    0.12975  2.137e+00 3.331e-02
duncdatemp[, 3]  9.733e-02    0.06083  1.600e+00 1.106e-01

Residual standard error: 0.8351 on 324 degrees of freedom

> # raises standard error on the endogenous var[6] prediction and lowers coeff
> # endogenous var representing the causal loop still significant but less so
> # fit barely improved
> END additional exercise
------------------------------

Problem 3.
simulation described in Freedman text (note: page 199 revised version)
bottom p.191-top p.192
scan of pages 191-192 at
http://www-stat.stanford.edu/~rag/stat209/dafp191.pdf

> library(MASS)

> ?mvrnorm # shows you the help file; easiest way to create artificial data
> # Freedman pp191-2, simulate 100 obs
make .3 correlation delta epsilon
> zeddelteps = mvrnorm(n=100, mu = c(0,0,0), Sigma= 
+ matrix(nrow = 3, ncol = 3, data = c(3,0,0,0,1,.3,0,.3,1), byrow = T), empirical = TRUE)
> cor(zeddelteps)
              [,1]         [,2]          [,3]
[1,]  1.000000e+00 2.661070e-16 -3.463923e-16
[2,]  2.661070e-16 1.000000e+00  3.000000e-01
[3,] -3.463923e-16 3.000000e-01  1.000000e+00
> # could also just take what I get from mvrnorm to get close go up to 400 at least

> xsim = .75*zeddelteps[,1] + zeddelteps[,2] # C = .75 set
> cor(xsim, zeddelteps[,1])
[1] 0.7924058

> ysim = .5*xsim + zeddelteps[,3] # beta = .5 set
> cor(ysim,xsim)
[1] 0.7140388
> ols = lm(ysim ~ xsim)
> summary(ols)

Call:
lm(formula = ysim ~ xsim)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8382 -0.4933  0.0961  0.6177  1.8380 

Coefficients:
              Estimate Std. Error   t value Pr(>|t|)    
(Intercept) -2.542e-16  9.881e-02 -2.57e-15        1    
xsim         6.116e-01  6.058e-02     10.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.9881 on 98 degrees of freedom
Multiple R-Squared: 0.5099,     Adjusted R-squared: 0.5048 
F-statistic: 101.9 on 1 and 98 DF,  p-value: < 2.2e-16 
> .1116/sqrt(.06058) # so we have a bit of bias from this correl
[1] 0.4534188
> cor(xsim,zeddelteps[,3])
[1] 0.1829983

> cov(ysim, zeddelteps[,1])/cov(xsim, zeddelteps[,1]) #IV
[1] 0.5
> # so the IV estimate gets beta
> # do 2SLS version of IV
> xpred = lm(xsim ~  zeddelteps[,1])
> stage2 = lm(ysim ~ fitted(xpred))
> summary(stage2)

Call:
lm(formula = ysim ~ fitted(xpred))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2864 -0.9196  0.1151  0.8456  3.3366 

Coefficients:
                Estimate Std. Error   t value Pr(>|t|)    
(Intercept)   -4.441e-17  1.251e-01 -3.55e-16        1    
fitted(xpred)  5.000e-01  9.681e-02     5.165 1.27e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 1.251 on 98 degrees of freedom
Multiple R-Squared: 0.2139,     Adjusted R-squared: 0.2059 
F-statistic: 26.67 on 1 and 98 DF,  p-value: 1.266e-06 

matches IV

> # even in this benign case IV via 2SLS has 50% larger standard error,
> # enough to offset the (slight) bias of OLS???


==============================


==============================
END solutions HW6 2016