Stat209/Ed260 D Rogosa 2/1/16 Solutions Assignment 4. Multilevel data, ecological fallacy ---------------------------------------------------------------------- NOTE: R multilevel (mixed effects) modelling (lmer) exercises in Lab 2 ---------------------------------------------------------------------- Problem 1 So this is another data generation exercise, with the purpose of illustrating the multiple regression multi-level analyses in week 4 Tues lecture A bit tedious, but at least be familiar with the analysis options and what those give you. > # grouping problem > library(MASS) # MASS is part of R-distribution, > x = c(1:200) # integers 1 to 200 > ?sample # to see how sample works > var(x) [1] 3350 > u = rnorm(200,0,70) > y = 10 + x +u #this gives me theoretical x,y correlation .637 > cor(x,y) [1] 0.6389937 #pretty close # could have been fancier and used mvrnorm and gotten target exactly > totreg = lm(y ~ x) #individual level regression, theoretical slope 1.0 > summary(totreg) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -231.923 -50.030 -2.716 50.311 210.550 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.6415 9.9208 0.669 0.504 x 1.0006 0.0856 11.689 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 69.89 on 198 degrees of freedom Multiple R-Squared: 0.4083, Adjusted R-squared: 0.4053 F-statistic: 136.6 on 1 and 198 DF, p-value: < 2.2e-16 > #group by x into 10 groups of 20 on the basis of x #here's a crude do loop for clumping the x's and getting group means, there's many many less simple-minded ways another syntax for this using the indices easier take elements mean(x[1:20]) etc > xBar = c() > for(i in 1:10) { + xBar[i] = mean(x[(i - 1)*20 + 1: 20]) } > xBar [1] 10.5 30.5 50.5 70.5 90.5 110.5 130.5 150.5 170.5 190.5 > var(xBar) #var uses n-1 in denom, try ?var [1] 3666.667 > var(x) [1] 3350 > var(xBar)*.9/var(x) # approximately eta-sq in the sample [1] 0.9850746 # so eta-sq essentially 1 because we grouped on x > > yBar = c() > for(i in 1:10) { + yBar[i] = mean(y[(i - 1)*20 + 1: 20]) } > yBar [1] -6.75261 55.07633 58.72489 91.47110 78.39501 137.13548 134.15138 141.41726 198.67811 [10] 183.66834 > mean(y[1:20]) # check the do-loop directly [1] -6.75261 > mean(y[21:40]) [1] 55.07633 > mean(y[41:60]) [1] 58.72489 #between group regression (10 observations) > betwreg = lm(yBar ~ xBar) > summary(betwreg) Call: lm(formula = yBar ~ xBar) Residuals: Min 1Q Median 3Q Max -23.4243 -15.5664 -0.7001 17.3284 21.0734 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.1105 11.8190 0.517 0.619 xBar 1.0058 0.1021 9.851 9.49e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 18.55 on 8 degrees of freedom Multiple R-Squared: 0.9238, Adjusted R-squared: 0.9143 F-statistic: 97.05 on 1 and 8 DF, p-value: 9.49e-06 # for relative standing easiest to make a vector of length 200 where each individual's attribute is the group mean. # I did this in the crudest possible way, laughably crude > xBar200 = c(rep(xBar[1],20), rep(xBar[2],20),rep(xBar[3],20),rep(xBar[4],20),rep(xBar[5],20),rep(xBar[6],20),rep(xBar[7],20),rep(xBar[8],20),rep(xBar[9],20),rep(xBar[10],20) ) > xBar200 [1] 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 [31] 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 [61] 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 90.5 90.5 90.5 90.5 90.5 90.5 90.5 90.5 90.5 90.5 [91] 90.5 90.5 90.5 90.5 90.5 90.5 90.5 90.5 90.5 90.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 [121] 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 [151] 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 [181] 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 > yBar200 = c(rep(yBar[1],20), rep(yBar[2],20),rep(yBar[3],20),rep(yBar[4],20),rep(yBar[5],20),rep(yBar[6],20),rep(yBar[7],20),rep(yBar[8],20),rep(yBar[9],20),rep(yBar[10],20) ) > relx = x - xBar200 > rely = y - yBar200 # some description > cor(relx, rely) [1] 0.04065814 > summary(rely) Min. 1st Qu. Median Mean 3rd Qu. Max. -2.274e+02 -4.637e+01 -5.717e+00 2.118e-15 5.049e+01 1.841e+02 > summary(y) Min. 1st Qu. Median Mean 3rd Qu. Max. -115.90 42.77 110.60 107.20 169.40 363.60 > summary(yBar) Min. 1st Qu. Median Mean 3rd Qu. Max. -6.753 63.640 112.800 107.200 140.300 198.700 #relative standing regression > relreg = lm(rely ~ relx) > summary(relreg) Call: lm(formula = rely ~ relx) Residuals: Min 1Q Median 3Q Max -228.074 -46.075 -5.102 49.937 188.174 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.283e-15 4.794e+00 4.76e-16 1.000 relx 4.761e-01 8.314e-01 0.573 0.568 Residual standard error: 67.8 on 198 degrees of freedom Multiple R-Squared: 0.001653, Adjusted R-squared: -0.003389 F-statistic: 0.3279 on 1 and 198 DF, p-value: 0.5676 > #multiple regression versions > regcontext = lm(y ~ x + xBar200) > summary(regcontext) Call: lm(formula = y ~ x + xBar200) Residuals: Min 1Q Median 3Q Max -231.294 -52.950 -2.927 53.340 206.462 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.1105 9.9741 0.613 0.541 x 0.4761 0.8584 0.555 0.580 xBar200 0.5298 0.8627 0.614 0.540 Residual standard error: 70 on 197 degrees of freedom Multiple R-Squared: 0.4094, Adjusted R-squared: 0.4034 F-statistic: 68.29 on 2 and 197 DF, p-value: < 2.2e-16 # .4761 coeff of x matches the relative standing slope (within-pooled) # .5298 coeff of xBar matches between slope minus relative standing slope (context effect) > regcronbach = lm(y ~ relx + xBar200) > summary(regcronbach) Call: lm(formula = y ~ relx + xBar200) Residuals: Min 1Q Median 3Q Max -231.294 -52.950 -2.927 53.340 206.462 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.11046 9.97406 0.613 0.541 relx 0.47607 0.85837 0.555 0.580 xBar200 1.00583 0.08616 11.674 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 70 on 197 degrees of freedom Multiple R-Squared: 0.4094, Adjusted R-squared: 0.4034 F-statistic: 68.29 on 2 and 197 DF, p-value: < 2.2e-16 # .4761 coeff of x matches the relative standing slope (win-pooled) # 1.0058 coeff of xBar matches between slope Approximate Duncan-Cuzzort-Duncan relation > .985*1.0058 + (1 - .985)*.476 [1] 0.997853 close to the total slope 1.00 ----------------------------------------------------------------- > # now try random grouping (part II) use "Rnd" suffix to indicate groups formed random grouping (independent of x) > #get indices > k = 1:200 > rndind = sample(k , size = 200, p = rep(1:1,200)/200) > rndind [1] 48 52 197 31 79 62 69 129 172 84 20 177 43 195 113 24 75 74 173 144 65 137 41 168 3 185 103 141 107 170 184 198 101 8 126 180 153 32 196 80 95 183 124 56 18 [46] 4 169 67 167 28 138 109 146 37 50 122 148 174 96 178 46 72 115 121 40 19 83 81 108 42 133 128 190 70 164 145 132 92 36 176 13 16 158 10 9 49 90 159 152 160 [91] 1 78 11 110 182 7 58 188 147 179 45 162 91 130 155 106 60 191 39 66 47 117 15 61 166 135 116 57 123 112 140 54 86 44 2 114 163 51 21 14 82 26 154 189 29 [136] 55 118 139 27 200 161 22 102 199 150 6 35 143 157 104 64 53 165 131 105 98 151 89 149 85 73 59 12 30 25 136 88 5 134 156 119 93 171 186 68 87 194 97 192 94 [181] 142 17 100 33 187 71 193 181 77 125 99 120 34 38 63 23 111 127 76 175 #what I did here is draw a random ordering of the integers 1:200 that I'll use to create groups for part II (Rnd suffix) now create the groups > xBarRnd = c() > for(i in 1:10) { + xBarRnd[i] = mean(x[rndind[(i -1)*20 + 1:20]]) } #result shows xBar rather similar across the 10 groups > xBarRnd [1] 98.05 118.90 105.45 99.65 88.85 99.70 85.40 108.45 100.95 99.60 > var(xBarRnd) [1] 88.70833 > var(x) [1] 3350 > var(xBarRnd)/var(x) [1] 0.0264801 very small value of the eta-sq > #grouping very small (groups not formed on x) > yBarRnd = c() > for(i in 1:10) { + yBarRnd[i] = mean(y[rndind[(i -1)*20 + 1:20]]) } > yBarRnd [1] 126.79351 135.30195 91.02650 76.22435 88.59928 96.39977 71.77486 147.88802 107.30942 130.64764 > cor(xBarRnd, yBarRnd) [1] 0.6581125 > cor(x,y) [1] 0.6389937 > cor(xBar, yBar) [1] 0.9611697 # grouping on X gave a much higher between-group correlation > betwregRnd = lm(yBarRnd ~ xBarRnd) > summary(betwregRnd) Call: lm(formula = yBarRnd ~ xBarRnd) Residuals: Min 1Q Median 3Q Max -29.399 -8.854 -3.339 18.842 25.974 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -78.8572 75.5530 -1.044 0.3271 xBarRnd 1.8513 0.7488 2.472 0.0386 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 21.16 on 8 degrees of freedom Multiple R-Squared: 0.4331, Adjusted R-squared: 0.3623 F-statistic: 6.112 on 1 and 8 DF, p-value: 0.03857 #between slope is larger from random grouping than under grouping on X > plot(xBarRnd, yBarRnd) #now create the elongated group mean vector > xBarRnd200 = c(rep(xBarRnd[1],20), rep(xBarRnd[2],20),rep(xBarRnd[3],20),rep(xBarRnd[4],20),rep(xBarRnd[5],20),rep(xBarRnd[6],20),rep(xBarRnd[7],20),rep(xBarRnd[8],20),rep(xBarRnd[9],20),rep(xBarRnd[10],20) ) > yBarRnd200 = c(rep(yBarRnd[1],20), rep(yBarRnd[2],20),rep(yBarRnd[3],20),rep(yBarRnd[4],20),rep(yBarRnd[5],20),rep(yBarRnd[6],20),rep(yBarRnd[7],20),rep(yBarRnd[8],20),rep(yBarRnd[9],20),rep(yBarRnd[10],20) ) #relative standing measures > relxRnd = x - xBarRnd200 > relyRnd = y - yBarRnd200 > relregRnd = lm(relyRnd ~ relxRnd) > summary(relregRnd) Call: lm(formula = relyRnd ~ relxRnd) Residuals: Min 1Q Median 3Q Max -208.6740 -50.4961 0.5453 50.0308 196.2732 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.655e-15 5.305e+00 3.12e-16 1 relxRnd 9.376e-01 8.757e-02 10.71 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 75.03 on 198 degrees of freedom Multiple R-Squared: 0.3667, Adjusted R-squared: 0.3635 F-statistic: 114.6 on 1 and 198 DF, p-value: < 2.2e-16 #coeffs for random groups are 1.85 between and .938 within for this ex, your artificial data will differ slightly -------------------------------------------------------------------------- Problem 2 Contextual Effects Coefficient solution posted at http://rogosateaching.com/stat209/hw4p2.pdf -------------------------------------------------------------------------- Problem 3 Gender gap analysis of UK Exam data > ## exam data, problem 3 HW4 > > mixed = read.table("http://web.stanford.edu/~rag/stat209/mExam4347", header = T) > library(lme4) Loading required package: Matrix > ggaplist = lmList(normexam ~ sex|school, data = mixed) > gapCoef = coef(ggaplist) > fivenum(gapCoef[,1]) # fem means [1] -0.83275722 -0.25973997 -0.02976871 0.25940135 0.96429515 > attach(mixed) > cor.test(unique(schavg), gapCoef[,1]) # correlation schavg and Fem mean Pearson's product-moment correlation data: unique(schavg) and gapCoef[, 1] t = 4.0061, df = 31, p-value = 0.0003592 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.3011114 0.7724681 sample estimates: cor 0.5840452 > # that's what the schavg coeff in ggaplmer2 is showing > # do the same for male means > cor.test(unique(schavg), gapCoef[,1] + gapCoef[,2]) # correlation schavg and Male mean Pearson's product-moment correlation data: unique(schavg) and gapCoef[, 1] + gapCoef[, 2] t = 4.6381, df = 31, p-value = 6.044e-05 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.3803045 0.8062057 sample estimates: cor 0.6400477 > # again big; intake is correlated with Fem and male mean levels, but gap does not vary with schavg > plot(unique(schavg), gapCoef[,2], pch = 20) # plot shown in class, schavg not association with ggap > plot(unique(schavg), gapCoef[,1], pch = 20) # female mean > plot(unique(schavg), gapCoef[,1] + gapCoef[,2], pch = 20) # male mean plot > # strong association seen with male and female means, not with ggap =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= --------------------------------------------------------------------- Enrichment problem (better to spend time on Lab2 etc) Ecological fallacy: Is Radon good for you? ------------------------------------ Following p.750 Greenland and Robins American Journal of Epidemiology Vol. 139, No. 8: 747-760 Reading through their exposition probably is helpful here's my attempt (it can be done more elegantly) #make the region indices > r = c(0:40) > r [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 [26] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 #now radon concentrations > regradon = .1 + .3*sqrt(r) > regradon [1] 0.1000000 0.4000000 0.5242641 0.6196152 0.7000000 0.7708204 0.8348469 [8] 0.8937254 0.9485281 1.0000000 1.0486833 1.0949874 1.1392305 1.1816654 [15] 1.2224972 1.2618950 1.3000000 1.3369317 1.3727922 1.4076697 1.4416408 [22] 1.4747727 1.5071247 1.5387495 1.5696938 1.6000000 1.6297059 1.6588457 [29] 1.6874508 1.7155494 1.7431677 1.7703293 1.7970563 1.8233688 1.8492856 [36] 1.8748239 1.9000000 1.9248288 1.9493242 1.9734994 1.9973666 > # set up smoking variable (counterbalanced with radon) > u = rnorm(41) > var(u) [1] 0.9883518 > mean(u) [1] 0.1323434 > #that's a good enough u > s = .2*(34 + .4*r -u) + .4*(13 - .2*r -.5*u) > s [1] 11.65626 12.02922 11.45928 11.86984 11.96606 12.05849 11.54374 11.99844 [9] 11.42576 12.42389 11.88704 12.42439 11.93378 12.39780 11.55852 11.78481 [17] 11.69073 11.74597 12.69344 12.01483 11.84285 10.74535 11.70950 12.54065 [25] 12.25184 12.75854 12.22308 12.12930 11.85289 12.27538 11.76754 11.37832 [33] 12.27040 12.22711 11.75117 11.99551 12.49188 12.12246 11.60682 11.72744 [41] 11.59924 #smoking proportions > p0 = 53 - .2*r + 1.5*u > p1 = 34 + .4*r - u > p2 = 13 - .2*r - .5*u #eq2 p.751 > lungc = .40*(1 + .2*regradon)*exp(s/10)*(p0 + 2.7183*2.7183*p1 + exp(4)*p2) #resulting data > cor(lungc, s) [1] 0.8518681 > cor(lungc, regradon) [1] -0.2588493 > cor(s, regradon) [1] 0.1254401 > plot( regradon, lungc) #my plot is at http://www-stat.stanford.edu/~rag/stat209/hw4p1.pdf #run the regression > luncreg = lm(lungc ~ regradon + s) > summary(luncreg) Call: lm(formula = lungc ~ regradon + s) Residuals: Min 1Q Median 3Q Max -165.490 -28.550 9.463 38.072 77.763 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2012.52 239.73 -8.395 3.48e-10 *** regradon -102.60 16.80 -6.107 4.06e-07 *** s 298.39 20.21 14.768 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 50.42 on 38 degrees of freedom Multiple R-Squared: 0.8616, Adjusted R-squared: 0.8543 F-statistic: 118.2 on 2 and 38 DF, p-value: < 2.2e-16 even with confounder (smoking) "controlled for" [laughter supressed] radon concentration negatively associated with lung cancer in this regional-level regression. Causal conclusion--start pumping radioactive gas into everyone's basements? (or be careful about ecological fallacy, and do analysis at individual level) ============================================================================= end HW4 solutions 2015