week 6 RQ1  Problem 1

So this is another data generation exercise, with the purpose of illustrating
the multiple regression multi-level analyses in week 6  lecture
A bit tedious, but at least be familiar with the analysis options
and what those give you.

> #  grouping problem
> library(MASS)   # MASS is part of  R-distribution, 

> x = c(1:200)  # integers 1 to 200
> ?sample  # to see how sample works
> var(x)
[1] 3350

> u = rnorm(200,0,70)
> y = 10 + x +u #this gives me theoretical x,y correlation .637 
> cor(x,y)
[1] 0.6389937         #pretty close
# could have been fancier and used mvrnorm and gotten target exactly

> totreg = lm(y ~ x)  #individual level regression, theoretical slope 1.0
> summary(totreg)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-231.923  -50.030   -2.716   50.311  210.550 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.6415     9.9208   0.669    0.504    
x             1.0006     0.0856  11.689   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 69.89 on 198 degrees of freedom
Multiple R-Squared: 0.4083,     Adjusted R-squared: 0.4053 
F-statistic: 136.6 on 1 and 198 DF,  p-value: < 2.2e-16 


> #group by x into 10 groups of 20 on the basis of x

#here's a crude do loop for clumping the x's and getting group means,
there's many many less simple-minded ways

another syntax for this using the indices
easier take elements mean(x[1:20]) etc

> xBar = c()
> for(i in 1:10) {
+ xBar[i] = mean(x[(i - 1)*20 + 1: 20]) }
> xBar
 [1]  10.5  30.5  50.5  70.5  90.5 110.5 130.5 150.5 170.5 190.5


> var(xBar)              #var uses n-1 in denom, try ?var
[1] 3666.667
> var(x)
[1] 3350
> var(xBar)*.9/var(x) # approximately eta-sq in the sample
[1] 0.9850746            # so eta-sq essentially 1 because we grouped on x

> 
> yBar = c()
> for(i in 1:10) {
+ yBar[i] = mean(y[(i - 1)*20 + 1: 20]) }
> yBar
 [1]  -6.75261  55.07633  58.72489  91.47110  78.39501 137.13548 134.15138 141.41726 198.67811
[10] 183.66834

> mean(y[1:20]) # check the do-loop directly
[1] -6.75261
> mean(y[21:40])
[1] 55.07633
> mean(y[41:60])
[1] 58.72489

#between group regression (10 observations)
> betwreg = lm(yBar ~ xBar)
> summary(betwreg)

Call:
lm(formula = yBar ~ xBar)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.4243 -15.5664  -0.7001  17.3284  21.0734 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.1105    11.8190   0.517    0.619    
xBar          1.0058     0.1021   9.851 9.49e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 18.55 on 8 degrees of freedom
Multiple R-Squared: 0.9238,     Adjusted R-squared: 0.9143 
F-statistic: 97.05 on 1 and 8 DF,  p-value: 9.49e-06 


# for relative standing easiest to make a vector of length 200 where
each individual's attribute is the group mean.
# I did this in the crudest possible way, laughably crude

> xBar200 = c(rep(xBar[1],20), rep(xBar[2],20),rep(xBar[3],20),rep(xBar[4],20),rep(xBar[5],20),rep(xBar[6],20),rep(xBar[7],20),rep(xBar[8],20),rep(xBar[9],20),rep(xBar[10],20) )
> xBar200
  [1]  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  10.5  30.5  30.5  30.5  30.5  30.5  30.5  30.5  30.5  30.5  30.5
 [31]  30.5  30.5  30.5  30.5  30.5  30.5  30.5  30.5  30.5  30.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5  50.5
 [61]  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  70.5  90.5  90.5  90.5  90.5  90.5  90.5  90.5  90.5  90.5  90.5
 [91]  90.5  90.5  90.5  90.5  90.5  90.5  90.5  90.5  90.5  90.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5 110.5
[121] 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 130.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5
[151] 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 150.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5 170.5
[181] 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5 190.5

> yBar200 = c(rep(yBar[1],20), rep(yBar[2],20),rep(yBar[3],20),rep(yBar[4],20),rep(yBar[5],20),rep(yBar[6],20),rep(yBar[7],20),rep(yBar[8],20),rep(yBar[9],20),rep(yBar[10],20) )
> relx = x - xBar200
> rely = y - yBar200
# some description
> cor(relx, rely)
[1] 0.04065814
> summary(rely)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-2.274e+02 -4.637e+01 -5.717e+00  2.118e-15  5.049e+01  1.841e+02 
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-115.90   42.77  110.60  107.20  169.40  363.60 
> summary(yBar)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -6.753  63.640 112.800 107.200 140.300 198.700 

#relative standing regression
> relreg = lm(rely ~ relx)
> summary(relreg)

Call:
lm(formula = rely ~ relx)

Residuals:
     Min       1Q   Median       3Q      Max 
-228.074  -46.075   -5.102   49.937  188.174 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 2.283e-15  4.794e+00 4.76e-16    1.000
relx        4.761e-01  8.314e-01    0.573    0.568

Residual standard error: 67.8 on 198 degrees of freedom
Multiple R-Squared: 0.001653,   Adjusted R-squared: -0.003389 
F-statistic: 0.3279 on 1 and 198 DF,  p-value: 0.5676 

> #multiple regression versions

> regcontext = lm(y ~ x + xBar200)
> summary(regcontext)

Call:
lm(formula = y ~ x + xBar200)

Residuals:
     Min       1Q   Median       3Q      Max 
-231.294  -52.950   -2.927   53.340  206.462 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.1105     9.9741   0.613    0.541
x             0.4761     0.8584   0.555    0.580
xBar200       0.5298     0.8627   0.614    0.540

Residual standard error: 70 on 197 degrees of freedom
Multiple R-Squared: 0.4094,     Adjusted R-squared: 0.4034 
F-statistic: 68.29 on 2 and 197 DF,  p-value: < 2.2e-16 

# .4761 coeff of x matches the relative standing slope (within-pooled)
# .5298 coeff of xBar matches between slope minus relative standing slope (context effect)


> regcronbach = lm(y ~ relx + xBar200)
> summary(regcronbach)

Call:
lm(formula = y ~ relx + xBar200)

Residuals:
     Min       1Q   Median       3Q      Max 
-231.294  -52.950   -2.927   53.340  206.462 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.11046    9.97406   0.613    0.541    
relx         0.47607    0.85837   0.555    0.580    
xBar200      1.00583    0.08616  11.674   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 70 on 197 degrees of freedom
Multiple R-Squared: 0.4094,     Adjusted R-squared: 0.4034 
F-statistic: 68.29 on 2 and 197 DF,  p-value: < 2.2e-16 

# .4761 coeff of x matches the relative standing slope (win-pooled)
# 1.0058 coeff of xBar matches between slope

Approximate Duncan-Cuzzort-Duncan relation
> .985*1.0058 + (1 - .985)*.476
[1] 0.997853
close to the total slope 1.00
-----------------------------------------------------------------

> # now try random grouping (part II) use "Rnd" suffix to indicate groups formed 
                                                random grouping (independent of x)
> #get indices
> k = 1:200
> rndind = sample(k , size = 200, p = rep(1:1,200)/200)
> rndind
  [1]  48  52 197  31  79  62  69 129 172  84  20 177  43 195 113  24  75  74 173 144  65 137  41 168   3 185 103 141 107 170 184 198 101   8 126 180 153  32 196  80  95 183 124  56  18
 [46]   4 169  67 167  28 138 109 146  37  50 122 148 174  96 178  46  72 115 121  40  19  83  81 108  42 133 128 190  70 164 145 132  92  36 176  13  16 158  10   9  49  90 159 152 160
 [91]   1  78  11 110 182   7  58 188 147 179  45 162  91 130 155 106  60 191  39  66  47 117  15  61 166 135 116  57 123 112 140  54  86  44   2 114 163  51  21  14  82  26 154 189  29
[136]  55 118 139  27 200 161  22 102 199 150   6  35 143 157 104  64  53 165 131 105  98 151  89 149  85  73  59  12  30  25 136  88   5 134 156 119  93 171 186  68  87 194  97 192  94
[181] 142  17 100  33 187  71 193 181  77 125  99 120  34  38  63  23 111 127  76 175

#what I did here is draw a random ordering of the integers 1:200 that I'll
use to create groups for part II (Rnd suffix)

now create the groups
> xBarRnd = c()
> for(i in 1:10) {
+ xBarRnd[i] = mean(x[rndind[(i -1)*20 + 1:20]])  }

#result shows xBar rather similar across the 10 groups
> xBarRnd
 [1]  98.05 118.90 105.45  99.65  88.85  99.70  85.40 108.45 100.95  99.60
> var(xBarRnd)
[1] 88.70833
> var(x)
[1] 3350
> var(xBarRnd)/var(x)
[1] 0.0264801
very small value of the eta-sq
> #grouping very small (groups not formed on x)

> yBarRnd = c()
> for(i in 1:10) {
+ yBarRnd[i] = mean(y[rndind[(i -1)*20 + 1:20]])  }

> yBarRnd
 [1] 126.79351 135.30195  91.02650  76.22435  88.59928  96.39977  71.77486 147.88802 107.30942 130.64764

> cor(xBarRnd, yBarRnd)
[1] 0.6581125
> cor(x,y)
[1] 0.6389937
> cor(xBar, yBar)
[1] 0.9611697
# grouping on X gave a much higher between-group correlation

> betwregRnd = lm(yBarRnd ~ xBarRnd)
> summary(betwregRnd)

Call:
lm(formula = yBarRnd ~ xBarRnd)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.399  -8.854  -3.339  18.842  25.974 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -78.8572    75.5530  -1.044   0.3271  
xBarRnd       1.8513     0.7488   2.472   0.0386 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 21.16 on 8 degrees of freedom
Multiple R-Squared: 0.4331,     Adjusted R-squared: 0.3623 
F-statistic: 6.112 on 1 and 8 DF,  p-value: 0.03857 

#between slope is larger from random grouping than under grouping on X


> plot(xBarRnd, yBarRnd)

#now create the elongated group mean vector
> xBarRnd200 = c(rep(xBarRnd[1],20), rep(xBarRnd[2],20),rep(xBarRnd[3],20),rep(xBarRnd[4],20),rep(xBarRnd[5],20),rep(xBarRnd[6],20),rep(xBarRnd[7],20),rep(xBarRnd[8],20),rep(xBarRnd[9],20),rep(xBarRnd[10],20) )
> yBarRnd200 = c(rep(yBarRnd[1],20), rep(yBarRnd[2],20),rep(yBarRnd[3],20),rep(yBarRnd[4],20),rep(yBarRnd[5],20),rep(yBarRnd[6],20),rep(yBarRnd[7],20),rep(yBarRnd[8],20),rep(yBarRnd[9],20),rep(yBarRnd[10],20) )

#relative standing measures
> relxRnd = x - xBarRnd200
> relyRnd = y - yBarRnd200
> relregRnd = lm(relyRnd ~ relxRnd)

> summary(relregRnd)

Call:
lm(formula = relyRnd ~ relxRnd)

Residuals:
      Min        1Q    Median        3Q       Max 
-208.6740  -50.4961    0.5453   50.0308  196.2732 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 1.655e-15  5.305e+00 3.12e-16        1    
relxRnd     9.376e-01  8.757e-02    10.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 75.03 on 198 degrees of freedom
Multiple R-Squared: 0.3667,     Adjusted R-squared: 0.3635 
F-statistic: 114.6 on 1 and 198 DF,  p-value: < 2.2e-16 


#coeffs for random groups are 1.85 between and .938 within for
this ex, your artificial data will differ slightly

--------------------------------------------------------------------------