Stat209/Ed260 D Rogosa 1/27/19
Solutions Assignment 3.
Problem 1, Blau-Duncan path analysis
first part
refer to the Freedman text, or to Freedman's "Statistical models for causation"
linked in the week 3 readings
> # do Blau-Duncan path analysis from correlation matrices (cf Lab 1 and week 3 handouts)
# predictor correlation matrices
> RxxBD1 = matrix( nrow = 2, ncol = 2, data = c(1, 0.516, .516, 1), byrow = T )
> RxxBD2 = matrix( nrow = 2, ncol = 2, data = c(1, 0.438, .438, 1), byrow = T )
> RxxBD3 = matrix( nrow = 3, ncol = 3, data = c(1, 0.438, .538, .438, 1, .417, .538,.417,1), byrow = T )
> RxxBD3
[,1] [,2] [,3]
[1,] 1.000 0.438 0.538
[2,] 0.438 1.000 0.417
[3,] 0.538 0.417 1.000
# correlation matrices between outcome var and predictors
> RxyBD1 = matrix( nrow = 2, ncol = 1, data = c(0.453,.438) )
> RxyBD2 = matrix( nrow = 2, ncol = 1, data = c(0.538,.417) )
> RxyBD3 = matrix( nrow = 3, ncol = 1, data = c(0.596,.405, .541) )
> #path coeffs for the 3 eqs pp.76-77 DAF text
# obtain the standardized regression coefficients
> pathcBD1 = t(RxyBD1)%*%solve(RxxBD1)
> pathcBD2 = t(RxyBD2)%*%solve(RxxBD2)
> pathcBD3 = t(RxyBD3)%*%solve(RxxBD3)
> pathcBD1
[,1] [,2]
[1,] 0.3093613 0.2783696
> pathcBD2
[,1] [,2]
[1,] 0.4397097 0.2244072
> pathcBD3
[,1] [,2] [,3]
[1,] 0.3945428 0.1151266 0.2807282
> #coeffs match well p.76
> #get Rsq and coeff for disturbance terms
> RsqBD1 = pathcBD1%*%RxyBD1
> sqrt(1-RsqBD1)
[,1]
[1,] 0.8590305
> RsqBD2 = pathcBD2%*%RxyBD2
> sqrt(1-RsqBD2)
[,1]
[1,] 0.8184488
> RsqBD3 = pathcBD3%*%RxyBD3
> sqrt(1-RsqBD3)
[,1]
[1,] 0.7525638
> #coeff for disturbance terms matched (DAF SD's p.79)
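The three computations above all follow the same pattern; a small helper function (my own wrapper, not part of the assignment) makes it explicit:

```r
# path coefficients and disturbance SD from correlation matrices
# Rxx: predictor correlations; Rxy: predictor-outcome correlations
pathcoefs <- function(Rxx, Rxy) {
  b <- solve(Rxx, Rxy)             # standardized regression coefficients
  Rsq <- as.numeric(t(b) %*% Rxy)  # R-squared
  list(coef = as.vector(b), dist = sqrt(1 - Rsq))
}
```

For example, pathcoefs(RxxBD1, RxyBD1) reproduces the 0.309, 0.278 coefficients and the 0.859 disturbance coefficient shown above.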
> # aside: in his solutions Freedman does these equations more generally, all at
once, which is fine; in his notation that can be written as
A1. Y = e*U + f*X + g*W + eta, i.e. Y = M [e f g]' + eta with M = [U X W].
The coefficient estimates are given by
[eHat fHat gHat]' = (M'M)^(-1) M'y.
With standardized variables, M'M is proportional to the correlation matrix of
the predictors, and M'y to the correlations of the predictors with the
response.
The standard deviations can be calculated exactly as shown in the book on
page 79 (1st ed).
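With every variable standardized, M'M = (n-1)*Rxx and M'y = (n-1)*Rxy, so the (M'M)^(-1) M'y normal equations reduce to the correlation-matrix computation used above. A quick simulated check (all variable names and numbers here are made up for illustration):

```r
set.seed(209)
n <- 200
# simulate three correlated predictors and a response
U <- rnorm(n); X <- 0.5*U + rnorm(n); W <- 0.4*U + 0.3*X + rnorm(n)
y <- 0.3*U + 0.2*X + 0.4*W + rnorm(n)
# standardize everything (mean 0, sd 1), drop the intercept
M <- scale(cbind(U, X, W)); y <- as.vector(scale(y))
b1 <- solve(t(M) %*% M, t(M) %*% y)      # (M'M)^{-1} M'y
b2 <- solve(cor(M), cor(M, y))           # Rxx^{-1} Rxy
all.equal(as.vector(b1), as.vector(b2))  # TRUE
```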
second part: text problems
scan of problems from Freedman text at
http://www.stanford.edu/~rag/stat209/DAFtextp8081.pdf
A5. We should measure variation by the standard deviation, not the
variance. The variance is E[(X - \mu)^2], which is on the scale of X^2 and
has the units of X^2; the standard deviation is on the scale of X and has
the units of X.
A6. Since there is an indirect path from V to W that passes through U,
the model says that the father's education affects the son's first job by
affecting the son's education. An arrow from V to W or V to Y would imply
that V has a direct effect on W or Y, respectively. In principle, there is
nothing wrong with such a model. However, the predictors would be highly
correlated, resulting in unstable estimates of the coefficients.
In principle, there could also be an arrow from Y to V, which would be
interpreted as the son's occupation affecting his father's education: for
example, the father might further his education in order to work at his
son's company. But then the model would no longer be recursive (it would
contain a feedback loop), making it much more complex to estimate.
prob 8
A8. Let Es denote the education of the son, Ef the education of the father,
and Of the occupation of the father. Equation 1 says, Es = a* Ef + b * Of.
If education is a 0-1 variable, then we would interpret the fitted values,
EsHat, as the probability that Es equals 1. Since probabilities are
bounded between 0 and 1, we might question the validity of this equation:
for some values of the predictors the fitted values can fall outside [0, 1].
In this situation a generalized linear model with a binomial random component
(i.e. logistic regression) may be a better model.
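A sketch of that alternative on entirely fabricated 0-1 education data (the variable names and coefficients below are made up, not from the text):

```r
set.seed(1)
n <- 500
Of <- rnorm(n)                        # father's occupation (standardized)
Ef <- rbinom(n, 1, plogis(0.8 * Of))  # father's education, coded 0-1
# son's education, 0-1, generated from a logistic model
Es <- rbinom(n, 1, plogis(-0.5 + 1.2 * Ef + 0.6 * Of))
fit <- glm(Es ~ Ef + Of, family = binomial)
range(fitted(fit))  # fitted probabilities stay strictly inside (0, 1)
```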
freedman p.97 prob 4(a,b)
E4. (a) False. The effect is indirect. The father's education influences
the son's education which in turn influences the son's job.
(b) True.
============================================================
Problem 2
I have used this path analysis in past exam problems
Causal Models of Publishing Productivity
Homework 3 problem 2 considers one of the path
analysis models from "Causal Models of Publishing
Productivity in Psychology", Rogers & Maranto,
J. Applied Psychology, 1989, 74(4), 636-649.
direct link to paper
http://content.apa.org/journals/apl/74/4/636.pdf
The path analysis conducted by the authors
from a sample of 86 men and 76 women is shown
on p.101 of Freedman's text and on page 647
of the publication; that page is also available at
http://www-stat.stanford.edu/~rag/stat209/pathpage647.pdf
You can recover the correlation matrix by adding the Table 7 fits and residuals.
But here all the problem asks you to do is look at this analysis and consider
its usefulness. Note they don't display the disturbance paths, so we don't
get a look at the Rsq values.
What are the predictors of Pubs (direct effects) in this picture?
What are the predictors of Cites (direct effects) in this picture?
PUBS has 4 arrows in: PUBS ~ SEX + QFJ + ABILITY + PREPROD [in R-speak]
CITES also has 4 arrows in: CITES ~ PUBS + QFJ + ABILITY + PREPROD
think about week1 results for these.
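For reference, the two direct-effect regressions in R model syntax, run here on entirely fabricated stand-in data (the paper's raw data are not available; every number and the data-frame name below are made up purely to show the syntax):

```r
set.seed(86)
n <- 162  # 86 men + 76 women in the analyzed sample
# fabricated stand-in data with the diagram's variable names
rm89 <- data.frame(SEX = rbinom(n, 1, 0.5), QFJ = rnorm(n),
                   ABILITY = rnorm(n), PREPROD = rnorm(n))
rm89$PUBS  <- with(rm89, 0.5 * SEX + 0.3 * ABILITY + rnorm(n))
rm89$CITES <- with(rm89, 0.6 * PUBS + 0.2 * ABILITY + rnorm(n))
# the two sets of direct effects in the path diagram
fit.pubs  <- lm(PUBS  ~ SEX + QFJ + ABILITY + PREPROD, data = rm89)
fit.cites <- lm(CITES ~ PUBS + QFJ + ABILITY + PREPROD, data = rm89)
```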
The diagram provides estimates of supposed causal effects (the article is,
after all, a "causal model of publishing"); the numbers shown on the "paths"
are regression coefficient estimates.
Consider a "productive
researcher" to be defined in terms of the number of publications and the
number of cites. The good news is that ability "affects" pubs and cites with
a positive coefficient in each case. Therefore, higher ability leads to a
more "productive researcher", according to the causal path gospel.
Some bad news is that sex is a predictor of pubs with a large coefficient
value. However, it is likely that there are confounding variables between
sex and pubs.
Think about what the "as if by experiment" conclusions would be here. Try it again with
a straight face.
By now you can enumerate reasons why these regression coeffs cannot be taken seriously--
regression with observational data, choice of variables, measurement error
(here very hard things to measure accurately), and individual differences.
Another (side) argument for discounting the findings is the response rate:
only 86/241 = 36% of men and 76/244 = 31% of women responded to the survey,
which could produce nonresponse bias in the sample.
===================================================
Problem 3.
This numerical example illustrates the results presented in class
for the longitudinal path analysis (aka the Goldstein example) with
handouts from week 3 Lecture topics, item 2.
a clever read.table statement (not from me);
I would just save the text file and edit it to make the data set
# read in the portion of the web page you need and rename the columns
> casual = read.table("http://www-stat.stanford.edu/~rag/stat209/casualdat", header=T, skip=1, nrows=40)
#these days use statweb esp if off-campus
> casual = read.table("http://statweb.stanford.edu/~rag/stat209/casualdat", header=T, skip=1, nrows=40)
> head(casual)
Xi.1. Xi.3. Xi.5. W
1 37.55913 49.29053 61.02193 15.97247
2 45.65429 51.58451 57.51472 15.37724
3 40.93881 52.87978 64.82076 11.47902
4 47.35937 55.44879 63.53822 16.88944
5 52.70511 62.70351 72.70191 19.17834
6 30.45231 46.34082 62.22934 11.81822
> attach(casual)
> xi1 = casual[,1]
> xi3 = casual[,2]
> xi5 = casual[,3]
# or do names() to simplify var names from file
#path regressions (see handouts) match exactly results stated in handout
# for population coefficients.
> lm1 = lm(xi3 ~ xi1)
> lm2 = lm(xi5 ~ xi3 + xi1)
> summary(lm2)
Call:
lm(formula = xi5 ~ xi3 + xi1)
Residuals:
Min 1Q Median 3Q Max
-1.974e-05 -5.407e-06 -3.446e-07 5.819e-06 1.489e-05
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.385e-06 9.768e-06 -7.560e-01 0.454
xi3 2.000e+00 3.309e-07 6.045e+06 <2e-16 ***
xi1 -1.000e+00 3.308e-07 -3.023e+06 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.061e-06 on 37 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.563e+13 on 2 and 37 DF, p-value: < 2.2e-16
------------------------
another equivalent version (just because I have it);
my strategy was to cut out the 40 rows in an editor.
Can you have parens () in a variable name? I cut those out to be safe.
The point here was just to have students create a data set without much typing.
> casual = read.table("D:\\drr11\\stat209\\th1\\casualdat", header=T)
> cor(casual)
Xi1 Xi3 Xi5
Xi1 1.0000000 0.8421714 0.5357932
Xi3 0.8421714 1.0000000 0.9065112
Xi5 0.5357932 0.9065112 1.0000000
> attach(casual)
> lm1 = lm(Xi3 ~ Xi1)
> lm2 = lm(Xi5 ~ Xi3 + Xi1)
> summary(lm2)
Call:
lm(formula = Xi5 ~ Xi3 + Xi1)
Residuals:
Min 1Q Median 3Q Max
-1.974e-05 -5.407e-06 -3.446e-07 5.819e-06 1.489e-05
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.385e-06 9.768e-06 -7.560e-01 0.454
Xi3 2.000e+00 3.309e-07 6.045e+06 <2e-16 ***
Xi1 -1.000e+00 3.308e-07 -3.023e+06 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.061e-06 on 37 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.563e+13 on 2 and 37 DF, p-value: < 2.2e-16
> #so with Rsq = 1 (perfect fit) you get 2 and -1 for the coefficients,
> #which match the (5-1)/(3-1) = 2 and (3-5)/(3-1) = -1 results for the
> #coefficients; the standard errors are essentially zero.
# To take the HW question literally--yes, you can compute standard errors,
# but it seems odd from a sample of n = 40 that these should be zero. Oh yeah,
# also this path analysis has R^2 = 1 (perfect fit) with meaningless
# (data-free) coefficients.
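The perfect fit is structural, not empirical: under straight-line growth observed at times 1, 3, 5, xi5 = 2*xi3 - xi1 identically, so any sample returns the coefficients 2 and -1 with R^2 = 1. A reconstruction with made-up growth data (intercept and slope distributions chosen arbitrarily):

```r
# straight-line growth: xi_t = b0 + b1 * t, observed at t = 1, 3, 5
set.seed(3)
n <- 40
b0 <- rnorm(n, 40, 5)   # person-specific intercepts (made-up values)
b1 <- rnorm(n, 5, 1)    # person-specific slopes (made-up values)
xi1 <- b0 + 1 * b1; xi3 <- b0 + 3 * b1; xi5 <- b0 + 5 * b1
# xi5 = 2*xi3 - xi1 holds exactly, so the path regression is deterministic
fit <- lm(xi5 ~ xi3 + xi1)
coef(fit)[c("xi3", "xi1")]  # essentially 2 and -1
summary(fit)$r.squared      # R^2 = 1 up to machine precision
```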
=============================================================
Problem 4 ENRICHMENT ITEM, Structural Equation Models
Method-of-moments for two-variable, two-indicator model
For the Structural Equation Models handout from Joreskog
book, obtain parameter estimates for the no-correlated error
version (9 parameters, top covariance matrix) in terms of the
sample variance and covariances among the four indicators (y_ij).
Brute-force substitution will get you a non-optimal estimate, which
suffices for instructional purposes.
A nicely formatted solution is at
http://www-stat.stanford.edu/~rag/stat209/hw3p5.pdf
----------------------
=======================================================================================
end HW3 solutions 2019