Stat209/HRP239/Ed260A D Rogosa 1/14/18
Note: TA office hours on Monday.
This HW has five problems on Week 1 material, plus two logistic regression examples (Problems 6 and 7) as review for anyone who wants it.
Solutions are also posted.

Assignment 1.
Review and some extensions: Multiple Regression

Lab 1 will have more (unnecessary?) practice on the basics

1. Yule's Data via Freedman (deep regression review)
Yule (1899), "An investigation into the causes of changes in pauperism in England,
chiefly during the last two intercensal decades."
File yuledoc.dat contains the
data in Table 3, p. 10 of the Freedman text (the 1871 to 1881 comparison),
also described in sec. 4 (p. 6) of the class reading "Association to Causation" (and elsewhere):
http://www-stat.stanford.edu/~rag/stat209/yuledoc.dat
(Note: there is some preamble text in the file.
I commented out the preamble so it will load without issue,
but it is still good practice to look at a file before loading it.)
I scanned and posted pp. 10-11 of Freedman's text for reference
on the variables and fit:
http://www-stat.stanford.edu/~rag/stat209/DAFp10.pdf
a. Replicate Yule's regression equation for
the metropolitan unions, 1871-81 (parameters a, b, c, d).
Yule offered a regression equation:
ΔPaup = a + b*ΔOut + c*ΔOld + d*ΔPop + error.
In this equation,
Δ is percentage change over time,
"Out" is the out-relief ratio N/D,
N = number on welfare outside the poorhouse,
D = number inside,
"Old" is the percentage of the population over 65,
"Pop" is the population.
Data are from the English Censuses of 1871 and 1881.
(Subtract 100 from each entry to get the percentage changes;
cf. Freedman text pp. 10-11.)

Arithmetic addendum
Example: a variable has value 50 in 1870 and 60 in 1880;
that's an increase of 10 units, or 20% (the metric used in
the regression equation).
The data in yuledoc.dat reside in a somewhat cryptic form:
in this case the entry would be (60/50)*100 = 120, so
we obtain the (desired) 20% entry by subtracting 100
from the value in yuledoc.dat.
Example 2: value 70 in 1870 and 56 in 1880;
the yuledoc.dat entry would be (56/70)*100 = 80.
Subtract 100 to get -20 (a 20% decline).
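The addendum's conversion can be checked directly in R; a minimal sketch:

```r
# Convert yuledoc.dat-style entries (ratio of later to earlier value,
# times 100) into the percentage changes used in Yule's regression.
entries <- c((60 / 50) * 100, (56 / 70) * 100)   # the two addendum examples
pct_change <- entries - 100
pct_change   # 20 (a 20% increase) and -20 (a 20% decline)
```

After loading yuledoc.dat and subtracting 100 from each column, the fit itself is a call like lm(paup ~ out + old + pop) (column names here are placeholders; check the file's preamble for the actual layout).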

b. More complex regression review, which you may have done before.
The original version of this problem, Freedman p. 63 problem 4, asks you to test
whether the regression parameters for the variables
change in population over 65 (c) and change in population (d)
are both 0 (i.e., the null hypothesis c = d = 0 in Yule's regression above).
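One standard way to carry out this test is the F test comparing nested models via anova(); a minimal sketch on simulated stand-in data (substitute the actual yuledoc.dat variables):

```r
# F test of H0: c = d = 0 -- compare the full model to the reduced
# model that omits old and pop. Simulated data for illustration only.
set.seed(1)
d <- data.frame(out = rnorm(32), old = rnorm(32), pop = rnorm(32))
d$paup <- 1 + 0.7 * d$out + rnorm(32)        # c = d = 0 in truth here
full    <- lm(paup ~ out + old + pop, data = d)
reduced <- lm(paup ~ out, data = d)
anova(reduced, full)    # the F test, on 2 numerator degrees of freedom
```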

2. Revisit Coleman data example from week 1 lecture.
part a.
In the discussion of the Coleman data in Chap. 13 of the Green book (Mosteller and Tukey),
their Example 9 commentary suggests trying one school resource variable and one
family demographic variable (instead of the bunch of redundant variables) in predicting vach.
See what happens with momed in simpler two-predictor models:
predictors tverb momed or ssal momed.
Are these regressions "better" than the full model?
What you get from regression critically depends on what else is in the model?
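A simulated illustration of that last point (stand-in variables, not the Coleman data): with correlated predictors, a coefficient can even change sign depending on what else is in the model.

```r
# "momed" here proxies a correlated "ses" variable; its coefficient is
# positive alone but negative once ses is also in the model.
set.seed(2)
n <- 200
ses   <- rnorm(n)
momed <- 0.9 * ses + rnorm(n, sd = sqrt(1 - 0.81))   # corr(momed, ses) = 0.9
vach  <- 2 * ses - 0.5 * momed + rnorm(n)
coef(lm(vach ~ momed))["momed"]          # positive simple slope
coef(lm(vach ~ momed + ses))["momed"]    # near -0.5 with ses included
```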
part b.
As the Coleman data snippet used in the Green book (Mosteller and Tukey)
is only 20 schools (with 5 predictors),
for expository purposes I created a larger artificial data set,
with 320 rows, for a population having the same means and
covariances as the n=20 sample.
http://www-stat.stanford.edu/~rag/stat209/coleman320.dat
Repeat the multiple regression and adjusted variables
demonstration for momed.
Also, as was done in the class handout,
plot the outcome (or adjusted outcome) vs. the adjusted predictor
(residuals from momed on the other predictors) to obtain
the scatterplot for the multiple regression weight (cf. plot from week 1 materials).
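The mechanics of the adjusted-variable demonstration, sketched on simulated stand-ins (apply the same steps to coleman320.dat with its actual column names):

```r
# The multiple regression weight for momed equals the simple slope of
# the outcome on momed-adjusted-for-the-other-predictors (exact identity).
set.seed(3)
n <- 320
ses <- rnorm(n); tverb <- rnorm(n)
momed <- 0.5 * ses + rnorm(n)
vach  <- 1 + 0.8 * momed + 0.6 * ses + 0.3 * tverb + rnorm(n)
full      <- lm(vach ~ momed + ses + tverb)
momed.adj <- resid(lm(momed ~ ses + tverb))    # adjusted predictor
plot(momed.adj, vach)                          # the handout's scatterplot
coef(lm(vach ~ momed.adj))[2]                  # same as coef(full)["momed"]
```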

Computation note, data generation for this example
to create the artificial data set with 320 rows
I used the mvrnorm function in R, which requires the
package MASS (part of the basic R distribution):
> library(MASS)   # brings in the contents of the library
> ?mvrnorm        # help file for this useful data generation function
Then, to obtain a sample of 320 drawn from a multivariate normal population
with population mean and covariance matching the n=20 sample (data frame "ed" here):
> eddat320 = mvrnorm(n = 320, colMeans(ed), cov(ed), tol = 1e-6, empirical = FALSE)
If empirical = TRUE, the sample moments would match exactly
those in the "ed" dataset.

Extra bit: Coleman data avPlots
I mentioned in class that John Fox's car package generates the adjusted variable plots
we created for momed, via the avPlot command.
Try that out for the base Coleman data, n=20.
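A minimal sketch of the car call (shown with a simulated 20-row stand-in for the Coleman data frame "ed"; requires the car package):

```r
library(car)                      # John Fox's car package
set.seed(4)
ed <- data.frame(momed = rnorm(20), ses = rnorm(20))
ed$vach <- 1 + 0.8 * ed$momed + 0.5 * ed$ses + rnorm(20)
fit <- lm(vach ~ momed + ses, data = ed)
avPlots(fit, terms = ~ momed)     # adjusted variable plot for momed
```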

3. Enrichment for those comfortable with a bit of symbol manipulation.
From the class handout, top frame of the Week 1 "Math facts", prove the
result in the 2-predictor case that the multiple regression coefficient
of X_1 is identical to the coefficient for Y regressed on the adjusted variable.
Hint: use the formulas on the 'regression recursion' slide.

4. The "Regression Recursion" slide (useful trick) is worth revisiting.
Linked in week 1 materials.
Take the second version (using variables labelled 1, 2, 3), and from the Coleman data use
vach as var1, momed as var2, and ses as var3. Demonstrate that this relation holds in the
sample for the parameter estimates from the corresponding regressions.
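As a sanity check of the mechanics (on simulated stand-ins, not the Coleman data), here is the recursion in the form usually given -- an assumption here, so confirm it against the week 1 slide: b(1 on 2) = b(1 on 2 | 3) + b(1 on 3 | 2) * b(3 on 2).

```r
# The simple slope of var1 on var2 equals the partial slope of var2
# plus the partial slope of var3 times the slope of var3 on var2.
# This is exact OLS algebra, so it holds in any sample.
set.seed(5)
n <- 100
v2 <- rnorm(n)
v3 <- 0.6 * v2 + rnorm(n)
v1 <- 0.5 * v2 + 0.7 * v3 + rnorm(n)
simple  <- coef(lm(v1 ~ v2))["v2"]
partial <- coef(lm(v1 ~ v2 + v3))
b32     <- coef(lm(v3 ~ v2))["v2"]
simple
partial["v2"] + partial["v3"] * b32    # identical to the simple slope
```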

5. Measurement error, single predictor linear regression
Construct a simple artificial-data illustration of the effects of measurement error
in a single predictor variable on a regression slope. The result is shown in the class handout, week 1.
Set the reliability coefficient for the predictor variable to be .8. Set the slope for
the perfectly measured predictor to be 1.5. Compare slope for perfectly measured predictor
with the slope using the fallible predictor measurement.
A couple of ways of doing this exercise (your choice):
a. generate true predictor values, predictor error, and outcome variable, and do the two regressions
b. use mvrnorm to generate (all at once) outcome, true predictor, and fallible predictor, and do the two regressions
c. use the DAAG function errorsINx (linked in week 1 materials)
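A minimal sketch of option (a), assuming the classical error-in-x setup: with reliability 0.8 the observed slope attenuates toward 1.5 * 0.8 = 1.2.

```r
# Reliability = var(true) / var(observed) = 1 / (1 + 0.25) = 0.8.
set.seed(6)
n <- 5000
xtrue <- rnorm(n, sd = 1)                    # perfectly measured predictor
xobs  <- xtrue + rnorm(n, sd = sqrt(0.25))   # fallible measurement
y     <- 2 + 1.5 * xtrue + rnorm(n)          # true slope 1.5
coef(lm(y ~ xtrue))["xtrue"]   # near 1.5
coef(lm(y ~ xobs))["xobs"]     # near 1.2 (attenuated)
```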

Problem 6. Deep review: logistic regression (dichotomous outcome), Stat141 examples

6A. Logistic Regression example single predictor
Programmer example
A small-scale investigation was undertaken to study the effect of
computer programming experience on ability to complete a complex
programming task, including debugging, within a specified time.
Twenty-five persons were selected for the study. They had varying
amounts of programming experience (measured in months of
experience). All persons were given the same programming task.
The results are coded in binary fashion; if the task was
completed successfully in the allotted time, it was scored 1, and
if the task was not completed successfully, it was scored 0.
Columns are Months of experience and the binary outcome measure
(0,1 indicator) of success.
Data available from the (old) Stat141 site:
http://www-stat.stanford.edu/~rag/stat141/ass/program.dat
Fit a logistic regression for success.
Give estimates of the probability of success for subjects with 6,
18, and 30 months of experience.
Compare with OLS fits (i.e., treating the 0/1 outcome as a measured variable).
programmer data: [character plot of the 0/1 success outcome versus months of experience (5.0 to 30.0); a '2' marks two coincident points]
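The mechanics for 6A, sketched on simulated stand-in data (swap in the two columns of program.dat, months and the 0/1 success indicator):

```r
# Logistic fit, predicted success probabilities at 6, 18, 30 months,
# and the OLS comparison on the 0/1 outcome.
set.seed(7)
months  <- runif(25, 5, 30)
success <- rbinom(25, 1, plogis(-3 + 0.2 * months))
fit  <- glm(success ~ months, family = binomial)
phat <- predict(fit, data.frame(months = c(6, 18, 30)), type = "response")
phat                          # three estimated success probabilities
coef(lm(success ~ months))    # OLS line for comparison
```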
6B. Logistic Regression with two predictors (one group variable), Donner party
Donner party data from the ed401 (Rintro) website
http://www.stanford.edu/~rag/ed401/donner.dat
Data description and problem (redacted) at
http://www.stanford.edu/~rag/stat209/donnerdesc.pdf

7. More Logistic Regression Review (for those who want it)
Logistic regression from Dalgaard (chap 11 of the ISwR book).
The data: the malaria dataset in the ISwR package.
A random sample of 100 children aged 3–15 years from a village in Ghana. The children were
followed for a period of 8 months. At the beginning of the study, values of a particular antibody
were assessed. Based on observations during the study period, the children were categorized into
two groups: individuals with and without symptoms of malaria.
Use Logistic regression to predict malaria from age and antibody level.
What is the probability of malaria for someone of median age, median antibody level?
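A minimal sketch, assuming the ISwR malaria data frame has variables age, ab, and the 0/1 indicator mal, and following Dalgaard's use of log(ab) (verify the names with ?malaria):

```r
library(ISwR)   # Dalgaard's package; install.packages("ISwR") if needed
data(malaria)
fit <- glm(mal ~ age + log(ab), family = binomial, data = malaria)
summary(fit)
newd <- data.frame(age = median(malaria$age), ab = median(malaria$ab))
predict(fit, newd, type = "response")   # P(malaria) at median age, median ab
```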

END HW1 2018