Stat 222 2018 Problem Set

Course Problem Set 2018

Due in class Dec 6, 2018

Usual Honor Code procedures: You may use any of your own inanimate resources--no collaboration or assistance from others. This work is done under Stanford's Honor Code.
Solutions for these problems are to be submitted in hard-copy form. Given that these problems are untimed, some care should be taken in presentation, clarity, and format. Especially important is to give full and clear answers to questions, not just to submit unannotated computer output, although relevant output should be included.
PLEASE check that you have answered all the parts and subparts. There are just three problems for this problem set; please start each problem on a new page and keep all material for a problem contiguous. It's fine, for example, to blend notebook paper with printed output; just keep it all together.
Please ask (rag@stanford.edu) about issues of question interpretation, and especially in regard to any materials you feel you need but don't have access to.
I will post a note here about any issues that come up (wording, interpretation), so it would be good to check this page periodically.

Sidenote. For those of you enrolled for 3 units we need to set up presentations. I will ask on Thurs about the date 12/4 3:30 onward.

Course Problem Set, Item 1.
   Change over time, measured outcome
Week 2, Exercise 1
Tolerance data [note: 10/12/17 data location updated]
A small subsample of data (16 respondents) from the National Youth Survey is obtained in long-form by
read.table("https://stats.idre.ucla.edu/wp-content/uploads/2016/02/tolerance1_pp.txt", sep=",", header=T)
and in wide form by
read.table("https://stats.idre.ucla.edu/wp-content/uploads/2016/02/tolerance1.txt", sep=",", header=T)
Yearly observations from ages 11 to 15 on the tolerance measure are included (tolerance to deviant behavior, e.g. cheat, drug, steal, beat; larger values indicate more tolerance on a 1-to-4 scale). Also in this data set are gender (is_male) and an exposure measure obtained at age 11 (self-report of close friends' involvement in deviant behaviors). Note: the time measure is age - 11.
i. Obtain individual OLS fits (tolerance over time) and plot the collection of those straight lines. Provide descriptive statistic summaries for the rate of change in tolerance and initial level.
ii. Fit a mixed effects model for tolerance over time (unconditional) for this collection of individuals. Obtain interval estimates for the fixed and random effects. Show that the fixed effects estimates correspond to quantities obtained in part i. Explain.
iii. Investigate whether the exposure measure is a useful predictor of level or rate of change in tolerance. What appears to be the best fitting mixed model for these data using these measures? Show specifics.
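
A minimal sketch of the per-subject OLS step in part i, using base R only. The data frame `toy` is simulated purely for illustration and is not the tolerance data; its column names (`id`, `time`, `tolerance`) are assumptions and should be checked against the file actually read in. With balanced data (every subject observed at the same times), the means of the individual intercepts and slopes coincide with the pooled OLS coefficients, which is the kind of correspondence part ii asks about.

```r
# Simulated balanced long-format data (hypothetical, NOT the NYS subsample)
set.seed(1)
toy <- expand.grid(time = 0:4, id = 1:6)
toy$tolerance <- 1 + 0.1 * toy$id +
  (0.05 + 0.02 * toy$id) * toy$time +
  rnorm(nrow(toy), sd = 0.1)

# One straight-line OLS fit per subject; keep intercept and slope
fits <- lapply(split(toy, toy$id),
               function(d) coef(lm(tolerance ~ time, data = d)))
ols <- do.call(rbind, fits)   # one row per subject

# Descriptive summaries of initial level and rate of change
print(summary(ols))

# Balanced design: mean of individual coefficients = pooled OLS fit
pooled <- coef(lm(tolerance ~ time, data = toy))
print(rbind(mean_of_individual = colMeans(ols), pooled = pooled))
```

For the real data the same `split`/`lapply` pattern would apply to the long-form file, and the mixed-model fixed effects can then be compared against `colMeans(ols)`.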

Course Problem Set, Item 2.
   Comparison of experimental groups--placebo-controlled, randomized study
Week 5, Exercise 2
      note: problem text reformatted for clarity
Treatment of Lead Exposed Children (TLC) Trial. Data (wide form) and description reside at the Laird-Ware text site
Just wide-form data with no column headers
a. Start out by just using the subset of the longitudinal data: Lead Level Week 0 and Week 6 (i.e. two-wave data).
(i) Carry out a statistical analysis for estimating the relative effectiveness of chelation treatment (succimer) compared with placebo (A,P) using mixed-effects models or repeated measures ANOVA.
(ii) Show the equivalence from the Brogan-Kutner paper between the simple t-test on improvement (change) and your analysis in part (i).
b. Finally use all 4 longitudinal measures (weeks 0,1,4,6) for an Active vs Placebo comparison using lmer. Compare with the results in part (a) that use only 2 observations.
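
The equivalence in part a(ii) can be illustrated numerically. The sketch below uses simulated two-wave data, not the TLC data; the variable names (`y0`, `y6`, `group`, `lead`, `week`) and the effect sizes are made up for illustration. It compares the pooled-variance two-sample t-test on change scores with the group-by-time interaction F test from a repeated measures ANOVA; with two waves the squared t statistic should equal the interaction F.

```r
# Simulated two-wave data in wide form (hypothetical, NOT the TLC trial)
set.seed(1)
n <- 20
wide <- data.frame(
  id    = factor(1:n),
  group = factor(rep(c("A", "P"), each = n / 2)),
  y0    = rnorm(n, mean = 26, sd = 5)
)
wide$y6 <- wide$y0 - ifelse(wide$group == "A", 6, 1) + rnorm(n, sd = 3)
wide$change <- wide$y6 - wide$y0

# Simple t-test on change, pooled variance
tt <- t.test(change ~ group, data = wide, var.equal = TRUE)

# Same data in long form for the repeated measures ANOVA
long <- reshape(wide[, c("id", "group", "y0", "y6")],
                direction = "long", varying = c("y0", "y6"),
                v.names = "lead", timevar = "week",
                times = c(0, 6), idvar = "id")
long$week <- factor(long$week)

fit <- aov(lead ~ group * week + Error(id), data = long)
within_tab <- summary(fit)[["Error: Within"]][[1]]
F_int <- within_tab[["F value"]][2]   # row 2 is the group:week term

print(c(t_squared = unname(tt$statistic)^2, F_interaction = F_int))
```

The two printed numbers should agree up to rounding error, since both tests ask whether mean change differs between groups.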

Course Problem Set, Item 3.
   Basic Survival analysis, right censoring.
Week 7 Exercise 2 continued as Week 8 Exercise 2       note: problem here slightly edited (some deletions) from weekly version.

2. Melanoma data. The data frame melanom is in package ISwR (help page melanom {ISwR}, "Survival after malignant melanoma"). Description: The melanom data frame has 205 rows and 7 columns (note that the str output below reports 6 variables). It contains data relating to the survival of patients after an operation for malignant melanoma, collected at Odense University Hospital by K.T. Drzewiecki.

 > str(melanom)
'data.frame':   205 obs. of  6 variables:
 $ no    : int  789 13 97 16 21 469 685 7 932 944 ...
 $ status: int  3 3 2 3 1 1 1 1 3 1 ...
 $ days  : int  10 30 35 99 185 204 210 232 232 279 ...
 $ ulc   : int  1 2 2 2 1 1 1 1 1 1 ...
 $ thick : int  676 65 134 290 1208 484 516 1288 322 741 ...
 $ sex   : int  2 2 2 1 2 2 2 2 1 1 ...

We are interested in
days: time on study after operation for malignant melanoma
status: the patient's status at the end of study
Documentation shows the possible values of status are: 1: dead from malignant melanoma 2: alive at end of study 3: dead from other causes. Consider 'dead from other causes' as censored (along with alive). Thus you can either recode status as (1,0) or use the logical vector status == 1 as the event indicator, so that the survival object is Surv(days, status == 1) (to do some of the problem for you).
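
As a small illustration of the censoring definition, the sketch below uses only the first ten rows of melanom, copied from the str listing above, so it runs without the ISwR package. It assumes the survival package is installed (it ships with standard R distributions). The object names `first10`, `event`, and `km` are my own; the full analysis would use all 205 rows.

```r
library(survival)

# First ten rows of melanom, copied from the str() listing
first10 <- data.frame(
  days   = c(10, 30, 35, 99, 185, 204, 210, 232, 232, 279),
  status = c(3, 3, 2, 3, 1, 1, 1, 1, 3, 1)
)

# Event = death from melanoma (status 1); statuses 2 and 3 are censored
first10$event <- first10$status == 1
n_censored <- sum(!first10$event)
print(n_censored)

# Kaplan-Meier estimate with the default pointwise 95% confidence limits
km <- survfit(Surv(days, event) ~ 1, data = first10)
print(summary(km))   # estimate and CI at each event time
# plot(km) would draw the curve together with its confidence limits
```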
a. How many survival times are censored? Obtain an estimate of the survival curve at each event time (along with CI) using the Kaplan-Meier estimate and plot the survival curve and confidence interval.
b. Does survival differ in men and women? Plot the male and female survival curves. Compare asymptotic (log-rank) and exact tests [see note below] for gender differences. Compare the exact test with a bootstrap approximation.
--------------------------------
note part b. Exact tests and memory limits.
In Week 7 class examples (aml/leukemia data from Miller) and in Week 7 RQ 2 (rat data) we showed the use of an exact test for 2-group survival comparisons. These were relatively small data sets (rat has 40 subjects). The melanoma data is larger--205 subjects, though not a giant data set. But depending on the size of your machine (such as a laptop) you may well not have enough memory to carry out the exact test. Even with a moderately large machine, the default maximum memory limit set in R may be too low (easy enough to change, but maybe not worth the trouble for this exercise). So if you run into problems conducting the exact test, it is entirely adequate to show the memory failure and then revert to the approximate bootstrap option shown in the class example and in Week 7 RQ2 (that's kind of what it is there for). Try the bootstrap approximation option with 1000 and then 10000 replications and see if the results match.
------------------------------
c. Use Cox regression to carry out the gender comparison of the survival curves in part b. Obtain a confidence interval for the effect of gender on the hazard.
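
Regarding the note on part b: the class example is not reproduced on this page, so the following is only a generic sketch of a resampling approximation to the exact test, written as a Monte Carlo permutation of the group labels around the log-rank statistic from survdiff. It may differ from the option used in class. The data are simulated (not melanom), and the names `toy` and `perm_p` are hypothetical.

```r
library(survival)

# Simulated two-group survival data (hypothetical, NOT melanom)
set.seed(222)
n <- 40
toy <- data.frame(
  days  = round(rexp(n, rate = 1 / 1000)) + 1,
  event = rbinom(n, 1, 0.6),
  sex   = rep(1:2, each = n / 2)
)

# Monte Carlo permutation p-value for the log-rank chi-square statistic
perm_p <- function(d, B) {
  obs <- survdiff(Surv(days, event) ~ sex, data = d)$chisq
  sims <- replicate(B, {
    d$sex <- sample(d$sex)   # shuffle group labels, keep times and events
    survdiff(Surv(days, event) ~ sex, data = d)$chisq
  })
  mean(sims >= obs)
}

p_1000 <- perm_p(toy, B = 1000)
print(p_1000)
# Rerunning with B = 10000 shows how much the estimate moves with more replications
```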

continued with week 8 problem
2. melanom data from week 7, exercise 2. Define the censoring as was done in that problem. I found it useful to make a 0,1 variable isMale from the integer sex designation and make a 0,1 variable isUlcer from the ulceration variable (careful there).
a. Repeat the gender comparison in parts b or c in Ex 2, week 7, stratifying on ulceration of the tumor (or not). Compare with the result in Ex 2 week 7 and interpret.
b. Carry out a Cox regression using predictors log(thick) and the gender indicator, stratifying on ulceration. Interpret the results. Check the viability of the proportional hazards assumption for this Cox model.
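
A hedged sketch of the recoding and of the stratified Cox fit in part b. The coding used here (ulc: 1 = present, 2 = absent; sex: 1 = female, 2 = male) is my recollection of the ISwR help page and should be verified with ?melanom before relying on it. The first part uses the ten rows from the str listing so it runs anywhere; the model fit is attempted only if the ISwR and survival packages are available.

```r
# First ten rows of ulc and sex, copied from the str() listing
first10 <- data.frame(
  ulc = c(1, 2, 2, 2, 1, 1, 1, 1, 1, 1),
  sex = c(2, 2, 2, 1, 2, 2, 2, 2, 1, 1)
)

# ASSUMED coding (check ?melanom): ulc 1 = present, sex 2 = male
first10$isUlcer <- as.integer(first10$ulc == 1)
first10$isMale  <- as.integer(first10$sex == 2)
print(first10)

# Full-data fit, run only when both packages are installed
if (requireNamespace("ISwR", quietly = TRUE) &&
    requireNamespace("survival", quietly = TRUE)) {
  library(survival)
  data(melanom, package = "ISwR")
  melanom$isUlcer <- as.integer(melanom$ulc == 1)
  melanom$isMale  <- as.integer(melanom$sex == 2)

  # Cox model with log(thick) and gender, stratified on ulceration
  fit <- coxph(Surv(days, status == 1) ~ log(thick) + isMale + strata(isUlcer),
               data = melanom)
  print(summary(fit))

  # Schoenfeld-residual check of the proportional hazards assumption
  print(cox.zph(fit))
}
```

Note that a naive `ulc - 1` recode would mark the absent category as 1 under the assumed coding, which is presumably the reason for the "careful there" remark.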

end 2018 Course Problem Set