Stat209 Computing Lab 1 Multiple Regression Basics 1/18/16 STAT 209 D Rogosa Computational Lab 1. Some Multiple Regression Basics The content is a hodge-podge of regression activities reviewing class topics and examples. This should be pretty routine/redundant with HW from weeks 1 and 2, i.e. additional review. note: the format of this intro lab is chatter, with R-command and output interspersed. There are four "Tasks" (i.e. sections) to look at TASK 1: Multiple Regression Task 2: adjusted variables interpretation TASK 3: Standardized coefficients TASK 4: Logistic regression review (using babies data) ------------------------------------- You can read through (or read a section) and then try it on your own, just by cutting and pasting commands. Trust that there are more elegant ways of doing any of these tasks, but the intent is basic instruction, not showing off R This lab covers course content weeks 1-2 (also relevant to week 3) actual R commands begin with > # designates a comment in the R-session Data from Verzani UsingR Package babies Mothers and their babies data Description A collection of variables taken for each new mother in a Child and Health Development Study. Usage data(babies) Format A data frame with 1,236 observations on the following 23 variables. Variables in data file id identification number pluralty 5= single fetus outcome 1= live birth that survived at least 28 days date birth date where 1096=January1,1961 gestation length of gestation in days sex infant’s sex 1=male 2=female 9=unknown wt birth weight in ounces (999 unknown) parity total number of previous pregnancies including fetal deaths and still births, 99=unknown race mother’s race 0-5=white 6=mex 7=black 8=asian 9=mixed 99=unknown age mother’s age in years at termination of pregnancy, 99=unknown ed mother’s education 0= less than 8th grade, 1 = 8th -12th grade - did not graduate, 2= HS graduate– no other schooling , 3= HS+trade, 4=HS+some college 5= College graduate, 6&7 Trade school HS unclear, 9=unknown ht mother’s height in inches to the last completed inch 99=unknown wt1 mother prepregnancy wt in pounds, 999=unknown drace father’s race, coding same as mother’s race. dage father’s age, coding same as mother’s age. ded father’s education, coding same as mother’s education. dht father’s height, coding same as for mother’s height dwt father’s weight coding same as for mother’s weight marital 1=married, 2= legally separated, 3= divorced, 4=widowed, 5=never married inc family yearly income in $2500 increments 0 = under 2500, 1=2500-4999, ..., 8= 12,500-14,999, 9=15000+, 98=unknown, 99=not asked smoke does mother smoke? 0=never, 1= smokes now, 2=until current pregnancy, 3=once did, not now, 9=unknown time If mother quit, how long ago? 0=never smoked, 1=still smokes, 2=during current preg, 3=within 1 yr, 4= 1 to 2 years ago, 5= 2 to 3 yr ago, 6= 3 to 4 yrs ago, 7=5 to 9yrs ago, 8=10+yrs ago, 9=quit and don’t know, 98=unknown, 99=not asked number number of cigs smoked per day for past and current smokers 0=never, 1=1-4,2=5-9, 3=10-14, 4=15-19, 5=20-29, 6=30-39, 7=40-60, 8=60+, 9=smoke but don’t know,98=unknown,99=not asked Source This dataset is found from http://www.stat.berkeley.edu/users/statlabs/labs.html. It accompanies the excellent text Stat Labs: Mathematical Statistics through Applications Springer-Verlag (2001) by Deborah Nolan and Terry Speed. We have the data in http://www.stanford.edu/~rag/stat209/babies.dat TASK 1: Multiple Regression Preliminary: Read in babies dataset and get clean subset of variables, > lab1 = read.table("http://www.stanford.edu/~rag/stat209/babies.dat", header = T) > #read in full babies data set > summary(lab1) id pluralty outcome date gestation sex wt Min. : 15 Min. :5 Min. :1 Min. :1350 Min. :148.0 Min. :1 Min. : 55.0 1st Qu.:5286 1st Qu.:5 1st Qu.:1 1st Qu.:1444 1st Qu.:272.0 1st Qu.:1 1st Qu.:108.8 Median :6730 Median :5 Median :1 Median :1540 Median :280.0 Median :1 Median :120.0 Mean :6001 Mean :5 Mean :1 Mean :1536 Mean :286.9 Mean :1 Mean :119.6 3rd Qu.:7583 3rd Qu.:5 3rd Qu.:1 3rd Qu.:1627 3rd Qu.:288.0 3rd Qu.:1 3rd Qu.:131.0 Max. :9263 Max. :5 Max. :1 Max. :1714 Max. :999.0 Max. :1 Max. :176.0 parity race age ed ht wt1 Min. : 0.000 Min. : 0.000 Min. :15.00 Min. :0.000 Min. :53.00 Min. : 87.0 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:23.00 1st Qu.:2.000 1st Qu.:62.00 1st Qu.:115.0 Median : 1.000 Median : 3.000 Median :26.00 Median :2.000 Median :64.00 Median :126.0 Mean : 1.932 Mean : 3.206 Mean :27.37 Mean :2.922 Mean :64.67 Mean :154.0 3rd Qu.: 3.000 3rd Qu.: 7.000 3rd Qu.:31.00 3rd Qu.:4.000 3rd Qu.:66.00 3rd Qu.:140.0 Max. :13.000 Max. :99.000 Max. :99.00 Max. :9.000 Max. :99.00 Max. :999.0 drace dage ded dht dwt marital Min. : 0.000 Min. :18.00 Min. :0.000 Min. :60.00 Min. :110.0 Min. :0.000 1st Qu.: 0.000 1st Qu.:25.00 1st Qu.:2.000 1st Qu.:70.00 1st Qu.:165.0 1st Qu.:1.000 Median : 3.000 Median :29.00 Median :4.000 Median :73.00 Median :190.0 Median :1.000 Mean : 3.665 Mean :30.74 Mean :3.189 Mean :81.67 Mean :505.4 Mean :1.038 3rd Qu.: 7.000 3rd Qu.:35.00 3rd Qu.:5.000 3rd Qu.:99.00 3rd Qu.:999.0 3rd Qu.:1.000 Max. :99.000 Max. :99.00 Max. :9.000 Max. :99.00 Max. :999.0 Max. :5.000 inc smoke time number Min. : 0.00 Min. :0.0000 Min. : 0.000 Min. : 0.000 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 0.000 Median : 4.00 Median :1.0000 Median : 1.000 Median : 1.000 Mean :13.16 Mean :0.8681 Mean : 1.748 Mean : 2.604 3rd Qu.: 7.00 3rd Qu.:1.0000 3rd Qu.: 1.000 3rd Qu.: 3.000 Max. :98.00 Max. :9.0000 Max. :99.000 Max. :98.000 > dim(lab1)# dim tells you 1236 subjects, 23 vars [1] 1236 23 > # but we have lots of missing data. Because this is a "basics" exercise, we will just use cases with complete data, rather than imputation with "mice" package or equivalent ## note lm will just discard cases with missing data as default > # since we have missing data, for the purposes of this Lab we want > # to clean those cases out and use a data set with complete data > #vars we want to use --somewhat arbitrary choices > # gestation length of gestation in days > # wt birth weight in ounces, 999 unknown > # age mother’s age in years at termination of pregnancy, 99=unknown > # ht mother’s height in inches to the last completed inch, 99=unknown > # wt1 mother pre-pregnancy weight in pounds, 999=unknown > # dht father's height, coding same as for mother's height > # dwt father's weight coding same as for mother's weight > > #create a new data set, a subset of the full data the subset command is very useful (though there are shortcuts for this) > subsetBabies = subset( lab1, subset = gestation < 999 & wt1 < 999 & wt < 999 & ht < 99 & dwt < 999 & dht < 99 & age <99, select = c(gestation, wt, age, ht, wt1, dht, dwt) ) > dim(subsetBabies) [1] 705 7 > # so only 705 of the cases have complete data on the variables of interest, birthweight as outcome and 6 predictors # and none have missing data codes as seen below > summary(subsetBabies) gestation wt age ht wt1 dht Min. :148.0 Min. : 55.0 Min. :15.00 Min. :54.00 Min. : 87.0 Min. :60.00 1st Qu.:272.0 1st Qu.:108.0 1st Qu.:23.00 1st Qu.:62.00 1st Qu.:115.0 1st Qu.:68.00 Median :280.0 Median :120.0 Median :26.00 Median :64.00 Median :125.0 Median :71.00 Mean :279.2 Mean :119.5 Mean :27.39 Mean :64.08 Mean :128.8 Mean :70.26 3rd Qu.:288.0 3rd Qu.:131.0 3rd Qu.:31.00 3rd Qu.:66.00 3rd Qu.:140.0 3rd Qu.:72.00 Max. :353.0 Max. :176.0 Max. :43.00 Max. :72.00 Max. :250.0 Max. :78.00 dwt Min. :110.0 1st Qu.:155.0 Median :170.0 Mean :171.2 3rd Qu.:185.0 Max. :260.0 # to get the 7x7 array of scatterplots use command pairs(subsetBabies) # see MB text 3rd ed p.175 Ireland race data, 2nd ed p.178 for 3x3 example, mice data > attach(subsetBabies) # so we can refer directly to variable names in dataset > names(subsetBabies) [1] "gestation" "wt" "age" "ht" "wt1" "dht" "dwt" # wt will be the outcome variable in the regression demonstrations > #just to get a feel for the data, here's 5-number summary for wt, and correlations among all vars > quantile(wt) 0% 25% 50% 75% 100% 55 108 120 131 176 > length(wt) # number of cases [1] 705 > cor(subsetBabies) # correlation matrix gestation wt age ht wt1 dht dwt gestation 1.00000000 0.39500543 -0.059611693 0.02581315 0.06386776 0.01191594 0.010700934 wt 0.39500543 1.00000000 0.047794571 0.21614525 0.16249354 0.09822157 0.141033572 age -0.05961169 0.04779457 1.000000000 0.01852681 0.19244242 -0.05694164 0.003239663 ht 0.02581315 0.21614525 0.018526813 1.00000000 0.42233662 0.33288888 0.258235062 wt1 0.06386776 0.16249354 0.192442416 0.42233662 1.00000000 0.13149788 0.197935408 dht 0.01191594 0.09822157 -0.056941642 0.33288888 0.13149788 1.00000000 0.542304515 dwt 0.01070093 0.14103357 0.003239663 0.25823506 0.19793541 0.54230452 1.000000000 > > > # Multiple Regression using the data > > fullreg = lm( wt ~ gestation + age + ht + wt1 + dht + dwt, data= subsetBabies) > summary(fullreg) Call: lm(formula = wt ~ gestation + age + ht + wt1 + dht + dwt, data = subsetBabies) Residuals: Min 1Q Median 3Q Max -48.6889 -10.6368 0.2207 10.1633 54.7421 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -101.00106 22.93807 -4.403 1.23e-05 *** gestation 0.45077 0.03893 11.578 < 2e-16 *** age 0.18464 0.10706 1.725 0.0850 . ht 1.23605 0.28406 4.351 1.56e-05 *** wt1 0.03357 0.03396 0.989 0.3232 dht -0.09882 0.26615 -0.371 0.7105 dwt 0.07619 0.03296 2.312 0.0211 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 16.42 on 698 degrees of freedom Multiple R-Squared: 0.2118, Adjusted R-squared: 0.2051 F-statistic: 31.27 on 6 and 698 DF, p-value: < 2.2e-16 > # Rsq not high, 2 highly sig coeff, gestation, mom's ht, also signif dad wt predicting baby wt # so it seems at first glance "important" predictors are mom and dad size and baby being (near-to) full term ------------------------------- more Week 1 content ########################################################################### Task 2: use adjusted variables interpretation to reproduce gestation coefficient in unstandardized regression ############################################################################ > # do adjusted variables stuff (here we also reamed out the outcome variable wt, get same regr coeff using wt or wt adjusted for other preds) > wtdotnogest = lm( wt ~ age + ht + wt1 + dht + dwt, data= subsetBabies) > gestdotpreds = lm( gestation ~ age + ht + wt1 + dht + dwt, data= subsetBabies) > gestcoeffreg = lm(residuals(wtdotnogest) ~ residuals(gestdotpreds)) > summary(gestcoeffreg) Call: lm(formula = residuals(wtdotnogest) ~ residuals(gestdotpreds)) Residuals: Min 1Q Median 3Q Max -48.6889 -10.6368 0.2207 10.1633 54.7421 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.941e-16 6.162e-01 -3.15e-16 1 residuals(gestdotpreds) 4.508e-01 3.879e-02 11.62 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 16.36 on 703 degrees of freedom Multiple R-Squared: 0.1611, Adjusted R-squared: 0.1599 F-statistic: 135 on 1 and 703 DF, p-value: < 2.2e-16 > coefficients(gestcoeffreg)[2] residuals(gestdotpreds) 0.4507681 > coefficients(fullreg)[2] gestation 0.4507681 #these coeffs match, as should be. Also coeff standard errors and t-stat match up to df slightly different. #relevant plot at http://www-stat.stanford.edu/~rag/stat209/09lab1plot.pdf > plot(residuals(gestdotpreds), residuals(wtdotnogest)) > #you can see there are a few observations you would want to look at closer as a few points flatten this slope considerably plot(residuals(gestdotpreds), residuals(wtdotnogest), pch = 20) #better plot character,small dot or Acrobat often makes it a diamond see MB plate 1 ## also as seen in HW1 problem 2, avPlots from car package do this nicely. try it if you like. ====================================================== ################################################################# TASK 3 > # For weeks 2,3: Standardized coefficients, Parts b and c used in week 3 path analysis exs ############################################################### > a. standardize each variable to mean 0, sd 1 and run the multiple regression ######### # built-in function "scale" will do this #?scale gives you scale package:base R Documentation Scaling and Centering of Matrix-like Objects Description: 'scale' is generic function whose default method centers and/or scales the columns of a numeric matrix. Usage: scale(x, center = TRUE, scale = TRUE) Arguments: x: a numeric matrix(like object). center: either a logical value or a numeric vector of length equal to the number of columns of 'x'. scale: either a logical value or a numeric vector of length equal to the number of columns of 'x'. --------------------------- Here's a little aside demo of scale using the HW1 problem 2 data > coleman320dat = read.table("http://www-stat.stanford.edu/~rag/stat209/coleman320.dat", header = T ) > head(coleman320dat) #first 6 schools ssal whcol ses tverb momed vach 1 3.117 46.01 6.818 27.11 6.254 39.53 2 3.279 14.04 -14.060 26.05 5.242 28.86 3 2.132 -27.18 -14.510 23.40 4.086 25.02 4 2.821 57.21 8.022 23.93 6.689 34.63 5 2.241 62.79 8.896 24.60 7.216 35.34 6 2.873 17.71 -9.227 25.18 5.811 29.55 > stcdat = scale(coleman320dat) #standardize all the variables to have mean 0, var 1 ###### with r 3.0 it is safer to do this as sapply(coleman320dat, scale) but either seems to work > head(stcdat) # values after standardizing ssal whcol ses tverb momed vach 1 0.9358639 0.2390086 0.4227210 1.5535123 0.07016417 0.74912638 2 1.2804198 -0.9523682 -1.6859888 0.7655241 -1.43490082 -1.01945935 3 -1.1591207 -2.4884505 -1.7314395 -1.2044465 -3.15412525 -1.65595130 4 0.3063051 0.6563818 0.5443268 -0.8104524 0.71710417 -0.06306388 5 -0.9272899 0.8643231 0.6326022 -0.3123844 1.50086825 0.05462083 6 0.4169033 -0.8156040 -1.1978485 0.1187790 -0.58867356 -0.90508970 # be compulsive and check that this was done right > summary(coleman320dat) # get means ssal whcol ses tverb Median :2.718 Median : 40.20 Median : 1.895 Median :25.08 Mean :2.677 Mean : 39.60 Mean : 2.633 Mean :25.02 momed vach Median :6.176 Median :34.46 Mean :6.207 Mean :35.01 > sqrt(var(coleman320dat)) # diagonal elements are the variable standard deviations ssal whcol ses tverb momed vach ssal 0.4701705 1.546910 0.9865855 0.5874447 0.2512979 0.7691175 whcol 1.5469102 26.834499 14.7372487 1.6011349 4.0768963 11.0696736 ses 0.9865855 14.737249 9.9008406 1.5767437 2.3161102 7.4659055 tverb 0.5874447 1.601135 1.5767437 1.3451978 0.3575505 1.6829610 momed 0.2512979 4.076896 2.3161102 0.3575505 0.6723962 1.7188997 vach 0.7691175 11.069674 7.4659055 1.6829610 1.7188997 6.0330692 check if you like that stvar = (var - varmean)/varsd I checked a couple for you Back to Babies data ---------------------------------------------------------------------------------- OR For those of you who enjoy writing your own functions...... > # here's a little loop (written by Karen Kapur) that creates a new data set, standardizedVars > standardizedVars = subsetBabies > numVars = dim( standardizedVars )[2] > for ( i in 1:numVars ) { + meanVal = mean( standardizedVars[,i] ) + sdVal = sd( standardizedVars[,i] ) + standardizedVars[,i] = ( standardizedVars[,i] - meanVal ) / sdVal + } #see for ref Verzani online primer http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf # page 50 of pdf = page number 46 in that doc # else you could just standardize one-by-one ------------------------------------------------ Anyway you want to produce a dataset called "standardizedVars" > standardizedVars = sapply(subsetBabies, scale) > head(standardizedVars) gestation wt age ht wt1 dht dwt [1,] 0.3023269 0.02603474 -0.0657346 -0.8202766 -1.3897229 -1.83830506 -2.7023243 [2,] 0.1770975 -0.35408782 0.9453830 -0.0302561 0.2981789 -0.09119658 -1.0230873 [3,] 0.4275564 0.89488629 -0.4027738 -0.8202766 -1.7273032 -2.18772675 -1.8185153 [4,] -2.1396470 0.67767340 -0.7398130 0.3647541 0.5393077 0.25822511 0.9212925 [5,] 0.6154005 0.02603474 -0.4027738 -0.8202766 -0.1840788 -0.09119658 0.3910071 [6,] 0.1770975 1.32931206 0.7768634 -0.0302561 -0.2323045 1.30649019 0.6119594 > summary(standardizedVars) gestation wt age ht wt1 Min. :-8.213e+00 Min. :-3.504e+00 Min. :-2.088e+00 Min. :-3.980e+00 Min. :-2.017e+00 1st Qu.:-4.490e-01 1st Qu.:-6.256e-01 1st Qu.:-7.398e-01 1st Qu.:-8.203e-01 1st Qu.:-6.663e-01 Median : 5.187e-02 Median : 2.603e-02 Median :-2.343e-01 Median :-3.026e-02 Median :-1.841e-01 Mean : 6.870e-16 Mean :-2.073e-16 Mean :-2.607e-16 Mean : 2.672e-15 Mean : 2.888e-16 3rd Qu.: 5.528e-01 3rd Qu.: 6.234e-01 3rd Qu.: 6.083e-01 3rd Qu.: 7.598e-01 3rd Qu.: 5.393e-01 Max. : 4.623e+00 Max. : 3.067e+00 Max. : 2.631e+00 Max. : 3.130e+00 Max. : 5.844e+00 dht dwt Min. :-3.585e+00 Min. :-2.702e+00 1st Qu.:-7.900e-01 1st Qu.:-7.138e-01 Median : 2.582e-01 Median :-5.090e-02 Mean :-1.671e-15 Mean : 2.145e-16 3rd Qu.: 6.076e-01 3rd Qu.: 6.120e-01 Max. : 2.704e+00 Max. : 3.926e+00 > cor(standardizedVars) gestation wt age ht wt1 dht dwt gestation 1.00000000 0.39500543 -0.059611693 0.02581315 0.06386776 0.01191594 0.010700934 wt 0.39500543 1.00000000 0.047794571 0.21614525 0.16249354 0.09822157 0.141033572 age -0.05961169 0.04779457 1.000000000 0.01852681 0.19244242 -0.05694164 0.003239663 ht 0.02581315 0.21614525 0.018526813 1.00000000 0.42233662 0.33288888 0.258235062 wt1 0.06386776 0.16249354 0.192442416 0.42233662 1.00000000 0.13149788 0.197935408 dht 0.01191594 0.09822157 -0.056941642 0.33288888 0.13149788 1.00000000 0.542304515 dwt 0.01070093 0.14103357 0.003239663 0.25823506 0.19793541 0.54230452 1.000000000 > cov(standardizedVars) gestation wt age ht wt1 dht dwt gestation 1.00000000 0.39500543 -0.059611693 0.02581315 0.06386776 0.01191594 0.010700934 wt 0.39500543 1.00000000 0.047794571 0.21614525 0.16249354 0.09822157 0.141033572 age -0.05961169 0.04779457 1.000000000 0.01852681 0.19244242 -0.05694164 0.003239663 ht 0.02581315 0.21614525 0.018526813 1.00000000 0.42233662 0.33288888 0.258235062 wt1 0.06386776 0.16249354 0.192442416 0.42233662 1.00000000 0.13149788 0.197935408 dht 0.01191594 0.09822157 -0.056941642 0.33288888 0.13149788 1.00000000 0.542304515 dwt 0.01070093 0.14103357 0.003239663 0.25823506 0.19793541 0.54230452 1.000000000 > summary(standardizedVars) gestation wt age ht wt1 dht dwt Min. :-8.21327 Min. :-3.50367 Min. :-2.0880 Min. :-3.98036 Min. :-2.0167 Min. :-3.5854 Min. :-2.7023 1st Qu.:-0.44905 1st Qu.:-0.62560 1st Qu.:-0.7398 1st Qu.:-0.82028 1st Qu.:-0.6663 1st Qu.:-0.7900 1st Qu.:-0.7138 Median : 0.05187 Median : 0.02603 Median :-0.2343 Median :-0.03026 Median :-0.1841 Median : 0.2582 Median :-0.0509 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 3rd Qu.: 0.55279 3rd Qu.: 0.62337 3rd Qu.: 0.6083 3rd Qu.: 0.75976 3rd Qu.: 0.5393 3rd Qu.: 0.6076 3rd Qu.: 0.6120 Max. : 4.62274 Max. : 3.06702 Max. : 2.6306 Max. : 3.12983 Max. : 5.8441 Max. : 2.7042 Max. : 3.9262 > # means and sd's check, also cov matrix for standardized vars is the corr matrix for original vars > # regression for standardized vars > stdreg = lm( wt ~ gestation + age + ht + wt1 + dht + dwt - 1, data = standardizedVars ) # no intercept, since response standardized > summary( stdreg ) Call: lm(formula = wt ~ gestation + age + ht + wt1 + dht + dwt - 1, data = standardizedVars) Residuals: Min 1Q Median 3Q Max -2.64396 -0.57761 0.01199 0.55190 2.97267 Coefficients: Estimate Std. Error t value Pr(>|t|) gestation 0.39093 0.03374 11.587 < 2e-16 *** age 0.05950 0.03448 1.726 0.0848 . ht 0.16992 0.03902 4.354 1.53e-05 *** wt1 0.03780 0.03821 0.989 0.3229 dht -0.01536 0.04133 -0.372 0.7103 dwt 0.09362 0.04047 2.313 0.0210 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.891 on 699 degrees of freedom Multiple R-Squared: 0.2118, Adjusted R-squared: 0.2051 F-statistic: 31.31 on 6 and 699 DF, p-value: < 2.2e-16 > # you can see same Rsq, adjRsq as before, and same significance for coeffs > stdreg$coef gestation age ht wt1 dht dwt 0.39093303 0.05949877 0.16992335 0.03779855 -0.01535762 0.09362414 #simple relation between standardized and unstandardized coefficients (ratio of sd's #as indicated by in-class formula for regression params) > fullreg$coef[2] #unstandardized regression gestation 0.4507681 > fullreg$coef[2]*sd(gestation)/sd(wt) #formula for standardized coeff, kill the scale gestation 0.390933 # matches above ### for Week 3, we haven't done this yet c.f backside of standardized Var examples handout > # illustrate week 3 computations, path analysis, often only correlation matrix, not actual data are available. > # Week 2 and week 3 class handouts, exs, show the calculations Computing based on Week 2 sttandardized regression and week 3 (Tues, path analysis) content > #### > # b. Coefficient estimates obtained from the correlations > #### > > corPredictors = cor( cbind( gestation, age, ht, wt1, dht, dwt ) ) > corPredictors #6x6 matrix gestation age ht wt1 dht dwt gestation 1.00000000 -0.059611693 0.02581315 0.06386776 0.01191594 0.010700934 age -0.05961169 1.000000000 0.01852681 0.19244242 -0.05694164 0.003239663 ht 0.02581315 0.018526813 1.00000000 0.42233662 0.33288888 0.258235062 wt1 0.06386776 0.192442416 0.42233662 1.00000000 0.13149788 0.197935408 dht 0.01191594 -0.056941642 0.33288888 0.13149788 1.00000000 0.542304515 dwt 0.01070093 0.003239663 0.25823506 0.19793541 0.54230452 1.000000000 > corPredResponse = cbind( cor( cbind(gestation, age, ht, wt1, dht, dwt), wt ) ) > corPredResponse # 6x1 matrix (vector) [,1] gestation 0.39500543 age 0.04779457 ht 0.21614525 wt1 0.16249354 dht 0.09822157 dwt 0.14103357 > coefEsts = solve( corPredictors ) %*% corPredResponse # the "n" terms cancel > # Notice that these are equal to stdreg$coef > coefEsts [,1] gestation 0.39093303 age 0.05949877 ht 0.16992335 wt1 0.03779855 dht -0.01535762 dwt 0.09362414 #or to match better the format of the path analysis example > coefEstst = t(corPredResponse) %*% solve( corPredictors ) > coefEstst gestation age ht wt1 dht dwt [1,] 0.390933 0.05949877 0.1699233 0.03779855 -0.01535762 0.09362414 # match Rsq (squared multiple correlation) > rsq = t(corPredResponse)%*%t(coefEstst) > rsq #matches st regr output [,1] [1,] 0.2118302 > coefEstst%*%corPredResponse #less labored version, match dimensions [,1] [1,] 0.2118302 #to get the adjusted Rsq which penalizes for using more predictors > numObs = length( wt ) #this is "n" > numPreds = 6 #this is "p" 1- (1-rsq)*(numObs / (numObs - numPreds) ) [,1] [1,] 0.2050648 > #### > # c. Standard errors using correlations > #### # I did this in pieces (derivation on p79 of Freedman text) inserting the proper sample size adjustments at the end > 1 - rsq [,1] [1,] 0.7881698 > covB = (.78817)*solve(corPredictors) # need to adjust this below for sample size > covB gestation age ht wt1 dht dwt gestation 0.795731803 0.05978970 0.005351763 -0.06511381 -0.001152706 0.003422662 age 0.059789703 0.83080745 0.044812488 -0.18858210 0.062777851 -0.011621135 ht 0.005351763 0.04481249 1.064445939 -0.41368588 -0.272960029 -0.045169169 wt1 -0.065113813 -0.18858210 -0.413685884 1.02038640 0.062979324 -0.127988650 dht -0.001152706 0.06277785 -0.272960029 0.06297932 1.194182155 -0.589779407 dwt 0.003422662 -0.01162113 -0.045169169 -0.12798865 -0.589779407 1.145008807 #standard errors for standardized coeffs are just square roots of the diagonals adjusted for sample size > sqrt((covB[1,1]/(numObs ))*(numObs / (numObs - numPreds) )) [1] 0.03373998 > sqrt((covB[2,2]/(numObs ))*(numObs / (numObs - numPreds) )) [1] 0.03447558 > sqrt((covB[3,3]/(numObs ))*(numObs / (numObs - numPreds) )) [1] 0.03902323 > sqrt((covB[4,4]/(numObs ))*(numObs / (numObs - numPreds) )) [1] 0.03820707 > sqrt((covB[5,5]/(numObs ))*(numObs / (numObs - numPreds) )) [1] 0.04133298 > sqrt((covB[6,6]/(numObs ))*(numObs / (numObs - numPreds) )) [1] 0.04047304 #Compare with regression using the actual data > summary( stdreg ) # compare s.e. estimates from regressions run on the data Coefficients: Estimate Std. Error t value Pr(>|t|) gestation 0.39093 0.03374 11.587 < 2e-16 *** age 0.05950 0.03448 1.726 0.0848 . ht 0.16992 0.03902 4.354 1.53e-05 *** wt1 0.03780 0.03821 0.989 0.3229 dht -0.01536 0.04133 -0.372 0.7103 dwt 0.09362 0.04047 2.313 0.0210 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.891 on 699 degrees of freedom Multiple R-Squared: 0.2118, Adjusted R-squared: 0.2051 F-statistic: 31.31 on 6 and 699 DF, p-value: < 2.2e-16 ------------------------------------------------ ################################################## TASK 4 Logistic regression review (using babies data) ################################################# #For dichotomous response variable use logistic regression. To illustrate, #Create a variable to indicate whether or not the baby is premature (a birth is considered premature #if the gestation period is less than 37 full weeks). Notice that now our response variable is #categorical and our predictor variable is continuous. Use logistic regression to model how the # probability of a premature baby depends on the baby’s weight at birth. > subsetBabies$preemie = as.numeric(subsetBabies$gestation < 7*37) # make the 0,1 indicator, add to dataframe > head(subsetBabies) gestation wt age ht wt1 dht dwt preemie 1 284 120 27 62 100 65 110 0 2 282 113 33 64 135 70 148 0 6 286 136 25 62 93 64 130 0 8 245 132 23 65 140 71 192 1 9 289 120 25 62 125 70 180 0 12 282 144 32 64 124 74 185 0 > attach(subsetBabies) > table(preemie) preemie 0 1 650 55 > dim(subsetBabies) [1] 705 8 > lreg = glm(preemie ~ wt, family = binomial, data = subsetBabies) > summary(lreg) Call: glm(formula = preemie ~ wt, family = binomial, data = subsetBabies) Deviance Residuals: Min 1Q Median 3Q Max -1.2662 -0.3971 -0.2647 -0.1695 3.1205 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 5.204217 0.946504 5.498 3.83e-08 *** wt -0.069415 0.009012 -7.702 1.34e-14 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 386.19 on 704 degrees of freedom Residual deviance: 311.97 on 703 degrees of freedom AIC: 315.97 Number of Fisher Scoring iterations: 6 #plot the data and the logistic fit > plot(wt, preemie, xlab = "Baby's weight", ylab = "Predicted probability of a premature baby", xlim = range(0, 180)) > curve(exp(5.204 -.0694*x)/(1+exp(5.204 - .0694*x)), add = TRUE,1, 180) # Using the information from the model fit, what is the probability that the baby is premature given that the birth weight = 100? > predict.glm(lreg, type="response", newdata = data.frame(wt = 100)) 1 0.1496545 > -------------------------------------------------------- ------------------------------------------------------- End Lab 1 January 2016