Problem 2 Dose (education), Response (wage)
Is your education worth it? Worth anything? Only the economists (claim to) know.
But we do have some (old) data that are still used a lot at present. They really do pertain to your grandparents (almost).
Keep in mind for the salary data that one dollar in 1976 equates to about $4.25 in 2017.
-------------------------------------
card.data {ivmodel} R Documentation
Card (1995) Data
Description: Data from the National Longitudinal Survey of Young Men (NLSYM) that was used by Card (1995).
Usage: data(card.data)
Format: A data frame with 3010 observations on 35 variables.
For the full description, see ?card.data or the ivmodel manual page.
-------------------------------------
The ivmodel package is from a friend of this course, Hyunseung Kang (co-author and maintainer). Package manual: https://cran.r-project.org/web/packages/ivmodel/ivmodel.pdf
Vignette (to appear in the Journal of Statistical Software): http://www.stat.wisc.edu/~hyunseung/ivmodel.pdf
ivmodel: An R Package for Inference and Sensitivity Analysis of Instrumental Variables Models with One Endogenous Variable
Section 7 of the ivmodel vignette uses the Card data; these data are available in the ivmodel package (and in many other R locations).
-------------------------------------------------------------------------------------------------------------------------------
The objective of all the various analyses of these data is a basic dose-response conclusion in the form of returns to schooling.
Note: dose-response is the topic of the Week 7 Computing Corner.
Dose is the amount of education. In the data set:
  educ: subject's years of education (an integer from 1 to 18)
Response (outcome) is wages in cents per hour in 1976; the outcome variable used is lwage (log wage).
The dataset contains many interesting and useful measures, but the most useful have a lot of missing data (e.g. fatheduc, motheduc, IQ)
and therefore are not employed in this problem. (In real life I'd probably do some mice, i.e. multiple imputation, or equivalents at certain points, but here
I try to keep things simpler.)
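In case the units get confusing, here is a rough sketch of the arithmetic. It assumes lwage is the natural log of wage (wage in 1976 cents per hour) and uses the 4.25 inflation factor quoted above. The function names are my own illustrations, not part of ivmodel.

```r
# Convert a log-wage value (log of 1976 cents per hour, an assumption about
# how lwage is coded) into approximate 2017 dollars per hour.
wage_2017_dollars <- function(lwage, inflation = 4.25) {
  cents_1976 <- exp(lwage)        # undo the log
  cents_1976 / 100 * inflation    # cents -> dollars, then 1976 -> 2017
}

# A slope b on the log-wage scale corresponds to an approximate percent
# change in wage of 100 * (exp(b) - 1) per unit of dose.
pct_change_from_log_slope <- function(b) {
  100 * (exp(b) - 1)
}
```

For example, wage_2017_dollars(log(500)) treats 500 cents ($5.00) per hour in 1976 as 5 * 4.25 dollars per hour in 2017.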
part a. Basic dose-response.
i. A stat60 student might investigate the question: what is the change in "response" log-wage (lwage) for a unit change in "dose" (educ)?
Does a straight-line fit (constant effect of a unit change in educ) appear adequate in these data? What is the prima facie dose-response
curve and the effect of increasing the dose one unit? Give a point-estimate and confidence interval.
Although the age range (in 1976) is from 24 to 34, almost 10% of the subjects are indicated to be enrolled in college in 1976
(that seems to be never noted in the many many analyses of these data).
-------------------------------------
  enroll: indicator for whether subject is enrolled in college in 1976
> table(cd$enroll, cd$age)   # I abbreviate card.data as cd
     24  25  26  27  28  29  30  31  32  33  34
  0 355 329 346 298 278 208 183 155 206 179 195
  1  40  43  40  41  34  25  14  11   7  13  10
> table(cd$enroll)
   0    1
2732  278
------------------------------------
I would submit that a student enrolled (part or full time) in college is in a different career domain and on a different salary
trajectory than non-students, say those in full-time jobs (i.e. the enrolled may be in temporary low-paying jobs but have high educ).
So I want to set those enrolled students aside and proceed with the 2732 (out of 3010) not enrolled in college in 1976.
ii. Repeat part a(i) for this subsample (I call it cd_1) of non-enrolled adults. Do the results change?
iii. What problems do you see with forming conclusions from the OLS dose-response estimation in parts i or ii?
What condition(s) would have to hold for the part a analyses to be taken seriously?
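A minimal sketch of how parts a(i) and a(ii) might be set up, not the official solution. The helper names are hypothetical. The straight-line check used here (comparing the line against a model with one free mean per educ level) is just one reasonable choice; a plot of mean lwage by educ would do as well.

```r
# Keep only subjects not enrolled in college in 1976 (this gives cd_1).
drop_enrolled <- function(d) {
  d[d$enroll == 0, ]
}

# Straight-line dose-response fit of lwage on educ, with a 95% interval
# for the slope and a lack-of-fit comparison against a model that gives
# each educ level its own mean.
fit_dose_response <- function(d) {
  linear   <- lm(lwage ~ educ, data = d)
  by_level <- lm(lwage ~ factor(educ), data = d)
  list(
    slope       = coef(linear)[["educ"]],
    ci          = confint(linear, "educ", level = 0.95),
    lack_of_fit = anova(linear, by_level)   # F test: line vs. free means
  )
}

# Usage with the course data (assumes the ivmodel package is installed):
# library(ivmodel); data(card.data)
# cd   <- card.data
# cd_1 <- drop_enrolled(cd)
# fit_dose_response(cd)     # part a(i)
# fit_dose_response(cd_1)   # part a(ii)
```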
part b. Regression Adjustments (a la ancova)
The most common attempt (seen repeatedly in current publications) to heal any problems you may have identified with the part a analysis
is to toss in additional predictors (by analogy with analysis of covariance for a binary dose or treatment variable).
With the subsample cd_1 from part a(ii), try the set of additional predictors (some used by Kang) exper, expersq, black, south, momdad14.
Compare your results for returns-to-education with those in part a.
Two notes.
First, one would like to use fatheduc and motheduc, but there is too much missing data (732 cases) to make that worthwhile
without extra imputation, and neither is a significant predictor, if that matters.
Second, age is confounded with exper to the extent that you will get a singularity result if you try to add age to the prediction equation.
If you use age instead of the experience measures, does your result for returns-to-education change?
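One way the regression adjustment could be organized is sketched below. The helper fit_adjusted is a hypothetical name; it only wraps lm() so that the covariate set is easy to swap (experience measures versus age).

```r
# Regress lwage on educ plus a chosen set of covariates and return the
# educ coefficient with its 95% confidence interval.
fit_adjusted <- function(d,
                         covars = c("exper", "expersq", "black",
                                    "south", "momdad14")) {
  form <- reformulate(c("educ", covars), response = "lwage")
  fit  <- lm(form, data = d)
  ci   <- confint(fit, "educ", level = 0.95)
  list(estimate = coef(fit)[["educ"]],
       ci       = c(lower = ci[1, 1], upper = ci[1, 2]),
       fit      = fit)
}

# Usage (assumes cd_1 is the non-enrolled subsample from part a):
# fit_adjusted(cd_1)                                    # experience measures
# fit_adjusted(cd_1, covars = c("age", "black",
#                               "south", "momdad14"))   # age instead
```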
part c. Instrumental Variables Estimates
Economists take a different approach to improving the part a analysis, and these Instrumental Variables methods were introduced in Weeks 8 and 9.
Kang uses (as others have done) the instrument nearc4. Carry out that IV estimation (with no covariates) and give a point and interval estimate
for the dose-response (educ-lwage) relationship (use cd_1). Compare the results with parts a and b.
What properties and assumptions must the instrument nearc4 satisfy for this analysis?
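To make the IV calculation less of a black box, here is a base-R sketch of the single-instrument, no-covariate estimator (the ratio cov(Z,Y)/cov(Z,D)) with a large-sample interval. This is my own illustration under the usual homoskedastic 2SLS variance formula, not the internals of ivmodel; the name iv_wald is hypothetical. The ivmodel call in the comments is what I would expect to use in practice.

```r
# Just-identified IV estimate of the effect of dose d on outcome y,
# using a single instrument z and an intercept, no other covariates.
iv_wald <- function(y, d, z, level = 0.95) {
  n    <- length(y)
  beta <- cov(z, y) / cov(z, d)            # IV (Wald-type) slope
  alpha <- mean(y) - beta * mean(d)
  resid <- y - alpha - beta * d            # structural residuals
  sigma2 <- sum(resid^2) / (n - 2)
  szz <- sum((z - mean(z))^2)
  szd <- sum((z - mean(z)) * (d - mean(d)))
  se  <- sqrt(sigma2 * szz / szd^2)        # asymptotic standard error
  crit <- qnorm(1 - (1 - level) / 2)
  list(estimate = beta,
       se       = se,
       ci       = c(lower = beta - crit * se, upper = beta + crit * se))
}

# Usage (assumes cd_1 from part a and that ivmodel is installed):
# iv_wald(cd_1$lwage, cd_1$educ, cd_1$nearc4)
# library(ivmodel)
# ivmodel(Y = cd_1$lwage, D = cd_1$educ, Z = cd_1$nearc4)
```

If the instrument is weak (small cov(Z,D)), this normal-theory interval can be misleading, which is one reason to look at the additional intervals ivmodel reports.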
part d. Extreme groups on Education (< 11, > 16)
Another very common approach in epidemiology is to take a continuous dose measure, segment the dose into quintiles, and estimate the relation with outcome
(measured outcome, binary outcome, even survival analysis). Featured results are reported for comparison of the extremes (highest versus lowest dose groups),
employing the interpolation of the regression over all doses. I suppose the purpose is better (more vivid) news headlines.
To take a highly critical look at this standard practice from our matching perspective, construct some groups based on education level.
Because the educ variable is an integer (try table(educ)) and lumpy (12 HS grad, and 16 college grad), quintiles won't work well.
Look at signif(table(cd_1$educ)/2732, 3)
My best choice was educ levels (< 11, > 16) with sizes more like eighths than fifths, but at least approximately equal.
> sum(cd_1$educ < 11) # use cd_1 throughout the problem
[1] 337
> sum(cd_1$educ >16)
[1] 293
Keep in mind the two groups are "did not complete HS" and "more than 4-year college", likely quite different people.
Try a simple outcome comparison (e.g. t-test) for the extreme groups; this often goes by the name "unadjusted" comparison. Also try an
analysis of covariance analogous to part b, but here with a binary group (H vs L); this is often called the "adjusted" analysis. Compare results
with parts a, b, c.
------------------------------------------
In case it is helpful (many of you might do this in a fancier way): to get a data subset containing the 2 extreme groups, I did (and checked)
> cd_2 = cd_1[cd_1$educ > 16 | cd_1$educ < 11,]
> dim(cd_2)
[1] 630 35
> 337 + 293
[1] 630
> table(cd_2$educ)
  1   2   3   4   5   6   7   8   9  10  17  18
  1   2   3   3  10  16  29  68  80 125  88 205
> cd_2$G = as.numeric(cd_2$educ > 16) # G = 1 is top group educ > 16
> table(cd_2$G)
  0   1
337 293
---------------------------------------
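With cd_2 and the indicator G in hand, the "unadjusted" and "adjusted" comparisons of part d might be sketched as follows. The function name compare_extremes is hypothetical, and the covariate set simply mirrors part b; treat it as one possible setup rather than the intended answer.

```r
# Unadjusted (Welch t-test) and adjusted (ancova-style lm) comparison of
# lwage between the high (G = 1) and low (G = 0) education groups.
# Both results are reported as High minus Low.
compare_extremes <- function(d,
                             covars = c("exper", "expersq", "black",
                                        "south", "momdad14")) {
  tt <- t.test(lwage ~ G, data = d)
  # t.test reports group 0 minus group 1, so flip sign and interval ends
  unadj <- c(estimate = unname(diff(tt$estimate)),
             lower    = -tt$conf.int[2],
             upper    = -tt$conf.int[1])

  fit <- lm(reformulate(c("G", covars), response = "lwage"), data = d)
  ci  <- confint(fit, "G")
  adj <- c(estimate = coef(fit)[["G"]],
           lower    = ci[1, 1],
           upper    = ci[1, 2])

  list(unadjusted = unadj, adjusted = adj)
}

# Usage (assumes cd_2 with column G as constructed above):
# compare_extremes(cd_2)
```

Note these are High-versus-Low differences in lwage, not per-year slopes, so some care is needed when setting them beside the estimates from parts a, b, c.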
part e. Matching estimators for binary groups (L,H dose of education)
For the individuals in cd_1 that belong to the < 11 or > 16 groups, use matching estimators for this binary High vs Low education comparison to
assess the returns to schooling (outcome lwage).
(i) Try 1:1 matching (293 pairs, regarding > 16 as treatment).
Comment on the 'success' of the matching and the implications for the popular extreme-groups strategy seen in part d.
[Note: matchit annoyingly requires that the data set have no missing data, even in variables you are not using. You need to select the columns you are using into a
subset dataset.]
(ii) Use the results from the pair matching to carry out outcomes (lwage) analysis.
Two suggested ways to do this.
First, use lmer (measured outcome) with 293 subclasses. An issue there is that with only two observations in each
subclass, lmer needs to be coaxed into performing by adding the argument control = lmerControl(check.nobs.vs.nRE = "warning")
after the usual data argument.
Second, use a paired t-test on outcomes (or a nonparametric equivalent) for the 293 pairs. To massage the data into an easy form for the paired t-test,
use the same manipulations you need for constructing a sensitivity data set in part f.
Compare your results with parts a-d, and note any problems or impediments in this analysis.
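The data handling around the matching can be sketched in base R as below. Both helper names are hypothetical. The commented MatchIt lines are how I would expect the calls to look, assuming a MatchIt version whose match.data() output includes a pair identifier column named subclass; check ?matchit and ?match.data for your installed version.

```r
# Select only the columns needed and drop rows with missing values in them,
# so that unused variables with NAs do not get in the way.
complete_subset <- function(d, vars) {
  d[complete.cases(d[, vars]), vars]
}

# Turn matched data (one row per subject, with a pair id) into one row per
# pair, holding the treated and control outcomes side by side.
pair_outcomes <- function(md, outcome = "lwage", treat = "G",
                          pair = "subclass") {
  md <- md[order(md[[pair]]), ]
  treated <- md[md[[treat]] == 1, ]
  control <- md[md[[treat]] == 0, ]
  # each pair must contribute exactly one treated and one control subject
  stopifnot(identical(as.character(treated[[pair]]),
                      as.character(control[[pair]])))
  data.frame(pair    = treated[[pair]],
             treated = treated[[outcome]],
             control = control[[outcome]])
}

# Usage (assumes cd_2 with G from part d, and that MatchIt is installed):
# library(MatchIt)
# vars <- c("lwage", "G", "exper", "expersq", "black", "south", "momdad14")
# cd_3 <- complete_subset(cd_2, vars)
# m    <- matchit(G ~ exper + expersq + black + south + momdad14,
#                 data = cd_3, method = "nearest", ratio = 1)
# summary(m)                      # balance diagnostics for part e(i)
# pairs <- pair_outcomes(match.data(m))
# t.test(pairs$treated, pairs$control, paired = TRUE)   # part e(ii)
```

The same one-row-per-pair layout is a convenient starting point for the sensitivity calculations in part f.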
part f. Sensitivity analysis for 1:1 matching results
From the 1:1 matching exercise, construct a dataset suitable for sensitivity analyses (CC week 5, lectures 4 and 5). Carry out appropriate sensitivity
calculations for the results in part e. How large must the parameter "gamma" be to render the H vs L educ comparison non-significant?
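For those who want to see what a sensitivity calculation is doing, here is a base-R sketch of the Rosenbaum-style upper bound on the one-sided p-value of the Wilcoxon signed-rank test for matched-pair differences. It uses the large-sample normal approximation and ignores tie corrections, so treat it as an illustration rather than a replacement for a packaged routine (e.g. in rbounds). The function names are hypothetical.

```r
# Upper bound on the one-sided signed-rank p-value when hidden bias could
# change the within-pair odds of treatment by at most a factor gamma.
# diffs = treated outcome minus control outcome, one value per pair.
signrank_upper_p <- function(diffs, gamma = 1) {
  diffs <- diffs[diffs != 0]              # zero differences carry no sign
  n     <- length(diffs)
  ranks <- rank(abs(diffs))
  t_obs <- sum(ranks[diffs > 0])          # signed-rank statistic
  p_plus <- gamma / (1 + gamma)           # worst-case chance of a + sign
  mu <- p_plus * n * (n + 1) / 2
  v  <- p_plus * (1 - p_plus) * n * (n + 1) * (2 * n + 1) / 6
  pnorm((t_obs - mu) / sqrt(v), lower.tail = FALSE)
}

# Smallest gamma on a grid at which the bound exceeds alpha
# (NA if the result stays significant over the whole grid).
critical_gamma <- function(diffs, alpha = 0.05,
                           gammas = seq(1, 10, by = 0.05)) {
  p <- vapply(gammas, function(g) signrank_upper_p(diffs, g), numeric(1))
  gammas[which(p > alpha)[1]]
}

# Usage (assumes a one-row-per-pair data frame `pairs` with columns
# treated and control, built from the 1:1 matching in part e):
# critical_gamma(pairs$treated - pairs$control)
```

At gamma = 1 the bound reduces to the ordinary normal-approximation signed-rank p-value, which is a handy check.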
--------------------------------------------------------------------------------------------------------
Extra Credit, or just a reminder that we displayed these advanced dose-response methods (i.e. read-only)
part g. Dose-response functions for Observational Data
Week 7 CC (beyond binary treatments) illustrated developing methods for dose-response estimation using generalized propensity scores. Try out one
of those estimation methods (e.g. HI) for the lwage-educ dose-response function using cd_1 data. Use the measures from parts b and e as background characteristics.
Compare results with the prior analyses.
If using the formal ADRF methods is too imposing (or doesn't work), try this informal two-part method (adapted from binary treatment).
Also see the ADRF summary slide.
First, estimate expected dose using background measures
[analog to logistic regression predicting binary treatment; fit is estimate of expected value of 0,1 treatment var].
Second, estimate dose-response using that measure as an additional predictor (analogous to using propensity score as a covariate in binary treatments in earlier CC).
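The informal two-part method can be sketched in a few lines. The helper name two_part_dose_response is hypothetical, and using plain lm() for both stages is my assumption about the simplest reading of the recipe above.

```r
# Stage 1: predict the dose (educ) from background measures.
# Stage 2: regress lwage on educ plus the predicted dose, analogous to
#          using a propensity score as a covariate for a binary treatment.
two_part_dose_response <- function(d, covars) {
  dose_fit   <- lm(reformulate(covars, response = "educ"), data = d)
  d$educ_hat <- predict(dose_fit, newdata = d)    # expected dose
  out_fit    <- lm(lwage ~ educ + educ_hat, data = d)
  ci <- confint(out_fit, "educ", level = 0.95)
  list(estimate = coef(out_fit)[["educ"]],
       ci       = c(lower = ci[1, 1], upper = ci[1, 2]),
       fit      = out_fit)
}

# Usage (assumes cd_1 from part a; covariates as in part b):
# two_part_dose_response(cd_1, c("exper", "expersq", "black",
#                                "south", "momdad14"))
```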
-------------------------------------------------------------------------------------------------------
END Problem 2