Estimating controlled direct effects

This vignette demonstrates how to estimate the controlled direct effect of some treatment, net the effect of some mediator using sequential g-estimation approach as described by Acharya, Blackwell, and Sen (2016). With standard regression and matching approaches, estimating the controlled direct effect is difficult because it often requires adjusting for covariates that causally affect the mediator and are causally affected by the treatment. Including such variables in a regression model can lead to post-treatment bias when estimating the direct effect of treatment. A large class of models and estimation approaches have been developed to avoid this problem, but here we focus on one called sequential g-estimation. This approach uses a standard linear model to estimate the effect of the mediator, and then removes that effect from effect from the dependent variable to estimate the direct effect of treatment without post-treatment bias.

Implementation

This vignette introduces a new way to design, specify, and implement various estimators for controlled direct effects using a coding style that we refer to as the “assembly line.” Users can specify the various components of the estimators sequentially and easily swap out implementation details such as what exact estimators are used for the estimation of each nuisance function or what arguments to pass to these functions. A unified structure and output across different research designs and estimators should make comparing estimators very easy. Finally, this implementation is built around the idea of cross-fitting for nuisance function estimation.

Let’s see how this works in practice. We first load a subset of the jobcorps data from the package:

data("jobcorps")
jc <- jobcorps[
  1:200,
  c("treat", "female", "age_cat", "work2year2q", "pemplq4", "emplq4", "exhealth30")
]

This is the Job Corps experiment data used in Huber (2014) to estimate the effect of a randomized job training program on self-assessed health holding constant the intermediate outcome of employment. We use a subset of the variables from their study, but the basic variable we use here:

\(Y_i\) is an indicator for whether participants reported “very good” health after 2.5 years out after randomization (exhealth30)
\(A_i\) is an indicator for assignment to the job training program (treat)
\(M_i\) is an indicator for employment 1 to 1.5 years after assignment (work2year2q)
\(Z_i\) are the post-treatment, pre-mediator intermediate confounders (emplq4, pemplq4)
\(X_i\) are the pre-treatment characteristics (female, age_cat)

In context of our CDE estimators, we refer to be \(A_i\) (the treatment) and \(M_i\) (the mediator) as treatment variables because we are interested in causal effects that contrast different combinations of these two variables.

The first step in our assembly line is to choose the type of CDE estimator (or really estimation strategy) that we want to pursue. These choices include IPW (cde_ipw()), regression imputation (cde_reg_impute()), and augmented IPW (cde_aipw()) among others. Some of these different functions will take different arguments to specify different aspects of the estimator. The AIPW estimator is useful to demonstrate because it uses many of the features of the assembly line approach.

my_aipw <- cde_aipw() |>
  set_treatment(treat, ~ female + age_cat) |>
  treat_model(engine = "logit") |>
  outreg_model(engine = "lm") |>
  set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
  treat_model(engine = "logit") |>
  outreg_model(engine = "lm") |>
  estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)

The call here consists of several steps connected by R’s pipe operator. The first call cde_aipw() simply initializes the estimator as taking the AIPW form, meaning that it will use both propensity score models and outcome regressions to estimate the causal effects.

Following this we have two blocks of 3 lines each that look similar with a few differences. These “treatment blocks” specify the causal variables of interest in their causal ordering. In set_treatment() we set the name of the treatment variable and give a formula describing the pre-treatment covariates for this treatment. This formula, unless overridden by the next two functions, will be used in the fitting of the propensity score and outcome regression models. treat_model() allows the user to specify how to fit the propensity score model for this treatment with a choice of engine and optional arguments to pass to each engine. outreg_model() does the same for the outcome regression. We repeat this specification for the mediator.

The estimate() function finally implements the specifications given in the previous commands. Users use a formula to set the outcome and which of the treatment variables we want effects for and indicate the data frame where all of these variables can be found. By default, the nuisance functions (the propensity score model and the outcome regression) are fit using cross-fitting.

We can either use the summary() or tidy() functions to get a summary of the different effects being estimated:

summary(my_aipw)

## 
##  aipw CDE Estimator
## ---------------------------
## Causal variable: treat 
## 
## Treatment model: treat ~ female + age_cat 
## Treatment engine: logit 
## 
## Outcome model: ~female + age_cat 
## Outcome engine: lm 
## ---------------------------
## Causal variable: work2year2q 
## 
## Treatment model: work2year2q ~ emplq4 + pemplq4 + female + age_cat 
## Treatment engine: logit 
## 
## Outcome model: ~emplq4 + pemplq4 + female + age_cat 
## Outcome engine: lm 
## ---------------------------
## Cross-fit: TRUE 
## Number of folds: 5 
## 
## Estimated Effects:
##                                  Estimate Std. Error  t value Pr(>|t|)
## treat [(1, 0) vs. (0, 0)]        0.030412    0.02259  1.34644  0.17824
## treat [(1, 1) vs. (0, 1)]        0.029877    0.01396  2.14080  0.03233
## work2year2q [(0, 1) vs. (0, 0)]  0.001241    0.02077  0.05974  0.95237
## work2year2q [(1, 1) vs. (1, 0)] -0.001672    0.01655 -0.10101  0.91955
##                                  CI Lower CI Upper   DF
## treat [(1, 0) vs. (0, 0)]       -0.013871  0.07469 3818
## treat [(1, 1) vs. (0, 1)]        0.002518  0.05724 6207
## work2year2q [(0, 1) vs. (0, 0)] -0.039475  0.04196 3991
## work2year2q [(1, 1) vs. (1, 0)] -0.034120  0.03078 6034

tidy(my_aipw)

##                                            term     estimate  std.error
## treat_1_0             treat [(1, 0) vs. (0, 0)]  0.030411569 0.02258657
## treat_1_1             treat [(1, 1) vs. (0, 1)]  0.029876867 0.01395590
## work2year2q_0_1 work2year2q [(0, 1) vs. (0, 0)]  0.001240634 0.02076711
## work2year2q_1_1 work2year2q [(1, 1) vs. (1, 0)] -0.001671883 0.01655227
##                   statistic    p.value     conf.low  conf.high   df
## treat_1_0        1.34644455 0.17823912 -0.013871340 0.07469448 3818
## treat_1_1        2.14080471 0.03232863  0.002518461 0.05723527 6207
## work2year2q_0_1  0.05974031 0.95236546 -0.039474510 0.04195578 3991
## work2year2q_1_1 -0.10100627 0.91954883 -0.034120251 0.03077648 6034

Because the estimate() function included both treat and work2year2q in the formula, the output includes both the controlled direct effects of the treatment and the conditional average treatment effect of the mediator. Furthermore, by default, the output contains estimates for an effect for each combination of other treatment variables. This can be changed by setting the eval_vals argument in the relevant set_treatment() call.

The modular structure of the call allow users to easily swap out parts of the models. For example, we could easily alter the above call to use lasso versions of the regression and logit model.

my_aipw_lasso <- cde_aipw() |>
  set_treatment(treat, ~ female + age_cat) |>
  treat_model(engine = "rlasso_logit") |>
  outreg_model(engine = "rlasso") |>
  set_treatment(work2year2q, ~ emplq4 + pemplq4) |>
  treat_model(engine = "rlasso_logit") |>
  outreg_model(engine = "rlasso") |>
  estimate(exhealth30 ~ treat + work2year2q, data = jobcorps)

Here we use a “rigorous” version of the lasso from the hdm package to speed up computation. There are a number of different engines for different outcome and treatment types including:

GLMs: lm (OLS), logit (logistic regression), multinom (mulitnomial logit)
LASSO estimators (from glmnet): lasso, lasso_logit, lasso_multinom
Rigorous LASSO estimator (from hdm): rlasso, rlasso_logit

Estimating controlled direct effects

2024-10-01

Quantity of interest

Implementation

References