This article is an introduction to Double/Debiased Machine Learning using short-stacking in R. Topics discussed below include:
For illustration, we apply ddml to the included random
subsample of 5,000 observations from the data of Angrist & Evans
(1998). The data contains information on the labor supply of mothers,
their children, as well as demographic data. See ?AE98 for
details.
# Load ddml and set seed
library(ddml)
set.seed(35611)
# Construct variables from the included Angrist & Evans (1998) data
y = AE98[, "worked"]
D = AE98[, "morekids"]
Z = AE98[, "samesex"]
X = AE98[, c("age","agefst","black","hisp","othrace","educ")]ddml_late estimates the local average treatment effect
(LATE) using Double/Debiased Machine Learning (see
?ddml_late). The high-dimensional nuisance parameters
arising in the estimate of the LATE are conditional expectation
functions of the control variables \(X\). In particular, we require first step
estimates of the reduced forms \(E[Y|Z=z, X],
E[D|Z=z, X]\) for \(z=0,1\) and
\(E[Z|X]\). In the absence of
functional form assumptions, these conditional expectations need to be
estimated nonparametrically.
Here, we consider gradient boosting from the popular xgboost package to estimate the
nuisance parameters. The function mdl_xgboost is a wrapper
for xgboost, allowing to specify all parameters of the
original function. See ?mdl_xgboost for details and take a
look at vignette("new_ml_wrapper") to learn how to write a
wrapper for a different machine learner yourself.
# Specify a single learner
learners_single <- list(what = mdl_xgboost,
args = list(nrounds = 10,
max_depth = 1))Double/Debiased Machine Learning relies on cross-fitting to avoid
large bias from overfitting when estimating the nuisance parameters. The
argument sample_folds = 3 implies that 2/3 of the
observations – about 3,333 observations – are used to train the machine
learner in each cross-fitting sample fold.
# Estimate the local average treatment effect using xgboost.
late_fit <- ddml_late(y, D, Z, X,
learners = learners_single,
sample_folds = 3,
silent = TRUE)
summary(late_fit)
#> DDML estimation: Local Average Treatment Effect
#> Obs: 5000 Folds: 3
#>
#> Estimate Std. Error z value Pr(>|z|)
#> LATE -0.244 0.196 -1.25 0.21(Note that estimation here is based on a random subsample of 5,000 observations. The results can thus not readily be compared to those in Angrist & Evans (1998).)
Since the statistical properties of machine learners depend heavily
on the underlying (unknown!) structure of the data, adaptive combination
of multiple machine learners can increase robustness. In the below
snippet, ddml_late estimates the LATE with short-stacking
based on three base learners:
?ols)?mdl_glmnet)?mdl_xgboost)learners_multiple <- list(list(what = ols),
list(what = mdl_glmnet),
list(what = mdl_xgboost,
args = list(nrounds = 10,
max_depth = 1)))Short-stacking is a computationally convenient variant of stacking
originally introduced by Wolpert (1992). Stacking constructs linear
combinations of base learners to minimize the out-of-sample mean squared
error of a particular reduced form (e.g., \(E[Z|X]\)). Short-stacking uses the
out-of-sample predictions that naturally arise in computation of
Double/Debiased Machine Learning estimates due to cross-fitting, which
substantially reduces the computational burden (see
vignette("stacking")).
In finite samples, regularizing the linear combination of base
learners as constructed via (short-)stacking can improve statistical
properties. This can be specified via the ensemble_type
argument. Below, ddml_late estimates the nuisance
parameters via linear combinations of the three base learners with
linear coefficients that are constrained to be non-negative and sum to
one.
# Estimate the local average treatment effect using short-stacking with base
# learners ols, lasso, and xgboost.
late_fit <- ddml_late(y, D, Z, X,
learners = learners_multiple,
ensemble_type = 'nnls1',
shortstack = TRUE,
sample_folds = 3,
silent = TRUE)
summary(late_fit)
#> DDML estimation: Local Average Treatment Effect
#> Obs: 5000 Folds: 3 Stacking: short-stack
#>
#> Estimate Std. Error z value Pr(>|z|)
#> LATE -0.235 0.189 -1.24 0.21It is often insightful to see which base learners contribute the most
to the final reduced form estimates. The diagnostics()
function provides a compact summary of each learner’s out-of-sample mean
squared prediction error (MSPE), \(R^2\), and the stacking weight assigned by
the chosen ensemble type:
diagnostics(late_fit)
#> Stacking diagnostics: Local Average Treatment Effect
#> Obs: 5000
#>
#> y_X_Z0:
#> learner mspe r2 weight_nnls1
#> learner_1 0.2415 0.0289 0.7891
#> learner_2 0.2416 0.0283 0.0000
#> learner_3 0.2442 0.0179 0.2109
#> nnls1 0.2413 0.0297 NA
#>
#> y_X_Z1:
#> learner mspe r2 weight_nnls1
#> learner_1 0.2421 0.0305 0.8593
#> learner_2 0.2421 0.0305 0.0000
#> learner_3 0.2462 0.0143 0.1407
#> nnls1 0.2420 0.0310 NA
#>
#> D_X_Z0:
#> learner mspe r2 weight_nnls1
#> learner_1 0.2085 0.0820 0.8911
#> learner_2 0.2085 0.0820 0.0000
#> learner_3 0.2141 0.0573 0.1089
#> nnls1 0.2084 0.0824 NA
#>
#> D_X_Z1:
#> learner mspe r2 weight_nnls1
#> learner_1 0.2231 0.0828 0.909
#> learner_2 0.2231 0.0828 0.000
#> learner_3 0.2297 0.0556 0.091
#> nnls1 0.2231 0.0830 NA
#>
#> Z_X:
#> learner mspe r2 weight_nnls1
#> learner_1 0.2502 -0.0011 0
#> learner_2 0.2499 0.0001 1
#> learner_3 0.2503 -0.0015 0
#> nnls1 0.2499 0.0001 NA
#>
#> Note: Ensemble MSPE and R2 for short-stacking rely on full-sample weights
#> and represent in-sample fit over cross-fitted base predictions.For formal statistical tests of whether learner performance
differences are significant, see
vignette("stacking_diagnostics").
ddml supports multiple schemes for constructing linear
combinations of base learners. Since each of them relies on the
out-of-sample predictions of the base learners, it is computationally
cheap to compute them simultaneously. The below snippet estimates the
LATE using the base learners in four different linear combinations:
'nls' constraints the coefficients of each base learner
to be non-negative'singlebest' selects the single MSPE-minimizing base
learner'ols' constructs unconstrained linear combinations of
base learners'average' computes an unweighted average of base
learners# Estimate the local average treatment effect using short-stacking with base
# learners ols, lasso, ridge, and xgboost.
late_fit <- ddml_late(y, D, Z, X,
learners = learners_multiple,
ensemble_type = c('nnls', 'singlebest',
'ols', 'average'),
shortstack = TRUE,
sample_folds = 3,
silent = TRUE)
summary(late_fit)
#> DDML estimation: Local Average Treatment Effect
#> Obs: 5000 Folds: 3 Stacking: short-stack
#>
#> Ensemble type: nnls
#> Estimate Std. Error z value Pr(>|z|)
#> LATE -0.231 0.189 -1.22 0.22
#>
#> Ensemble type: singlebest
#> Estimate Std. Error z value Pr(>|z|)
#> LATE -0.227 0.188 -1.21 0.23
#>
#> Ensemble type: ols
#> Estimate Std. Error z value Pr(>|z|)
#> LATE -0.231 0.189 -1.22 0.22
#>
#> Ensemble type: average
#> Estimate Std. Error z value Pr(>|z|)
#> LATE -0.228 0.189 -1.21 0.23The learner interface and stacking workflow are identical across all
ddml estimators. For example, the partially linear model
(PLM) estimates the effect of D on y while
nonparametrically controlling for X – without requiring an
instrument. The same set of base learners can be reused directly:
plm_fit <- ddml_plm(y, D, X,
learners = learners_multiple,
ensemble_type = 'nnls1',
shortstack = TRUE,
sample_folds = 3,
silent = TRUE)
summary(plm_fit)
#> DDML estimation: Partially Linear Model
#> Obs: 5000 Folds: 3 Stacking: short-stack
#>
#> Estimate Std. Error z value Pr(>|z|)
#> D1 -1.46e-01 1.47e-02 -9.94 <2e-16 ***
#> (Intercept) 7.46e-05 6.89e-03 0.01 0.99
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Beyond ddml_late and ddml_plm, the package
supports several other causal models:
ddml_ate – average treatment effect (see
?ddml_ate)ddml_att – average treatment effect on the treated (see
?ddml_att)ddml_pliv – partially linear IV model (see
?ddml_pliv)ddml_fpliv – flexible partially linear IV model (see
?ddml_fpliv)Because Double/Debiased Machine Learning relies on random sample
splits, point estimates contain splitting variance. To
stabilize results, ddml_replicate() repeats the
cross-fitting procedure multiple times and aggregates the estimates (see
Ahrens et al., 2024):
plm_rep <- ddml_replicate(ddml_plm,
y = y, D = D, X = X,
learners = learners_multiple,
ensemble_type = 'nnls1',
shortstack = TRUE,
sample_folds = 3,
resamples = 5,
silent = TRUE)
summary(plm_rep)
#> DDML estimation: Partially Linear Model
#> Obs: 5000 Folds: 3 Resamples: 5 Aggregation: median Stacking: short-stack
#>
#> Estimate Std. Error z value Pr(>|z|)
#> D1 -1.48e-01 1.47e-02 -10.1 <2e-16 ***
#> (Intercept) 7.04e-06 6.88e-03 0.0 1
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1For a detailed discussion of aggregation methods and the
ddml_rep class, see
vignette("repeated_resampling").
Check out the following articles to learn more:
vignette("stacking") discusses computational benefits
of short-stackingvignette("stacking_diagnostics") shows how to evaluate
base learners and test for equal predictive abilityvignette("repeated_resampling") demonstrates robust
inference via repeated cross-fittingvignette("modelsummary_integration") illustrates
integration with broom and modelsummaryvignette("new_ml_wrapper") shows how to write
user-provided base learnersvignette("sparse") illustrates support of sparse
matrices (see ?Matrix)vignette("did") discusses integration with the
diff-in-diff package didFor additional applied examples, see:
vignette("example_401k") on the effect of 401k
participation on retirement savingsvignette("example_BLP95") on flexible demand estimation
with endogenous pricesAhrens A, Chernozhukov V, Hansen C B, Kozbur D, Schaffer M E, Wiemann T (2026). “An Introduction to Double/Debiased Machine Learning.” Journal of Economic Literature, forthcoming.
Ahrens A, Hansen C B, Schaffer M E, Wiemann T (2024). “Model Averaging and Double Machine Learning.” Journal of Applied Econometrics, 40(3): 249-269.
Angrist J, Evans W (1998). “Children and Their Parents’ Labor Supply: Evidence from Exogenous Variation in Family Size.” American Economic Review, 88(3), 450-477.
Wolpert D H (1992). “Stacked generalization.” Neural Networks, 5(2), 241-259.