DoubleML - An Object-Oriented Implementation of Double Machine Learning in R ∗



Introduction
Structural equation models provide a quintessential framework for conducting causal inference in statistics, econometrics, machine learning (ML), and other data sciences. The package DoubleML for R (R Core Team, 2020) implements partially linear and interactive structural equation and treatment effect models with high-dimensional confounding variables as considered in Chernozhukov et al. (2018). Estimation and tuning of the machine learning models are based on the powerful functionalities provided by the mlr3 package and the mlr3 ecosystem (Lang et al., 2019). Key elements of double machine learning (DML) models are the score functions identifying the estimates of the target parameter. These functions play an essential role for valid inference with machine learning methods because they have to satisfy a property called Neyman orthogonality. With the score functions as key elements, DoubleML implements double machine learning in a very general way using object orientation based on the R6 package (Chang, 2020). Currently, DoubleML implements the double / debiased machine learning framework as established in Chernozhukov et al. (2018) for

• partially linear regression models (PLR),
• partially linear instrumental variable regression models (PLIV),
• interactive regression models (IRM), and
• interactive instrumental variable regression models (IIVM).
The object-oriented implementation of DoubleML is very flexible. The model classes DoubleMLPLR, DoubleMLPLIV, DoubleMLIRM and DoubleMLIIVM implement the estimation of the nuisance functions via machine learning methods and the computation of the Neyman-orthogonal score function. All other functionalities are implemented in the abstract base class DoubleML, including estimation of causal parameters, standard errors, t-tests, confidence intervals, as well as valid simultaneous inference through adjustments of p-values and estimation of joint confidence regions based on a multiplier bootstrap procedure. In combination with the estimation and tuning functionalities of mlr3 and its ecosystem, this object-oriented implementation enables a high flexibility for the model specification in terms of

• the machine learning methods for estimation of the nuisance functions,
• the resampling schemes,
• the double machine learning algorithm, and
• the Neyman-orthogonal score functions.
It can further be readily extended with regard to

• new model classes that come with Neyman-orthogonal score functions being linear in the target parameter,
• alternative score functions via callables, and
• customized resampling schemes.
Several other packages for estimation of causal effects based on machine learning methods exist for R. Probably the most popular packages are the grf package (Tibshirani et al., 2020), which implements generalized random forests (Athey et al., 2019), the package hdm (Chernozhukov et al., 2016) for inference based on the lasso estimator, and the hdi package (Dezeure et al., 2015) for inference in high-dimensional models. Previous implementations of the double machine learning (DML) framework of Chernozhukov et al. (2018) have been provided by the postDoubleR package (Szitas, 2019), the package dmlmt (Knaus, 2021) with a focus on lasso estimation, and causalDML (Knaus, 2020) for estimation of treatment effects under unconfoundedness. In Python, EconML (Microsoft Research, 2019) offers an implementation of the double machine learning framework for heterogeneous effects. We would like to mention that the R package DoubleML was developed together with a Python twin (Bach et al., 2020) that is based on scikit-learn (Pedregosa et al., 2011). The Python package is also available via GitHub, the Python Package Index (PyPI), and conda-forge. Moreover, Kurz (2021) provides a serverless implementation of the Python package DoubleML.
The rest of the paper is structured as follows: In Section 2, we briefly demonstrate how to install the DoubleML package and give a short motivating example to illustrate the major idea behind the double machine learning approach. Section 3 introduces the main causal model classes implemented in DoubleML. Section 4 shortly summarizes the main ideas behind the double machine learning approach and reviews the key ingredients required for valid inference based on machine learning methods. Section 5 presents the main steps and algorithms of the double machine learning procedure for inference on one or multiple target parameters. Section 6 provides more detailed insights on the implemented classes and methods of DoubleML. Section 7 contains real-data and simulation examples for estimation of causal parameters using the DoubleML package. Additionally, this section provides a brief simulation study that illustrates the validity of the implemented methods in finite samples. Section 8 concludes the paper. The code output that has been suppressed in the main text and further information regarding the simulations are presented in the Appendix. To make the code examples fully reproducible, the entire code is available online.

Getting Started

Installation
The latest CRAN release of DoubleML can be installed using the command

install.packages("DoubleML")

Alternatively, the development version can be downloaded and installed from the GitHub repository using the command (previous installation of the remotes package is required)

remotes::install_github("DoubleML/doubleml-for-r")

Among others, DoubleML depends on the R package R6 for the object-oriented implementation, data.table (Dowle and Srinivasan, 2020) for the underlying data structure, as well as the packages mlr3 (Lang et al., 2019), mlr3learners (Lang et al., 2020a) and mlr3tuning (Becker et al., 2020) for estimation of machine learning methods, model tuning and parameter handling. Moreover, the underlying packages of the machine learning methods that are called in mlr3 or mlr3learners must be installed, for example the packages glmnet (Friedman et al., 2010) for lasso estimation or ranger (Wright and Ziegler, 2017) for random forests.
After completed installation, the package can be loaded:

library(DoubleML)

A Motivating Example: Basics of Double Machine Learning
In the following, we provide a brief summary of and motivation to double machine learning methods and show how the corresponding methods provided by the DoubleML package can be applied. The data generating process (DGP) is based on the introductory example in Chernozhukov et al. (2018). We consider a partially linear model: Our major interest is to estimate the causal parameter θ in the regression equation

y_i = θ d_i + g_0(x_i) + ζ_i,  ζ_i ~ N(0, 1),

with covariates x_i ~ N(0, Σ), where Σ is a matrix with entries Σ_kj = 0.7^|j−k|. In the following, the regression relationship between the treatment variable d_i and the covariates x_i will play an important role:

d_i = m_0(x_i) + v_i,  v_i ~ N(0, 1).

The nuisance functions m_0 and g_0 are given by

m_0(x_i) = x_{i,1} + 0.25 exp(x_{i,3}) / (1 + exp(x_{i,3})),
g_0(x_i) = exp(x_{i,1}) / (1 + exp(x_{i,1})) + 0.25 x_{i,3}.

We construct a setting with n = 500 observations and p = 20 explanatory variables to demonstrate the use of the estimators provided in DoubleML. Moreover, we set the true value of the parameter θ to θ = 0.5. The corresponding data generating process is implemented in the function make_plr_CCDDHNR2018(). We start by generating a realization of a data set as a data.table object, which is subsequently used to create an instance of the data-backend of class DoubleMLData.
library(DoubleML)
alpha = 0.5
n_obs = 500
n_vars = 20
set.seed(1234)
data_plr = make_plr_CCDDHNR2018(alpha = alpha, n_obs = n_obs, dim_x = n_vars,
                                return_type = "data.table")

The data-backend implements the causal model: We specify that we perform inference on the effect of the treatment variable d_i on the dependent variable y_i.

obj_dml_data = DoubleMLData$new(data_plr, y_col = "y", d_cols = "d")

In the next step, we choose the machine learning method as an object of class Learner from mlr3, mlr3learners (Lang et al., 2020a) or mlr3extralearners (Sonabend and Schratz, 2020). As we will point out later, we have to estimate two nuisance parts in order to perform valid inference in the partially linear regression model. Hence, we have to specify two learners. Moreover, we split the sample into two folds used for cross-fitting.
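The remaining steps of this setup can be sketched as follows. This is a hedged sketch, not the verbatim code of the paper: it assumes the packages DoubleML, mlr3, mlr3learners, and ranger are installed, passes the two learners positionally (the learner argument names differ across package versions), and the random forest hyperparameters are illustrative choices.

```r
# Sketch: specify two nuisance learners and fit the PLR with cross-fitting.
# Assumes DoubleML, mlr3, mlr3learners and ranger are installed.
library(DoubleML)
library(mlr3)
library(mlr3learners)
lgr::get_logger("mlr3")$set_threshold("warn")  # suppress mlr3 logging

set.seed(1234)
data_plr = make_plr_CCDDHNR2018(alpha = 0.5, n_obs = 500, dim_x = 20,
                                return_type = "data.table")
obj_dml_data = DoubleMLData$new(data_plr, y_col = "y", d_cols = "d")

# one learner per nuisance part: l_0(X) = E[Y|X] and m_0(X) = E[D|X]
learner = lrn("regr.ranger", num.trees = 100, mtry = 20,
              min.node.size = 2, max.depth = 5)
dml_plr = DoubleMLPLR$new(obj_dml_data, learner$clone(), learner$clone(),
                          n_folds = 2)
dml_plr$fit()
dml_plr$summary()
```

Calling fit() runs the cross-fitted double machine learning procedure; summary() reports the coefficient estimate, standard error, and t-test for θ.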

Key Causal Models
DoubleML provides estimation of causal effects in four different models: Partially linear regression models (PLR), partially linear instrumental variable regression models (PLIV), interactive regression models (IRM) and interactive instrumental variable regression models (IIVM). We briefly introduce each of these models below.

Partially Linear Regression Model (PLR)
Partially linear regression models (PLR), which encompass the standard linear regression model, play an important role in data analysis (Robinson, 1988). Partially linear regression models take the form

Y = D θ_0 + g_0(X) + ζ,  E[ζ | D, X] = 0,  (1)
D = m_0(X) + V,  E[V | X] = 0,  (2)

where Y is the outcome variable and D is the policy variable of interest. The high-dimensional vector X = (X_1, . . . , X_p) consists of other confounding covariates, and ζ and V are stochastic errors. Equation (1) is the equation of interest, and θ_0 is the main regression coefficient that we would like to infer. If D is conditionally exogenous (randomly assigned conditional on X), θ_0 has the interpretation of a structural or causal parameter. The causal diagram supporting such interpretation is shown in Figure 1. The second equation keeps track of confounding, namely the dependence of D on covariates/controls. The characteristics X affect the policy variable D via the function m_0(X) and the outcome variable via the function g_0(X). The partially linear model generalizes both linear regression models, where the functions g_0 and m_0 are linear with respect to a dictionary of basis functions in X, and approximately linear models.

Figure 1: A causal diagram underlying Equations (1)-(2) and (5)-(6) under conditional exogeneity. Note that the causal link between D and Y is one-directional. Identification of the causal effect is confounded by X, and identification is achieved via V, which captures variation in D that is independent of X. Methods to estimate the causal effect of D must therefore approximately remove the effect of the high-dimensional X on Y and D.

Partially Linear Instrumental Variable Regression Model (PLIV)
We next consider the partially linear instrumental variable regression model:

Y − D θ_0 = g_0(X) + ζ,  E[ζ | Z, X] = 0,  (3)
Z = m_0(X) + V,  E[V | X] = 0.  (4)

Note that this model is not a regression model unless Z = D. Model (3)-(4) is a canonical model in causal inference, going back to Wright (1928), with the modern difference being that g_0 and m_0 are nonlinear, potentially complicated functions of high-dimensional X. The idea of this model is that there is a structural or causal relation between Y and D, captured by θ_0, and g_0(X) + ζ is the stochastic error, partly explained by covariates X. V and ζ are stochastic errors that are not explained by X. Since Y and D are jointly determined, we need an external factor, commonly referred to as an instrument, Z, to create exogenous variation in D. Note that Z should affect D. The X here serve again as confounding factors, so we can think of variation in Z as being exogenous only conditional on X.
A simple contextual example is from biostatistics (Permutt and Hebel, 1989), where Y is a health outcome and D is an indicator of smoking. Thus, θ 0 captures the effect of smoking on health. Health outcome Y and smoking behavior D are treated as being jointly determined. X represents patient characteristics, and Z could be a doctor's advice not to smoke (or another behavioral treatment) that may affect the outcome Y only through shifting the behavior D, conditional on characteristics X.

Interactive Regression Model (IRM)
We consider estimation of average treatment effects when treatment effects are fully heterogeneous and the treatment variable is binary, D ∈ {0, 1}. We consider vectors (Y, D, X) such that

Y = g_0(D, X) + U,  E[U | X, D] = 0,  (5)
D = m_0(X) + V,  E[V | X] = 0.  (6)

Since D is not additively separable, this model is more general than the partially linear model for the case of binary D. A common target parameter of interest in this model is the average treatment effect (ATE),

θ_0 = E[g_0(1, X) − g_0(0, X)].

Another common target parameter is the average treatment effect for the treated (ATTE),

θ_0 = E[g_0(1, X) − g_0(0, X) | D = 1].

Figure 2: A causal diagram underlying Equations (3)-(4) and (7)-(8) under conditional exogeneity of Z. Note that the causal link between D and Y is bi-directional, so an instrument Z is needed for identification. Identification is achieved via V, which captures variation in Z that is independent of X. Equations (3) and (4) do not model the dependence of D on X and Z, though a necessary condition for identification is that Z and D are related after conditioning on X. Methods to estimate the causal effect of D must approximately remove the effect of the high-dimensional X on Y, D, and Z. Removing the confounding effect of X is done implicitly by the proposed procedure.
In business applications, the ATTE is often the main interest, as it captures the treatment effect for those who have been affected by the treatment. A difference of the ATTE from the ATE might arise if the characteristics of the treated individuals differ from those of the general population.
The confounding factors X affect the policy variable via the propensity score m 0 (X) and the outcome variable via the function g 0 (X). Both of these functions are unknown and potentially complex, and we can employ ML methods to learn them.
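Estimation of the ATE in the IRM can be sketched along the same lines as the PLR example. This is a hedged sketch assuming the packages DoubleML, mlr3, mlr3learners, and ranger are installed; make_irm_data() is the package's built-in IRM simulation, the hyperparameters are illustrative, and the argument names follow the package documentation.

```r
# Sketch: ATE estimation in the IRM with a regression learner for g_0(D, X)
# and a classification learner for the propensity score m_0(X).
library(DoubleML)
library(mlr3)
library(mlr3learners)
lgr::get_logger("mlr3")$set_threshold("warn")

set.seed(42)
df_irm = make_irm_data(theta = 0.5, n_obs = 1000, dim_x = 20,
                       return_type = "data.table")
obj_irm = DoubleMLData$new(df_irm, y_col = "y", d_cols = "d")
dml_irm = DoubleMLIRM$new(obj_irm,
                          ml_g = lrn("regr.ranger", num.trees = 100),
                          ml_m = lrn("classif.ranger", num.trees = 100),
                          score = "ATE")
dml_irm$fit()
dml_irm$summary()
```

Setting score = "ATTE" instead would target the average treatment effect for the treated with the same interface.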

Interactive Instrumental Variable Model (IIVM)
We consider estimation of local average treatment effects (LATE) with a binary treatment variable, D ∈ {0, 1}, and a binary instrument, Z ∈ {0, 1}. As before, Y denotes the outcome variable, and X is the vector of covariates. Here the structural equation model is:

Y = l_0(D, X) + ζ,  E[ζ | Z, X] = 0,  (7)
Z = m_0(X) + V,  E[V | X] = 0.  (8)

Consider the functions g_0, r_0, and m_0, where g_0 maps the support of (Z, X) to R and r_0 and m_0, respectively, map the support of (Z, X) and X to (ε, 1 − ε) for some ε ∈ (0, 1/2), such that

g_0(Z, X) = E[Y | Z, X],  r_0(Z, X) = E[D | Z, X],  m_0(X) = P(Z = 1 | X).

We are interested in estimating

θ_0 = (E[g_0(1, X)] − E[g_0(0, X)]) / (E[r_0(1, X)] − E[r_0(0, X)]).
Under the well-known assumptions of Imbens and Angrist (1994), θ_0 is the LATE, that is, the average treatment effect for compliers, in other words, those observations that would have D = 1 if Z were 1 and would have D = 0 if Z were 0.

Figure 3: Left panel: Histogram of the studentized naive estimator θ̂_0^naive, based on estimation of g_0 and m_0 with random forests and a non-orthogonal score function. Data sets are simulated according to the data generating process in Section 2.2. Data generation and estimation are repeated 1000 times. Right panel: Histogram of the studentized DML estimator θ̂_0, based on estimation of g_0 and m_0 with random forests and the orthogonal score function provided in Equation (17). Note that the simulated data sets and parameters of the random forest learners are identical to those underlying the left panel.

Basic Idea behind Double Machine Learning for the PLR Model
Here we provide an intuitive discussion of how double machine learning works in the first model, the partially linear regression model. Naive application of machine learning methods directly to equations (1)-(2) may have a very high bias. Indeed, it can be shown that small biases in the estimation of g_0, which are unavoidable in high-dimensional estimation, create a bias in the naive estimate of the main effect, θ̂_0^naive, which is sufficiently large to cause failure of conventional inference. The left panel in Figure 3 illustrates this phenomenon. The histogram presents the empirical distribution of the studentized estimator θ̂_0^naive, as obtained in 1000 independent repetitions of the data generating process presented in Section 2.2. The functions g_0 and m_0 in the PLR model are estimated with random forest learners, and the corresponding predictions are then plugged into a non-orthogonal score function. The regularization performed by the random forest learner leads to a bias in the estimation of g_0 and m_0. Due to non-orthogonality of the score, this translates into a considerable bias of the main estimator θ̂_0^naive: The distribution of the studentized estimator θ̂_0^naive is shifted to the left of the origin and differs substantially from the normal distribution that would be obtained if the regularization bias were negligible, as shown by the red curve.
The PLR model above can be rewritten in the following residualized form:

W = V θ_0 + ζ,  E[ζ | V, X] = 0,

where

W = Y − l_0(X),  l_0(X) = E[Y | X],
V = D − m_0(X),  m_0(X) = E[D | X].

The variables W and V represent the original variables after taking out, or partialling out, the effect of X. Note that θ_0 is identified by this equation if V has a non-zero variance.
Given identification, double machine learning for a PLR proceeds as follows:

(1) Estimate l_0 and m_0 by l̂_0 and m̂_0, which amounts to solving the two problems of predicting Y and D using X with any generic ML method, giving us the estimated residuals

Ŵ = Y − l̂_0(X)  and  V̂ = D − m̂_0(X).

The residuals should be of a cross-validated form, as explained below in Algorithm 1 or 2, to avoid biases from overfitting.
(2) Estimate θ_0 by regressing the residual Ŵ on V̂. Use the conventional inference for this regression estimator, ignoring the estimation error in the residuals.
The reason we work with this residualized form is that it eliminates the bias arising from solving the prediction problems in stage (1). The estimates l̂_0 and m̂_0 carry a regularization bias due to having to solve prediction problems well in high dimensions. However, the nature of the estimating equation for θ_0 is such that these biases are eliminated to first order, as explained below. This results in a high-quality, low-bias estimator θ̂_0 of θ_0, as illustrated in the right panel of Figure 3. The estimator is adaptive in the sense that the first-stage estimation errors do not affect the second-stage errors.

Key Ingredients of the Double Machine Learning Inference Approach
Our goal is to construct high-quality point and interval estimators for θ 0 when X is high-dimensional and we employ machine learning methods to estimate the nuisance functions such as g 0 and m 0 . Example ML methods include lasso, random forests, boosted trees, deep neural networks, and ensembles or aggregated versions of these methods.
We shall use a method-of-moments estimator for θ_0 based upon the empirical analog of the moment condition

E[ψ(W; θ_0, η_0)] = 0,  (15)

where we call ψ the score function, W = (Y, D, X, Z), θ_0 is the parameter of interest, and η denotes nuisance functions with population value η_0.
The first key input of the inference procedure is using a score function ψ(W; θ, η) that satisfies (15), with θ_0 being the unique solution, and that obeys the Neyman orthogonality condition

∂_η E[ψ(W; θ_0, η)] |_{η = η_0} = 0.  (16)

Neyman orthogonality (16) ensures that the moment condition (15) used to identify and estimate θ_0 is insensitive to small perturbations of the nuisance function η around η_0. The derivative ∂_η denotes the pathwise (Gateaux) derivative operator.
Using a Neyman-orthogonal score eliminates the first-order biases arising from the replacement of η_0 with an ML estimator η̂_0. Eliminating this bias is important because the estimators η̂_0 must be heavily regularized in high-dimensional settings to be good estimators of η_0, and so these estimators will be biased in general. The Neyman orthogonality property is responsible for the adaptivity of these estimators, namely, their approximate distribution will not depend on the fact that the estimate η̂_0 contains error, provided the latter is mild.
The right panel of Figure 3 presents the empirical distribution of the studentized DML estimator θ̂_0 that is based on an orthogonal score. Note that estimation is performed on the identical simulated data sets and with the same machine learning method as for the naive estimator, which is displayed in the left panel.
The histogram of the studentized estimator θ̂_0 illustrates the favorable performance of the double machine learning estimator, which is based on an orthogonal score: The DML estimator is robust to the bias generated by regularization. The estimator is approximately unbiased, the studentized statistic is concentrated around 0, and its distribution is well approximated by the normal distribution.
• PLR score: In the PLR model, we can employ two alternative score functions. We will briefly indicate the option for initialization of a model object in DoubleML to clarify how each score can be implemented. Using the option score = "partialling out" leads to estimation with the score function

ψ(W; θ, η) = (Y − l(X) − θ (D − m(X))) (D − m(X)),  (17)

where W = (Y, D, X) and l and m are P-square-integrable functions mapping the support of X to R, whose true values are given by

l_0(X) = E[Y | X],  m_0(X) = E[D | X].

Alternatively, it is possible to use the following score function for the PLR via the option score = "IV-type":

ψ(W; θ, η) = (Y − D θ − g(X)) (D − m(X)),

with g and m being P-square-integrable functions mapping the support of X to R, whose true values are given by

g_0(X) = E[Y − D θ_0 | X],  m_0(X) = E[D | X].

The scores above are Neyman-orthogonal by elementary calculations. Now it is possible to see the connection to the residualized system of equations presented in Section 4.1.
• PLIV score: In the PLIV model, we employ the score function (score = "partialling out")

ψ(W; θ, η) = (Y − l(X) − θ (D − r(X))) (Z − m(X)),

where W = (Y, D, X, Z) and l, m, and r are P-square-integrable functions mapping the support of X to R, whose true values are given by

l_0(X) = E[Y | X],  m_0(X) = E[Z | X],  r_0(X) = E[D | X].

• IRM score: For estimation of the ATE parameter of the IRM model, we employ the score (score = "ATE")

ψ(W; θ, η) = g(1, X) − g(0, X) + D (Y − g(1, X)) / m(X) − (1 − D) (Y − g(0, X)) / (1 − m(X)) − θ,

where W = (Y, D, X) and g and m map the support of (D, X) to R and the support of X to (ε, 1 − ε), respectively, for some ε ∈ (0, 1/2), whose true values are given by

g_0(D, X) = E[Y | D, X],  m_0(X) = P(D = 1 | X).

This orthogonal score is based on the influence function for the mean for missing data from Robins and Rotnitzky (1995). For estimation of the ATTE parameter in the IRM, we use the score (score = "ATTE")

ψ(W; θ, η) = D (Y − g(0, X)) / p − m(X) (1 − D) (Y − g(0, X)) / (p (1 − m(X))) − θ D / p,

where p_0 = P(D = 1). Note that this score does not require estimating g_0(1, X).
• IIVM score: To estimate the LATE parameter in the IIVM, we use the score (score = "LATE")

ψ(W; θ, η) = g(1, X) − g(0, X) + Z (Y − g(1, X)) / m(X) − (1 − Z) (Y − g(0, X)) / (1 − m(X))
  − ( r(1, X) − r(0, X) + Z (D − r(1, X)) / m(X) − (1 − Z) (D − r(0, X)) / (1 − m(X)) ) θ,

where W = (Y, D, X, Z) and the nuisance parameter η = (g, m, r) consists of P-square-integrable functions g, m, and r, with g mapping the support of (Z, X) to R and r and m, respectively, mapping the support of (Z, X) and X to (ε, 1 − ε) for some ε ∈ (0, 1/2).
The second key input is the use of high-quality machine learning estimators for the nuisance parameters.
For instance, in the PLR model, we need to have access to consistent estimators of g_0 and m_0 with respect to the L_2(P) norm ‖·‖_{P,2}; a sufficient condition is that both estimators converge at the rate o(N^{−1/4}), so that the product of their estimation errors vanishes faster than N^{−1/2}. In the PLIV model, the sufficient condition is analogous, additionally requiring r̂_0 to converge at the same rate. These conditions are plausible for many ML methods. Different structured assumptions on η_0 lead to the use of different machine learning tools for estimating η_0, as listed in Chernozhukov et al. (2018, pp. 22-23):

1. The assumption of approximate or exact sparsity for η_0 with respect to some dictionary calls for the use of sparsity-based machine learning methods, for example the lasso estimator, post-lasso, l_2-boosting, or forward selection, among others.
2. The assumption of density of η_0 with respect to some dictionary calls for density-based estimators such as ridge regression. Mixed structures based on sparsity and density suggest the use of the elastic net or lava.
3. If η_0 can be well approximated by tree-based methods, regression trees and random forests are suitable.
4. If η_0 can be well approximated by sparse, shallow, or deep neural networks, l_1-penalized neural networks, shallow neural networks, or deep neural networks are attractive.
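In DoubleML, each of these structured assumptions maps to a choice of mlr3 learner objects for the nuisance parts. The following sketch assumes mlr3, mlr3learners, glmnet, and ranger are installed; the learner keys and hyperparameters follow the mlr3learners documentation and are illustrative choices.

```r
# Sketch: mlr3 learner objects matching the structured assumptions above.
library(mlr3)
library(mlr3learners)

ml_lasso  = lrn("regr.cv_glmnet", s = "lambda.min")  # sparsity: cross-validated lasso
ml_ridge  = lrn("regr.cv_glmnet", alpha = 0)         # density: ridge penalty
ml_forest = lrn("regr.ranger", num.trees = 500)      # tree structure: random forest
```

Any of these objects can be passed as a nuisance learner when constructing a DoubleML model object, so switching the structural assumption amounts to swapping one learner for another.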
For most of these ML methods, performance guarantees are available that make it possible to satisfy the theoretical requirements. Moreover, if η 0 can be well approximated by at least one model mentioned in the list above, ensemble or aggregated methods can be used. Ensemble and aggregation methods ensure that the performance guarantee is approximately no worse than the performance of the best method.
The third key input is to use a form of sample splitting at the stage of producing the estimator of the main parameter θ 0 , which allows us to avoid biases arising from overfitting.
Biases arising from overfitting could result from using highly complex fitting methods such as boosting, random forests, ensemble, and hybrid machine learning methods. We specifically use cross-fitted forms of the empirical moments, as detailed below in Algorithms 1 and 2, in the estimation of θ_0. If we do not perform sample splitting and the ML estimates overfit, we may end up with very large biases. This is illustrated in Figure 4. The left panel shows the histogram of a studentized estimator θ̂_0^nosplit obtained from solving the orthogonal score of Equation (17) without sample splitting. All observations are used to learn the functions g_0 and m_0 in the PLR model and to solve the score equation (1/N) Σ_i ψ(W_i; θ̂_0^nosplit, η̂_0) = 0. Consequently, the overfitting bias leads to a considerable shift of the empirical distribution to the left. The double machine learning estimator underlying the histogram in the right panel is obtained with cross-fitting according to Algorithm 2. The sample-splitting procedure makes it possible to eliminate the bias induced by overfitting.

Figure 4: Left panel: Histogram of the studentized estimator θ̂_0^nosplit, based on estimation of g_0 and m_0 with random forests and a procedure without sample splitting: The entire data set is used for learning the nuisance terms and for solving the orthogonal score. Data sets are simulated according to the data generating process in Section 2.2. Data generation and estimation are repeated 1000 times. Right panel: Histogram of the studentized DML estimator θ̂_0, based on estimation of g_0 and m_0 with random forests and the cross-fitting procedure described in Algorithm 2. Note that the simulated data sets and parameters of the random forest learners are identical to those underlying the left panel.

Double Machine Learning for Estimation of a Causal Parameter
We assume that we have a sample (W_i)_{i=1}^N, modeled as i.i.d. copies of W = (Y, D, Z, X), whose law is determined by the probability measure P. We assume that N is divisible by K in order to simplify the notation. Let E_N denote the empirical expectation, E_N[ψ(W)] = (1/N) Σ_{i=1}^N ψ(W_i).

Algorithm 1: DML1 (generic double machine learning with cross-fitting).
(1) Inputs: Choose a model (PLR, PLIV, IRM, IIVM), provide data (W_i)_{i=1}^N, a Neyman-orthogonal score function ψ(W; θ, η), which depends on the model being estimated, and specify machine learning methods for estimating the nuisance functions η.
(2) Train ML predictors on folds: Take a K-fold random partition (I_k)_{k=1}^K of the observation indices [N] = {1, . . . , N} such that the size of each fold I_k is n = N/K. For each k ∈ [K] = {1, . . . , K}, construct an ML estimator η̂_0,k that uses only the observations outside fold I_k.
(3) For each k ∈ [K], construct the estimator θ̂_0,k as the solution to the equation

(1/n) Σ_{i ∈ I_k} ψ(W_i; θ̂_0,k, η̂_0,k) = 0.

The estimate of the causal parameter is obtained via aggregation:

θ̃_0 = (1/K) Σ_{k=1}^K θ̂_0,k.

(4) Output: The estimate of the causal parameter θ̃_0 as well as the values of the evaluated score function are returned.
Algorithm 2: DML2 (generic double machine learning with cross-fitting).
(1) Inputs: As in Algorithm 1 (DML1).
(2) Train ML predictors on folds: Take a K-fold random partition (I_k)_{k=1}^K of the observation indices [N] = {1, . . . , N} such that the size of each fold I_k is n = N/K. For each k ∈ [K], construct an ML estimator η̂_0,k that uses only the observations outside fold I_k.
(3) Construct the estimator for the causal parameter θ̂_0 as the solution to the equation

(1/N) Σ_{k=1}^K Σ_{i ∈ I_k} ψ(W_i; θ̂_0, η̂_0,k) = 0.

(4) Output: The estimate of the causal parameter θ̂_0 as well as the values of the evaluated score function are returned.
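The steps of DML2 for the PLR with the "partialling out" score can be sketched in a few lines of base R. Here lm() stands in for a generic ML learner, and the simple linear DGP is purely illustrative (it is not make_plr_CCDDHNR2018()):

```r
# Base-R sketch of Algorithm 2 (DML2) for the PLR "partialling out" score.
set.seed(42)
n <- 500; p <- 5; theta0 <- 0.5
X <- matrix(rnorm(n * p), n, p)
d <- as.vector(X %*% rep(0.3, p) + rnorm(n))
y <- as.vector(theta0 * d + X %*% rep(0.3, p) + rnorm(n))

dml2_plr <- function(y, d, X, K = 5) {
  folds <- sample(rep(seq_len(K), length.out = length(y)))  # K-fold partition
  res_y <- res_d <- numeric(length(y))
  for (k in seq_len(K)) {
    test <- folds == k
    # nuisance fits l_0(X) = E[Y|X] and m_0(X) = E[D|X] on out-of-fold data only
    fit_l <- lm.fit(cbind(1, X[!test, , drop = FALSE]), y[!test])
    fit_m <- lm.fit(cbind(1, X[!test, , drop = FALSE]), d[!test])
    res_y[test] <- y[test] - cbind(1, X[test, , drop = FALSE]) %*% fit_l$coefficients
    res_d[test] <- d[test] - cbind(1, X[test, , drop = FALSE]) %*% fit_m$coefficients
  }
  # DML2: solve (1/N) sum_i (res_y_i - theta * res_d_i) * res_d_i = 0 for theta
  sum(res_y * res_d) / sum(res_d^2)
}
theta_hat <- dml2_plr(y, d, X)
```

Because the score is linear in θ, step (3) reduces to the closed-form ratio in the last line; with a nonlinear score, a root-finding step would replace it.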

Remark 1 (Linear scores)
The scores for the models PLR, PLIV, IRM, and IIVM are linear in θ, having the form

ψ(W; θ, η) = ψ_a(W; η) θ + ψ_b(W; η),

hence the estimator θ̂_0 for DML2 (θ̂_0,k for DML1) takes the closed form

θ̂_0 = − ( Σ_{k=1}^K Σ_{i ∈ I_k} ψ_a(W_i; η̂_0,k) )^{−1} Σ_{k=1}^K Σ_{i ∈ I_k} ψ_b(W_i; η̂_0,k).

The linear score representations of the PLR, PLIV, IRM and IIVM are

• PLR with score = "partialling out":
  ψ_a(W; η) = −(D − m(X))^2,  ψ_b(W; η) = (Y − l(X)) (D − m(X)).
• PLR with score = "IV-type":
  ψ_a(W; η) = −D (D − m(X)),  ψ_b(W; η) = (Y − g(X)) (D − m(X)).
• PLIV with score = "partialling out":
  ψ_a(W; η) = −(D − r(X)) (Z − m(X)),  ψ_b(W; η) = (Y − l(X)) (Z − m(X)).
• IRM with score = "ATE":
  ψ_a(W; η) = −1,
  ψ_b(W; η) = g(1, X) − g(0, X) + D (Y − g(1, X)) / m(X) − (1 − D) (Y − g(0, X)) / (1 − m(X)).
• IRM with score = "ATTE":
  ψ_a(W; η) = −D / p,
  ψ_b(W; η) = D (Y − g(0, X)) / p − m(X) (1 − D) (Y − g(0, X)) / (p (1 − m(X))).
• IIVM with score = "LATE":
  ψ_a(W; η) = −( r(1, X) − r(0, X) + Z (D − r(1, X)) / m(X) − (1 − Z) (D − r(0, X)) / (1 − m(X)) ),
  ψ_b(W; η) = g(1, X) − g(0, X) + Z (Y − g(1, X)) / m(X) − (1 − Z) (Y − g(0, X)) / (1 − m(X)).

Remark 2 (Sample splitting). In Step (2) of the Algorithms DML1 and DML2, the estimator η̂_0,k can generally be an ensemble or aggregation of several estimators as long as we only use the data (W_i)_{i ∉ I_k} outside the k-th fold to construct the estimators.
Remark 3 (Recommendation). We have found K = 4 or K = 5 to work better than K = 2 in a variety of empirical examples and in simulations. The default for the option n_folds, which implements the value of K, is n_folds = 5. Moreover, we generally recommend repeating the estimation procedure multiple times and using the estimates and standard errors as aggregated over the multiple repetitions, as described in Chernozhukov et al. (2018, pp. 30-31). This aggregation is automatically executed if the number of repetitions n_rep is set to a value larger than 1.
The properties of the estimator are as follows.
Theorem 1. There exist regularity conditions such that the estimator θ̂_0 concentrates in a 1/√N-neighborhood of θ_0 and the sampling error √N(θ̂_0 − θ_0) is approximately normal,

√N(θ̂_0 − θ_0) ⇝ N(0, σ²),

with mean zero and variance given by

σ² = (E[ψ_a(W; η_0)])^{−2} E[ψ(W; θ_0, η_0)²].
Theorem 2. Under the same regularity conditions, the confidence interval

CI := [ θ̂_0 ± Φ^{−1}(1 − α/2) σ̂ / √N ],

with σ̂² a consistent estimator of σ², contains θ_0 for approximately (1 − α) × 100 percent of data realizations.

Remark 4 (Brief literature overview on double machine learning). The presented double machine learning method was developed in Chernozhukov et al. (2018). The idea of using property (16) to construct estimators and inference procedures that are robust to small mistakes in nuisance parameters can be traced back to Neyman (1959) and has been used explicitly or implicitly in the literature on debiased sparsity-based inference (Belloni et al., 2011, 2014b; Javanmard and Montanari, 2014; van de Geer et al., 2014; Zhang and Zhang, 2014; Chernozhukov et al., 2015b) as well as (implicitly) in the classical semi-parametric learning theory with low-dimensional X (Bickel et al., 1993; Newey, 1994; Van der Vaart, 2000; Van der Laan and Rose, 2011). These references also explain that if we use scores ψ that are not Neyman-orthogonal in high-dimensional settings, then the resulting estimators of θ_0 are not 1/√N-consistent and are generally heavily biased.
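The variance formula of Theorem 1 and the interval of Theorem 2 can be made concrete with a small base-R computation; the toy data below are already residualized (w plays the role of Y − l_0(X) and v that of D − m_0(X)) and are purely illustrative:

```r
# Sketch: sandwich-type variance estimate and confidence interval for the
# PLR "partialling out" score, on simulated residualized toy data.
set.seed(1)
n <- 1000; theta0 <- 0.5
v <- rnorm(n)                       # V = D - m_0(X)
w <- theta0 * v + rnorm(n)          # W = Y - l_0(X)
theta_hat  <- sum(w * v) / sum(v^2) # solves the empirical moment condition
psi        <- (w - theta_hat * v) * v   # evaluated score psi(W; theta_hat, eta)
psi_a      <- -v^2                      # linear-score component psi_a
sigma2_hat <- mean(psi^2) / mean(psi_a)^2
ci <- theta_hat + c(-1, 1) * qnorm(0.975) * sqrt(sigma2_hat / n)
```

The plug-in σ̂² mirrors the population formula, with empirical means replacing expectations and θ̂_0, η̂_0 replacing the true values.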
Remark 5 (Literature on sample splitting). Sample splitting has been used in the traditional semiparametric estimation literature to establish good properties of semiparametric estimators under weak conditions (Schick, 1986; Van der Vaart, 2000). In sparse learning problems with high-dimensional X, sample splitting was employed in Belloni et al. (2012). There and here, the use of sample splitting results in weak conditions on the estimators of nuisance parameters, translating into weak assumptions on sparsity in the case of sparsity-based learning. The presented approach generalizes earlier work (2015), which considered estimation of the special case (1)-(2) using the lasso without cross-fitting. This generalization, by relying upon cross-fitting, opens up the use of a much broader collection of machine learning methods and, in the case the lasso is used to estimate the nuisance functions, allows relaxation of sparsity conditions. All of these approaches can be seen as "debiasing" the estimation of the main parameter by constructing, implicitly or explicitly, score functions that satisfy exact or approximate Neyman orthogonality.

Methods for Simultaneous Inference
In addition to estimation of target causal parameters, standard errors, and confidence intervals, the package DoubleML provides methods to perform valid simultaneous inference based on a multiplier bootstrap procedure introduced in Chernozhukov et al. (2013) and Chernozhukov et al. (2014) and suggested in high-dimensional linear regression models in Belloni et al. (2014a). Accordingly, it is possible to (i) construct simultaneous confidence bands for a potentially large number of causal parameters and (ii) adjust p-values in a test of multiple hypotheses based on the inferential procedure introduced above.
We consider a causal PLR with p 1 causal parameters of interest θ 0,1 , . . . , θ 0,p1 associated with the treatment variables D 1 , . . . , D p1 . The parameter of interest θ 0,j with j = 1, . . . , p 1 solves a corresponding moment condition as for example considered in Belloni et al. (2018). To perform inference in a setting with multiple target coefficients θ 0,j , the double machine learning procedure implemented in DoubleML iterates over the target variables of interest. During estimation of the coefficient θ 0,j , i.e., estimating the effect of treatment D j on Y , the remaining treatment variables enter the nuisance terms by default.
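In DoubleML, this workflow is exposed through the methods bootstrap(), confint(joint = TRUE), and p_adjust(). The following hedged sketch assumes DoubleML, mlr3, mlr3learners, and ranger are installed; treating the covariate X1 as a second "treatment" is purely an illustration device so that p_1 = 2.

```r
# Sketch: simultaneous inference on two target coefficients in a PLR.
library(DoubleML)
library(mlr3)
library(mlr3learners)
lgr::get_logger("mlr3")$set_threshold("warn")

set.seed(3141)
df = make_plr_CCDDHNR2018(alpha = 0.5, n_obs = 500, dim_x = 20,
                          return_type = "data.table")
# illustration only: treat covariate X1 as a second treatment variable
obj = DoubleMLData$new(df, y_col = "y", d_cols = c("d", "X1"))
dml = DoubleMLPLR$new(obj, lrn("regr.ranger", num.trees = 100),
                      lrn("regr.ranger", num.trees = 100))
dml$fit()
dml$bootstrap(method = "normal", n_rep_boot = 500)  # multiplier bootstrap
dml$confint(joint = TRUE)                           # simultaneous confidence bands
dml$p_adjust(method = "romano-wolf")                # multiplicity-adjusted p-values
```

The bootstrap() call must precede confint(joint = TRUE) and the Romano-Wolf adjustment, since both rely on the bootstrapped t-statistics.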
Algorithm 4: Multiplier bootstrap.
(1) Inputs: The estimated coefficients θ̂_0,j, standard errors σ̂_j, and evaluated score functions for j = 1, . . . , p_1.
(2) Multiplier bootstrap: Generate random weights ξ_i^b for each bootstrap repetition b = 1, . . . , B according to a normal (Gaussian), wild, or exponential bootstrap. Based on the estimated standard errors σ̂_j, obtain bootstrapped versions of the coefficients θ̂_j^{*,b} and bootstrapped t-statistics t_j^{*,b}.
(3) Output: Output the bootstrapped coefficients and test statistics.

Remark 7 (Computational efficiency)
The multiplier bootstrap procedure of Chernozhukov et al. (2013) and Chernozhukov et al. (2014) is computationally efficient because it does not require resampling and re-estimation of the causal parameters. Instead, it is sufficient to introduce a random perturbation of the score ψ and solve for θ_0 accordingly.
To construct simultaneous (1 − α)-confidence bands, the multiplier bootstrap presented in Algorithm 4 can be used to obtain a constant c_{1−α} that guarantees asymptotic (1 − α) coverage. The constant c_{1−α} is obtained in two steps.

Implementation Details
In this section, we briefly provide information on implementation details such as the class structure, the data-backend, and the use of machine learning methods. Section 7 provides a demonstration of DoubleML in real-data and simulation examples. More information on the implementation can be found in the DoubleML User Guide, which is available online 4 . All class methods are documented in the documentation of the corresponding class, which can be browsed online 5 or, for example, by using the commands help(DoubleML), help(DoubleMLPLR), or help(DoubleMLData) in R.

Class Structure
The implementation of DoubleML for R is based on object orientation as enabled by the R6 package (Chang, 2020). For an introduction to object orientation in R and the R6 package, we refer to the vignettes of the R6 package that are available online 6 , Chapter 2.1 of Becker et al. (2021), and the chapters on object orientation in Wickham (2019). The structure of the classes is presented in Figure 5. The abstract class DoubleML provides all methods for estimation and inference, for example the methods fit(), bootstrap(), and confint(). All key components associated with estimation and inference are implemented in DoubleML, for example the sample splitting, the implementation of Algorithm 1 (DML1) and Algorithm 2 (DML2), the estimation of the causal parameters, and the computation of the scores ψ(W ; θ, η). Only the model-specific properties and methods are allocated to the classes DoubleMLPLR (implementing the PLR), DoubleMLPLIV (PLIV), DoubleMLIRM (IRM), and DoubleMLIIVM (IIVM). For example, each of the models has one or several Neyman-orthogonal score functions that are implemented in the specific child classes.

Data-Backend and Causal Model
The DoubleMLData class serves as the data-backend and implements the causal model of interest. The user is required to specify the roles of the variables in the data set at hand. Depending on the causal model considered, it is necessary to declare the dependent variable, the treatment variable(s), the confounding variable(s), and, in the case of instrumental variable regression, one or multiple instruments. The data-backend can be initialized from a data.table (Dowle and Srinivasan, 2020). DoubleML provides wrappers to initialize from data.frame and matrix objects, as well.
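As a minimal sketch, a data-backend could be set up from a data.frame via the wrapper mentioned above; the data set and variable names here are hypothetical.

```r
library(DoubleML)

# Hypothetical data set with outcome y, treatment d and confounders x1, x2
set.seed(42)
df <- data.frame(y = rnorm(100), d = rbinom(100, 1, 0.5),
                 x1 = rnorm(100), x2 = rnorm(100))

# Wrapper for data.frame objects; assigns the roles to the columns
dml_data <- double_ml_data_from_data_frame(df, y_col = "y", d_cols = "d",
                                           x_cols = c("x1", "x2"))
```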

Learners, Parameters and Tuning
Generally, all learners provided by the packages mlr3, mlr3learners and mlr3extralearners can be used for estimation of the nuisance functions of the structural models presented above. An interactive list of supported learners is available at the mlr3extralearners website. 7 The mlr3extralearners package makes it possible to add new learners, as well. The performance of the double machine learning estimator θ̂_0 will depend on the predictive quality of the estimation method used. Machine learning methods usually have several (hyper-)parameters that need to be adapted to a specific application. Tuning of model parameters can be performed either externally or internally. The latter is implemented in the method tune() and is further illustrated in an example in Section 7.6.2. Both cases build on the functionalities provided by the package mlr3tuning.

Modifications and Extensions
The flexible architecture of the DoubleML package allows users to modify the estimation procedure in many regards. Among others, users can provide customized sample splitting rules after initialization of the causal model via the method set_sample_splitting(). An example and the detailed requirements are provided in Section 7.7.1. Moreover, it is possible to adjust the Neyman-orthogonal score function by externally providing a customized function via the score option during initialization of the causal model object. A short example is presented in Section 7.7.2.

Estimation of Causal Parameters with DoubleML: Real-Data and Simulated Examples
In this section, we will first demonstrate the use of DoubleML in a real-data example, which is based on data from the Pennsylvania Reemployment Bonus experiment (Bilias, 2000). This empirical example has been used in , as well. The goal in the empirical example is to estimate the causal parameter in a partially linear and an interactive regression model. We further provide a short example of how to perform simultaneous inference with DoubleML. Finally, we present results from a short simulation study as a brief assessment of the finite-sample performance of the implemented estimators.

Initialization of the Data-Backend
We begin our real-data example by downloading the Pennsylvania Reemployment Bonus data set (a connection to the internet is required). The data-backend DoubleMLData can be initialized from a data.table object by specifying the dependent variable Y via a character in y_col, the treatment variable(s) D in d_cols, and the confounders X via x_cols. Moreover, in IV models, an instrument can be specified via z_cols. In the next step, we assign the roles to the variables in the data set: y_col = "inuidur1" serves as outcome variable Y, the column d_cols = "tg" serves as treatment variable D, and the columns in x_cols specify the confounders.

obj_dml_data_bonus = DoubleMLData$new(dt_bonus,
  y_col = "inuidur1",
  d_cols = "tg",
  x_cols = c("female", "black", "othrace", "dep1", "dep2", "q2", "q3", "q4",
             "q5", "q6", "agelt35", "agegt54", "durable", "lusd", "husd"))

Remark 8 (Interface for data.frame and matrix) To initialize an instance of the class DoubleMLData from a data.frame or a collection of matrix objects, DoubleML provides the convenient wrappers double_ml_data_from_data_frame() and double_ml_data_from_matrix(). Examples can be found in the user guide and in the corresponding documentation.

Initialization of the Causal Model
To initialize a PLR model, we have to provide a learner for each nuisance part of the model in Equations (1)-(2). In R, this is done by passing learners to the arguments ml_m for nuisance part m and ml_g for nuisance part g. We can pass a learner as instantiated in mlr3 and mlr3learners, for example a random forest as provided by the R package ranger (Wright and Ziegler, 2017); ranger must be installed beforehand. Moreover, we can specify the score (allowed choices for the PLR are "partialling out" and "IV-type") and the algorithm via the option dml_procedure (allowed choices are "dml1" and "dml2"). Optionally, it is possible to change the number of folds used for sample splitting through n_folds and, if the sample splitting and estimation procedure should be repeated, the number of repetitions via n_rep.
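Putting these pieces together, the initialization could be sketched as follows, using the data-backend obj_dml_data_bonus from above; the hyperparameter values are illustrative rather than tuned.

```r
library(DoubleML)
library(mlr3)
library(mlr3learners)

set.seed(1111)
# Random forest learners for the nuisance parts g and m (requires ranger)
ml_g <- lrn("regr.ranger", num.trees = 500, max.depth = 5)
ml_m <- lrn("regr.ranger", num.trees = 500, max.depth = 5)

# Initialize the PLR model with the data-backend from above
doubleml_bonus <- DoubleMLPLR$new(obj_dml_data_bonus,
                                  ml_g = ml_g, ml_m = ml_m,
                                  score = "partialling out",
                                  dml_procedure = "dml2",
                                  n_folds = 5, n_rep = 1)
```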

Estimation of the Causal Parameter in a PLR Model
To perform estimation, we call the fit() method. The output can be summarized using the method summary(). Hence, we can reject the null hypothesis that θ_{0,tg} = 0 at the 5% significance level. The estimated coefficient and standard errors can be accessed via the public fields coef and se of the object doubleml_bonus. After estimation is completed, we can access the resulting score ψ(W_i; θ̂_0, η̂_0) as well as its components ψ_a(W_i; η̂_0) and ψ_b(W_i; η̂_0), which are available as public fields; for example, the estimated score can be inspected for the first 5 observations.
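A sketch of this workflow is given below; the field names psi, psi_a and psi_b are assumptions based on the description above and may differ across package versions.

```r
doubleml_bonus$fit()       # cross-fitted estimation of the causal parameter
doubleml_bonus$summary()   # coefficient table with standard errors and p-values

doubleml_bonus$coef        # point estimate of the causal parameter
doubleml_bonus$se          # corresponding standard error

# Score psi(W; theta_hat, eta_hat) and its components, first 5 observations
doubleml_bonus$psi[1:5, , ]
doubleml_bonus$psi_a[1:5, , ]
doubleml_bonus$psi_b[1:5, , ]
```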

Estimation of the Causal Parameter in an IRM Model
The treatment variable D in the Pennsylvania Reemployment Bonus example is binary. Accordingly, it is possible to estimate an IRM model. Since the IRM requires estimation of the propensity score P(D = 1|X), we have to specify a classifier for the nuisance part m_0.
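A minimal sketch of the IRM initialization for the binary treatment, reusing the data-backend obj_dml_data_bonus from above; the score option "ATE" is an assumption about the default target parameter.

```r
library(DoubleML)
library(mlr3)
library(mlr3learners)

set.seed(2222)
# Regression learner for g0 and a classifier for the propensity score m0
ml_g <- lrn("regr.ranger", num.trees = 500)
ml_m <- lrn("classif.ranger", num.trees = 500)

doubleml_irm_bonus <- DoubleMLIRM$new(obj_dml_data_bonus,
                                      ml_g = ml_g, ml_m = ml_m,
                                      score = "ATE", n_folds = 5)
doubleml_irm_bonus$fit()
doubleml_irm_bonus$summary()
```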

Simultaneous Inference in a Simulated Data Example
We consider a simulated example of a PLR model to illustrate the use of methods for simultaneous inference. First, we will generate a sparse linear model with only three variables having a non-zero effect on the dependent variable.

A sparse setting suggests the use of the lasso learner. Here, we use the lasso estimator with cross-validated choice of the penalty parameter λ as provided in the glmnet package for R (Friedman et al., 2010).
# output messages during fitting are suppressed
ml_g = lrn("regr.cv_glmnet", s = "lambda.min")
ml_m = lrn("regr.cv_glmnet", s = "lambda.min")

The multiplier bootstrap procedure can be executed using the bootstrap() method, where the option method specifies the choice of the random perturbations and n_rep_boot the number of bootstrap repetitions.
doubleml_plr$bootstrap(method = "normal", n_rep_boot = 1000)

The resulting bootstrapped coefficients and t-statistics are available via the public fields boot_coef and boot_t_stat. To construct a simultaneous confidence interval, we set the option joint = TRUE when calling the confint() method. The correction of the p-values for a joint hypothesis test on the considered causal parameters is implemented in the method p_adjust(). By default, the adjustment procedure specified in the option method is the Romano-Wolf stepdown procedure. Alternatively, the correction methods provided by the stats function p.adjust() can be applied, for example the Bonferroni, Bonferroni-Holm, or Benjamini-Hochberg correction. A Bonferroni correction, for example, can be performed by specifying method = "bonferroni".
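The subsequent inference steps can be sketched as follows, assuming the fitted object doubleml_plr from above; the method string "romano-wolf" is an assumption about the default key.

```r
# Multiplier bootstrap with Gaussian weights
doubleml_plr$bootstrap(method = "normal", n_rep_boot = 1000)

# Simultaneous (1 - alpha)-confidence bands for all coefficients
doubleml_plr$confint(joint = TRUE, level = 0.95)

# Romano-Wolf stepdown adjustment (default) and a Bonferroni correction
doubleml_plr$p_adjust(method = "romano-wolf")
doubleml_plr$p_adjust(method = "bonferroni")
```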

Learners, Parameters and Tuning
The performance of the final double machine learning estimator depends on the predictive performance of the underlying ML method. First, we briefly show how externally tuned parameters can be passed to the learners in DoubleML. Second, it is demonstrated how the parameter tuning can be done internally by DoubleML.

External Tuning and Parameter Passing
Section 3 of the mlr3book (Becker et al., 2021) provides a step-by-step introduction to the powerful tuning functionalities of the mlr3tuning package. Accordingly, it is possible to manually reconstruct the mlr3 regression and classification problems, which are internally handled in DoubleML, and to perform parameter tuning accordingly. One advantage of this procedure is that it allows users to fully exploit the powerful benchmarking and tuning tools of mlr3 and mlr3tuning.
Consider the sparse regression example from above. We briefly consider a setting where we explicitly set the penalty parameter λ for a glmnet estimator rather than using the internal cross-validated choice of cv_glmnet.
Suppose, for simplicity, that some external tuning procedure resulted in an optimal value of λ = 0.1 for nuisance part m and λ = 0.09 for nuisance part g for the first treatment variable, and of λ = 0.095 and λ = 0.085, respectively, for the second treatment variable. After initialization of the model object, we can set the parameter values using the method set_ml_nuisance_params().
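This could be sketched as follows, assuming a PLR with two treatment variables named d1 and d2 in the data-backend doubleml_data; the argument order of set_ml_nuisance_params() (learner name, treatment variable, parameter list) is an assumption based on the description above.

```r
library(DoubleML)
library(mlr3)
library(mlr3learners)

# Plain glmnet learners; lambda is set manually instead of by cross-validation
ml_g <- lrn("regr.glmnet")
ml_m <- lrn("regr.glmnet")
doubleml_plr <- DoubleMLPLR$new(doubleml_data, ml_g = ml_g, ml_m = ml_m)

# Externally tuned values of lambda, per nuisance part and treatment variable
doubleml_plr$set_ml_nuisance_params("ml_m", "d1", list(lambda = 0.1))
doubleml_plr$set_ml_nuisance_params("ml_g", "d1", list(lambda = 0.09))
doubleml_plr$set_ml_nuisance_params("ml_m", "d2", list(lambda = 0.095))
doubleml_plr$set_ml_nuisance_params("ml_g", "d2", list(lambda = 0.085))
```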

Internal Tuning and Parameter Passing
An alternative to external tuning and parameter provisioning is to perform the tuning internally. The advantage of this approach is that users do not have to specify the underlying prediction problems manually. Instead, DoubleML uses the underlying data-backend to ensure that the machine learning methods are tuned for the specific model under consideration, thereby helping to avoid mistakes. We initialize our structural model object with the learner. At this stage, we do not specify any parameters.
The entries in the list specify options during parameter tuning with mlr3tuning: • terminator is a bbotk::Terminator object passed to mlr3tuning that manages the budget to solve the tuning problem.
• algorithm is an object of class mlr3tuning::Tuner and specifies the tuning algorithm. Alternatively, algorithm can be a character() that is passed to the mlr3tuning wrapper tnr(algorithm). The Tuner class in mlr3tuning supports grid search, random search, generalized simulated annealing, and non-linear optimization.
• rsmp_tune is a resampling object that specifies the resampling method for evaluation; for example, rsmp("cv", folds = 5) implements 5-fold cross-validation and rsmp("holdout", ratio = 0.8) implements an evaluation based on a hold-out sample that contains 20 percent of the observations. By default, 5-fold cross-validation is performed.
• measure is a named list containing the measures used for tuning of the nuisance components. The names of the entries must match the learner names (see method learner_names()). The entries in the list must either be objects of class Measure or keys passed to msr(). If measure is not provided by the user, the mean squared error is used for regression models and the classification error for binary outcomes, by default.
In the next code chunk, the value of the parameter λ is tuned via grid search in the range 0.05 to 0.1 at a resolution of 11. 8 To evaluate the predictive performance in both nuisance parts, the cross-validated mean squared error is used.
# Provide tune settings
tune_settings = list(
  terminator = trm("evals", n_evals = 100),
  algorithm = tnr("grid_search", resolution = 11),
  rsmp_tune = rsmp("cv", folds = 5),
  measure = list("ml_g" = msr("regr.mse"), "ml_m" = msr("regr.mse")))

With these settings, we can run the tuning by calling the tune() method for DoubleML objects. By default, the parameter tuning is performed on the whole sample; for example, in the case of K_tune-fold cross-validation, the entire sample is split into K_tune folds for evaluation of the cross-validated error. Alternatively, each of the K folds used in the cross-fitting procedure can be split into K_tune subfolds that are then used for evaluation of the candidate models. As a result, the choice of the tuned parameters will be fold-specific. To perform fold-specific tuning, users can set the option tune_on_folds = TRUE when calling the method tune().
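With the tune settings above, the tuning call could be sketched as follows. The parameter grid is specified via the paradox package; the helpers ps() and p_dbl() and the param_set argument name are assumptions based on the description above.

```r
library(paradox)

# Search space for lambda in both nuisance parts, named after the learners
par_grids <- list(
  "ml_g" = ps(lambda = p_dbl(lower = 0.05, upper = 0.1)),
  "ml_m" = ps(lambda = p_dbl(lower = 0.05, upper = 0.1)))

# Tune on the whole sample (default) ...
doubleml_plr$tune(param_set = par_grids, tune_settings = tune_settings)

# ... or fold-specific tuning within the cross-fitting folds
doubleml_plr$tune(param_set = par_grids, tune_settings = tune_settings,
                  tune_on_folds = TRUE)
```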

Specifications and Modifications of Double Machine Learning
The flexible architecture of the DoubleML package allows users to modify the estimation procedure in many regards. We will shortly present two examples on how users can adjust the double machine learning framework to their needs in terms of the sample splitting procedure and the score function.

Sample Splitting
By default, DoubleML performs cross-fitting as presented in Algorithms 1 and 2. Alternatively, all implemented models allow a partition to be provided externally via the method set_sample_splitting(). Note that setting draw_sample_splitting = FALSE prevents a partition from being drawn during initialization of the model object. The following calls are equivalent. In the first code sample, we use the standard interface and draw the sample splitting with K = 4 folds during initialization of the DoubleMLPLR object.
# First generate some data, ML learners and a data-backend
learner = lrn("regr.ranger", num.trees = 100, mtry = 20,
              min.node.size = 2, max.depth = 5)
ml_g = learner
ml_m = learner
data = make_plr_CCDDHNR2018(alpha = 0.5, n_obs = 100, return_type = "data.table")

In the second code sample, we manually specify a sampling scheme using the mlr3::Resampling class. Alternatively, users can provide a nested list with the following structure:
• The length of the outer list must match the desired number of repetitions of the sample splitting, i.e., n_rep.
• The inner list is a named list of length 2 specifying the entries train_ids and test_ids, which are lists of the same length.
  – train_ids is a list of length n_folds that specifies the indices of the observations used for model fitting in each fold.
  – test_ids is a list of length n_folds that specifies the indices of the observations used for calculation of the score in each fold.

doubleml_plr_external = DoubleMLPLR$new(doubleml_data, ml_g, ml_m,
                                        draw_sample_splitting = FALSE)
set.seed(314)
# set up a task and cross-validation resampling scheme in mlr3
my_task = Task$new("help task", "regr", data)
my_sampling = rsmp("cv", folds = 4)$instantiate(my_task)
train_ids = lapply(1:4, function(x) my_sampling$train_set(x))
test_ids = lapply(1:4, function(x) my_sampling$test_set(x))
smpls = list(list(train_ids = train_ids, test_ids = test_ids))
# Structure of the specified sampling scheme
str(smpls)

Setting the option apply_cross_fitting = FALSE at instantiation of the causal model allows double machine learning to be performed without cross-fitting. It results in randomly splitting the sample into two parts: the first half of the data is used for the estimation of the nuisance models with the machine learning methods and the second half for estimating the causal parameter, i.e., for the solution of the score.
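The no-cross-fitting variant described above could be sketched as follows, reusing the data-backend and learners from this section; the combination of n_folds = 2 with apply_cross_fitting = FALSE is an assumption about the required configuration.

```r
# Plain sample splitting without cross-fitting: one half for the nuisance
# models, the other half for solving the score
doubleml_plr_nocf <- DoubleMLPLR$new(doubleml_data, ml_g, ml_m,
                                     n_folds = 2,
                                     apply_cross_fitting = FALSE)
doubleml_plr_nocf$fit()
doubleml_plr_nocf$summary()
```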
Note that cross-fitting performs well empirically and is recommended to remove the bias induced by overfitting. Moreover, cross-fitting makes it possible to exploit full efficiency: every fold is used once for training the ML methods and once for estimation of the score (Chernozhukov et al., 2018, pp. 6). A short example on the efficiency gains associated with cross-fitting is provided in Section 7.8.1.

Score Function
Users may want to adjust the score function ψ(W ; θ_0, η_0), for example, to re-weight the DML estimators. An alternative to the choices provided in DoubleML is to pass a function via the argument score during initialization of the model object.
The following examples are equivalent. In the first example, we use the score option "partialling out" for the PLR model, whereas in the second case we explicitly provide a function that implements the same score. The arguments of the function refer to the internal objects that implement the theoretical quantities in Equation (17). We define the function that implements the same score and specify the argument score accordingly. The function must return a named list with entries psi_a and psi_b to pass the values for computation of the score.
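A sketch of such a user-written score for the "partialling out" case is given below; the exact callback signature is an assumption and may differ across package versions.

```r
# Partialling-out score psi(W; theta, eta) = (d - m(x)) * (y - g(x) - theta * (d - m(x))),
# decomposed as psi_a * theta + psi_b with
#   psi_a = -(d - m_hat)^2  and  psi_b = (d - m_hat) * (y - g_hat)
score_manual <- function(y, d, g_hat, m_hat, smpls) {
  resid_y <- y - g_hat   # partialled-out outcome
  resid_d <- d - m_hat   # partialled-out treatment
  psi_a <- -resid_d * resid_d
  psi_b <- resid_d * resid_y
  return(list(psi_a = psi_a, psi_b = psi_b))
}

doubleml_plr_custom <- DoubleMLPLR$new(doubleml_data, ml_g, ml_m,
                                       score = score_manual)
```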

A Short Simulation Study
To illustrate the validity of the implemented double machine learning estimators, we perform a brief simulation study.

The Role of Cross-Fitting
As mentioned in Section 7.7.1, the use of the cross-fitting Algorithms 1 (DML1) and 2 (DML2) makes it possible to use sample splitting and exploit full efficiency at the same time. To illustrate the superior performance due to cross-fitting, we compare the double machine learning estimator with and without a cross-fitting procedure in the simulation setting that was presented in Section 4.1. Figure 6 illustrates that efficiency gains can be achieved if the role of the random partitions is swapped in the estimation procedure. Using cross-fitting makes it possible to obtain smaller standard errors for the DML estimator: the empirical distribution of the double machine learning estimator that is based on the cross-fitting Algorithm 2 (DML2) exhibits a more pronounced concentration around zero.

[Figure 6. Left panel: histogram of the centered DML estimator without cross-fitting, θ̂_0 − θ_0, obtained from a sample split into two folds; one fold is used for estimation of the nuisance parameters and the second fold for evaluation of the score function and estimation. The empirical distribution is well-approximated by a normal distribution, as indicated by the red curve. Right panel: histogram of the centered DML estimator with cross-fitting, θ̂_0 − θ_0, obtained from a split into two folds and application of Algorithm 2 (DML2). In both cases, the estimators are based on estimation of g_0 and m_0 with random forests and the orthogonal score function provided in Equation (17), and exactly the same data sets and partitions are used for sample splitting. The empirical distribution of the cross-fitted estimator exhibits a more pronounced concentration around zero, which reflects the smaller standard errors.]

Inference on a Structural Parameter in Key Causal Models
We provide simulation results for double machine learning estimators in the presented key causal models in Figure 7. In a replication of the simulation example in Section 4.1, we show that the confidence intervals for the DML estimator in the partially linear regression model achieve an empirical coverage close to the specified level of 1 − α = 0.95. The estimator is, again, based on a random forest learner. The corresponding results are presented in the top-left panel of Figure 7.
In a simulated example of a PLIV model, the DML confidence interval based on a lasso learner (regr.cv_glmnet of mlr3) achieves a coverage of 94.4%. The underlying data generating process is based on a setting considered in Chernozhukov et al. (2015a) with one instrumental variable. Moreover, for the simulations of the IRM model, we make use of a DGP from Belloni et al. (2017), and the DGP for the IIVM is inspired by a simulation in Farbmacher et al. (2020). We present the formal DGPs in the Appendix. To estimate the nuisance parts in the interactive models, we employ the regression and classification learners regr.cv_glmnet and classif.cv_glmnet as provided by the mlr3learners package. In all cases, we employ the cross-validated lambda.min choice of the penalty parameter with five folds, i.e., the value of λ that minimizes the cross-validated mean squared error. Figure 7 shows that the empirical distribution of the centered estimators obtained in finite samples is relatively well-approximated by a normal distribution. In all models, the empirical coverage achieved by the constructed confidence bands is close to the nominal level.

Simultaneous Inference
To verify the finite-sample performance of the implemented methods for simultaneous inference, we perform a small simulation study in a regression setup similar to the one used in Bach et al. (2018). We would like to perform valid simultaneous inference on the coefficients θ in the regression model with n = 1000 observations and p_1 = 42 regressors. The errors ε_i are normally distributed, ε_i ∼ N(0, σ²), with variance σ² = 3. The regressors d_i are generated by a joint normal distribution, d_i ∼ N(µ, Σ), with µ = 0 and Σ_{j,k} = 0.5^{|j−k|}. The model is sparse in that only the first s = 12 regressors have a non-zero effect on the outcome y_i. The p_1 coefficients θ_1, . . . , θ_{p_1} are generated as θ_j = min(θ_max · a^j, θ_min) for j = 1, . . . , s, with θ_max = 9, θ_min = 0.75, and a = 0.99; all other coefficients are exactly equal to 0. Estimation of the nuisance components has been performed using the lasso as provided by regr.cv_glmnet in mlr3.
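The generation of the regressors and errors in this DGP can be sketched in base R; the coefficient pattern and outcome equation are omitted, and the Cholesky-based sampling is one of several equivalent ways to draw from N(0, Σ).

```r
# Sketch of the regressor and error generation for the simultaneous-inference DGP
set.seed(1)
n <- 1000; p1 <- 42; sigma2 <- 3

# Toeplitz covariance with entries Sigma_jk = 0.5^|j - k|
Sigma <- 0.5^abs(outer(1:p1, 1:p1, "-"))

# d_i ~ N(0, Sigma) via the Cholesky factor of Sigma
d <- matrix(rnorm(n * p1), n, p1) %*% chol(Sigma)

# eps_i ~ N(0, sigma2)
eps <- rnorm(n, sd = sqrt(sigma2))
```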
We report the empirical coverage achieved by a joint (1 − α)-confidence interval for all p_1 = 42 coefficients and the realized family-wise error rate (FWER) of the implemented p-value adjustments over R = 500 repetitions in Table 1. The finite-sample performance of the Romano-Wolf stepdown procedure based on the multiplier bootstrap, as well as of the classical Bonferroni and Bonferroni-Holm corrections, is evaluated. Table 1 shows that all methods achieve an empirical FWER close to the specified level of α = 0.1. In all cases, the double machine learning estimators reject all 12 false null hypotheses in every repetition.

Conclusion
In this paper, we provide an overview of the key ingredients and the major structure of the double/debiased machine learning framework as established in , together with an overview of a collection of structural models. Moreover, we introduce the R package DoubleML, which serves as an implementation of the double machine learning approach. A brief simulation study provides insights into the finite-sample performance of the double machine learning estimator in the key causal models.
The structure of DoubleML is intended to be flexible with regard to the implemented structural models, the resampling scheme, the machine learning methods, the underlying algorithm, and the Neyman-orthogonal scores considered. By providing the R package DoubleML together with its Python twin (Bach et al., 2020), we hope to make double machine learning more accessible to users in practice. Finally, we would like to encourage users to add new structural models, scores and functionalities to the package.

Data generating process for PLIV simulation
The DGP is based on Chernozhukov et al. (2015a), where Σ is a p_x^n × p_x^n matrix with entries Σ_kj = 0.5^{|k−j|} and I_{p_z^n} is an identity matrix of dimension p_z^n × p_z^n. β = γ is a p_x^n-vector with entries β_j = 1/j², and Π = (I_{p_z^n}, 0_{p_z^n × (p_x^n − p_z^n)}). In the simulation example, we have one instrument, i.e., p_z^n = 1, and p_x^n = 20 regressors x_i. In the simulation study, data sets with n = 500 observations are generated in R = 500 independent repetitions.

Data generating process for IRM simulation
The DGP is based on a simulation study in Belloni et al. (2017), with covariates x_i ∼ N(0, Σ), where Σ is a matrix with entries Σ_kj = 0.5^{|k−j|}. β is a p_x-dimensional vector with entries β_j = 1/j², and the constants c_y and c_d are determined by setting the values R²_y = 0.5 and R²_d = 0.5. We consider a setting with n = 1000 and p = 20. Data generation and estimation have been performed in R = 500 independent replications.

Data generating process for IIVM simulation
The DGP is defined with Z ∼ Bernoulli(0.5). The covariates are drawn from a multivariate normal distribution, x_i ∼ N(0, Σ), with entries of the matrix Σ given by Σ_kj = 0.5^{|j−k|}, and β is a p_x-dimensional vector with entries β_j = 1/j². The data generating process is inspired by a process used in a simulation in Farbmacher et al. (2020). In the simulation study, data sets with n = 1000 observations and p_x = 20 confounding variables x_i have been generated in R = 500 independent repetitions.