In this tutorial, we show a typical usage of the package. For illustration purposes, let us generate some data:
## Generate data.
set.seed(1986)

n <- 1000
k <- 3

X <- matrix(rnorm(n * k), ncol = k)
colnames(X) <- paste0("x", seq_len(k))
D <- rbinom(n, size = 1, prob = 0.5)
mu0 <- 0.5 * X[, 1]
mu1 <- 0.5 * X[, 1] + X[, 2]
y <- mu0 + D * (mu1 - mu0) + rnorm(n)
To construct the sequence of optimal groupings, we first need to estimate the CATEs. Here we use the causal forest estimator. To achieve valid inference about the GATEs, we split the sample into a training sample and an honest sample of equal sizes. The forest is built using only the training sample.
## Sample split.
library(aggTrees)

splits <- sample_split(length(y), training_frac = 0.5)
training_idx <- splits$training_idx
honest_idx <- splits$honest_idx

y_tr <- y[training_idx]
D_tr <- D[training_idx]
X_tr <- X[training_idx, ]

y_hon <- y[honest_idx]
D_hon <- D[honest_idx]
X_hon <- X[honest_idx, ]

## Estimate the CATEs. Use only the training sample.
library(grf)
forest <- causal_forest(X_tr, y_tr, D_tr)
cates <- predict(forest, X)$predictions
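Before constructing the groupings, it can be useful to glance at the spread of the estimated CATEs, as little variation would suggest weak effect heterogeneity. A quick base-R check (our own addition, not part of the package workflow):

## Inspect the distribution of the estimated CATEs.
summary(cates)
hist(cates, main = "Estimated CATEs", xlab = "CATE")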
Now we use the build_aggtree function to construct the sequence of groupings. This function approximates the estimated CATEs by a decision tree using only the training sample and computes node predictions (i.e., the GATEs) using only the honest sample. build_aggtree allows the user to choose between two GATE estimators:

- If method = "raw", the GATEs are estimated by taking the differences between the mean outcomes of treated and control units in each node. This estimator is unbiased only in randomized experiments;
- If method = "aipw", the GATEs are estimated by averaging doubly-robust scores (see Appendix below) in each node. This estimator is unbiased also in observational studies, under particular conditions on the construction of the scores.

The doubly-robust scores can be estimated separately and passed in via the scores argument (a sketch of this workflow follows the next code chunk); otherwise, they are estimated internally. Notice the use of the is_honest argument, a logical vector denoting which observations we allocated to the honest sample. This way, build_aggtree knows which observations must be used to construct the tree and compute node predictions.
## Construct the sequence. Use doubly-robust scores.
groupings <- build_aggtree(y, D, X, method = "aipw", cates = cates,
                           is_honest = 1:length(y) %in% honest_idx)
## Print.
print(groupings)
## Plot.
plot(groupings) # Try also setting 'sequence = TRUE'.
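As mentioned above, we can also construct the doubly-robust scores ourselves and pass them to build_aggtree via the scores argument. Below is a minimal sketch under simplifying assumptions: the nuisance functions are fitted via grf's regression_forest on the training sample only (a shortcut relative to the cross-fitting needed for valid inference), following the score formula reported in the Appendix below. The names mu0_forest, mu1_forest, p_forest, and my_scores are our own.

## Sketch: estimate the doubly-robust scores externally, then pass them in.
## Nuisance functions are fitted on the training sample only; this is a
## simplification of the cross-fitting required for valid inference.
mu0_forest <- regression_forest(X_tr[D_tr == 0, ], y_tr[D_tr == 0])  # E[Y | D = 0, X]
mu1_forest <- regression_forest(X_tr[D_tr == 1, ], y_tr[D_tr == 1])  # E[Y | D = 1, X]
p_forest <- regression_forest(X_tr, D_tr)                            # P(D = 1 | X)

mu0_hat <- predict(mu0_forest, X)$predictions
mu1_hat <- predict(mu1_forest, X)$predictions
p_hat <- predict(p_forest, X)$predictions

## Combine into the doubly-robust scores (see the formula in the Appendix).
my_scores <- mu1_hat - mu0_hat +
  D * (y - mu1_hat) / p_hat -
  (1 - D) * (y - mu0_hat) / (1 - p_hat)

groupings_ext <- build_aggtree(y, D, X, method = "aipw", cates = cates,
                               is_honest = 1:length(y) %in% honest_idx,
                               scores = my_scores)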
Now that we have a whole sequence of optimal groupings, we can pick the grouping associated with our preferred granularity level and call the inference_aggtree function. This function gets standard errors for the GATEs by estimating via OLS the linear model appropriate to the method we used when we called build_aggtree (see Appendix below). To report the results, we can print nice LaTeX tables: one with the GATEs and their differences across groups, and one with the average characteristics of the units in each group.
## Inference with 4 groups.
results <- inference_aggtree(groupings, n_groups = 4)
## LaTeX.
print(results, table = "diff")
print(results, table = "avg_char")
The point of estimating the linear models is to get standard errors for the GATEs. Under an honesty condition, we can use the estimated standard errors to conduct valid inference as usual, e.g., by constructing conventional confidence intervals. Honesty is a subsample-splitting technique that requires using different observations to form the subgroups and to estimate the GATEs. inference_aggtree always uses the honest sample to estimate the linear models below (unless we called build_aggtree without the honesty settings).

If we set method = "raw", inference_aggtree estimates via OLS the following linear model:
\[\begin{equation} Y_i = \sum_{l = 1}^{|\mathcal{T}_{\alpha}|} L_{i, l} \, \gamma_l + \sum_{l = 1}^{|\mathcal{T}_{\alpha}|} L_{i, l} \, D_i \, \beta_l + \epsilon_i \end{equation}\]
with \(|\mathcal{T}_{\alpha}|\) the number of leaves of a particular tree \(\mathcal{T}_{\alpha}\), and \(L_{i, l}\) a dummy variable equal to one if the \(i\)-th unit falls in the \(l\)-th leaf of \(\mathcal{T}_{\alpha}\). Exploiting the random assignment to treatment, we can show that each \(\beta_l\) identifies the GATE in the \(l\)-th leaf. Under honesty, the OLS estimator \(\hat{\beta}_l\) of \(\beta_l\) is root-\(n\) consistent and asymptotically normal.
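To fix ideas, here is a minimal sketch of running this regression by hand on the honest sample. The groups are illustrative: we form four of them from the quartiles of the estimated CATEs rather than extracting leaf membership from the fitted tree (inference_aggtree handles all of this internally).

## Sketch: the "raw" regression by hand on the honest sample. Groups are
## illustrative (CATE quartiles), not the leaves of the fitted tree.
leaf <- cut(cates[honest_idx],
            breaks = quantile(cates[honest_idx], probs = seq(0, 1, 0.25)),
            include.lowest = TRUE, labels = paste0("leaf", 1:4))
ols_raw <- lm(y_hon ~ 0 + leaf + leaf:D_hon)
summary(ols_raw)  # coefficients on the leaf:D_hon terms estimate the GATEs.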
If we set method = "aipw"
,
inference_aggtree
estimates via OLS the following linear
model:
\[\begin{equation} \widehat{\Gamma}_i = \sum_{l = 1}^{|\mathcal{T}_{\alpha}|} L_{i, l} \, \beta_l + \epsilon_i \end{equation}\]
where \(\widehat{\Gamma}_i\) is an estimate of the following doubly-robust score:
\[\begin{equation*} \Gamma_i = \mu \left( 1, X_i \right) - \mu \left( 0, X_i \right) + \frac{D_i \left[ Y_i - \mu \left( 1, X_i \right) \right]}{p \left( X_i \right)} - \frac{ \left( 1 - D_i \right) \left[ Y_i - \mu \left( 0, X_i \right) \right]}{1 - p \left( X_i \right)} \end{equation*}\]
with \(\mu \left( D_i, X_i \right) = \mathbb{E} \left[ Y_i | D_i, X_i \right]\) the conditional mean of \(Y_i\) and \(p \left( X_i \right) = \mathbb{P} \left( D_i = 1 | X_i \right)\) the propensity score. These scores are inherited from those used in the build_aggtree call. As before, we can show that each \(\beta_l\) identifies the GATE in the \(l\)-th leaf, this time even in observational studies. Under honesty, the OLS estimator \(\hat{\beta}_l\) of \(\beta_l\) is root-\(n\) consistent and asymptotically normal, provided that the \(\widehat{\Gamma}_i\) are cross-fitted and that the product of the convergence rates of the estimators of the nuisance functions \(\mu \left( \cdot, \cdot \right)\) and \(p \left( \cdot \right)\) is faster than \(n^{1/2}\).
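Analogously, here is a minimal sketch of this regression by hand, reusing the externally constructed scores (my_scores) and the illustrative CATE-quartile groups (leaf) from the sketches above:

## Sketch: the "aipw" regression by hand on the honest sample, using the
## illustrative groups and externally estimated scores from above.
ols_aipw <- lm(my_scores[honest_idx] ~ 0 + leaf)
summary(ols_aipw)  # each coefficient estimates the GATE in one group.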