In this tutorial, we show a typical usage of the package. For illustration purposes, let us generate some data:

```
## Generate data.
set.seed(1986)
n <- 1000
k <- 3

X <- matrix(rnorm(n * k), ncol = k)
colnames(X) <- paste0("x", seq_len(k))

D <- rbinom(n, size = 1, prob = 0.5)

mu0 <- 0.5 * X[, 1]
mu1 <- 0.5 * X[, 1] + X[, 2]

y <- mu0 + D * (mu1 - mu0) + rnorm(n)
```

To construct the sequence of optimal groupings, we first need to estimate the CATEs. Here we use the causal forest estimator. To achieve valid inference about the GATEs, we split the sample into a training sample and an honest sample of equal size. The forest is built using only the training sample.

```
## Sample split.
splits <- sample_split(length(y), training_frac = 0.5)
training_idx <- splits$training_idx
honest_idx <- splits$honest_idx

y_tr <- y[training_idx]
D_tr <- D[training_idx]
X_tr <- X[training_idx, ]

y_hon <- y[honest_idx]
D_hon <- D[honest_idx]
X_hon <- X[honest_idx, ]

## Estimate the CATEs. Use only the training sample.
library(grf)
forest <- causal_forest(X_tr, y_tr, D_tr)
cates <- predict(forest, X)$predictions
```

Now we use the `build_aggtree` function to construct the sequence of groupings. This function approximates the estimated CATEs by a decision tree using only the training sample and computes the node predictions (i.e., the GATEs) using only the honest sample. `build_aggtree` allows the user to choose between two GATE estimators:

- If we set `method = "raw"`, the GATEs are estimated by taking the differences between the mean outcomes of treated and control units in each node. This is an unbiased estimator (only) in randomized experiments;
- If we set `method = "aipw"`, the GATEs are estimated by averaging the doubly-robust scores (see the Appendix below) in each node. This is an unbiased estimator also in observational studies, under particular conditions on the construction of the scores.
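As an illustration of the `method = "raw"` logic, the following sketch computes the difference in mean outcomes between treated and control units within one hypothetical group; the function name `raw_gate` and the simulated data are ours, not part of the package:

```r
## Difference-in-means GATE estimator within one group (a sketch).
## 'in_group' is a logical vector marking the units in the group.
raw_gate <- function(y, D, in_group) {
  mean(y[in_group & D == 1]) - mean(y[in_group & D == 0])
}

## Simulated randomized experiment with a constant treatment effect of 1.
set.seed(1)
n <- 1000
D <- rbinom(n, size = 1, prob = 0.5)
y <- D + rnorm(n)

raw_gate(y, D, rep(TRUE, n)) # close to the true effect of 1
```

Because assignment is randomized, this simple contrast is unbiased for the group effect; in observational data it would generally be confounded, which motivates the `"aipw"` option.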

The doubly-robust scores can be estimated separately and passed in via the `scores` argument; otherwise, they are estimated internally. Notice the use of the `is_honest` argument, a logical vector denoting which observations we allocated to the honest sample. This way, `build_aggtree` knows which observations must be used to construct the tree and which to compute the node predictions.

```
## Construct the sequence. Use doubly-robust scores.
groupings <- build_aggtree(y, D, X, method = "aipw", cates = cates,
                           is_honest = 1:length(y) %in% honest_idx)

## Print.
print(groupings)

## Plot.
plot(groupings) # Try also setting 'sequence = TRUE'.
```

Now that we have a whole sequence of optimal groupings, we can pick the grouping associated with our preferred granularity level and call the `inference_aggtree` function. This function does the following:

- It gets standard errors for the GATEs by estimating via OLS appropriate linear models using the honest sample. The choice of the linear model depends on the `method` we used when we called `build_aggtree` (see the Appendix below);
- It tests the null hypotheses that the differences in the GATEs across all pairs of groups equal zero. Here, we account for multiple hypothesis testing by adjusting the \(p\)-values using Holm's procedure;
- It computes the average characteristics of the units in each group.
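Holm's procedure is available in base R via `p.adjust`. A minimal sketch with hypothetical \(p\)-values from the six pairwise comparisons among four groups:

```r
## Hypothetical p-values for the 6 pairwise GATE differences (4 groups).
p_values <- c(0.001, 0.004, 0.012, 0.03, 0.2, 0.5)

## Holm's step-down adjustment controls the family-wise error rate.
p.adjust(p_values, method = "holm")
# → 0.006 0.020 0.048 0.090 0.400 0.500
```

Each sorted \(p\)-value is multiplied by the number of hypotheses not yet rejected, and monotonicity is then enforced.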

To report the results, we can print nice LaTeX tables.

```
## Inference with 4 groups.
results <- inference_aggtree(groupings, n_groups = 4)

## LaTeX tables.
print(results, table = "diff")
print(results, table = "avg_char")
```

The point of estimating the linear models is to get standard errors for the GATEs. Under an honesty condition, we can use the estimated standard errors to conduct valid inference as usual, e.g., by constructing conventional confidence intervals. Honesty is a sample-splitting technique that requires that different observations are used to form the subgroups and to estimate the GATEs. `inference_aggtree` always uses the honest sample to estimate the linear models below (unless we called `build_aggtree` without the honesty settings).

If we set `method = "raw"`, `inference_aggtree` estimates via OLS the following linear model:

\[\begin{equation} Y_i = \sum_{l = 1}^{|\mathcal{T}_{\alpha}|} L_{i, l} \, \gamma_l + \sum_{l = 1}^{|\mathcal{T}_{\alpha}|} L_{i, l} \, D_i \, \beta_l + \epsilon_i \end{equation}\]

with \(|\mathcal{T}_{\alpha}|\) the number of leaves of a particular tree \(\mathcal{T}_{\alpha}\), and \(L_{i, l}\) a dummy variable equal to one if the \(i\)-th unit falls in the \(l\)-th leaf of \(\mathcal{T}_{\alpha}\). Exploiting the random assignment to treatment, we can show that each \(\beta_l\) identifies the GATE in the \(l\)-th leaf. Under honesty, the OLS estimator \(\hat{\beta}_l\) of \(\beta_l\) is root-\(n\) consistent and asymptotically normal.
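This specification can be sketched with simulated data; the leaf assignment below is hypothetical (in practice it comes from the chosen tree), and the variable names are ours:

```r
## Simulate two leaves with true GATEs of 0.5 and 2.
set.seed(1986)
n <- 1000
leaf <- factor(sample(1:2, n, replace = TRUE))
D <- rbinom(n, size = 1, prob = 0.5)
y <- D * ifelse(leaf == 1, 0.5, 2) + rnorm(n)

## OLS with leaf dummies and leaf-by-treatment interactions:
## the coefficients on 'leaf1:D' and 'leaf2:D' estimate the GATEs.
fit <- lm(y ~ 0 + leaf + leaf:D)
coef(fit)
```

The leaf dummies absorb the leaf-specific baselines \(\gamma_l\), so each interaction coefficient is the within-leaf treated-control contrast.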

If we set `method = "aipw"`, `inference_aggtree` estimates via OLS the following linear model:

\[\begin{equation} \widehat{\Gamma}_i = \sum_{l = 1}^{|\mathcal{T}_{\alpha}|} L_{i, l} \, \beta_l + \epsilon_i \end{equation}\]

where \(\widehat{\Gamma}_i\) is an estimate of the doubly-robust score \(\Gamma_i\):

\[\begin{equation*} \Gamma_i = \mu \left( 1, X_i \right) - \mu \left( 0, X_i \right) + \frac{D_i \left[ Y_i - \mu \left( 1, X_i \right) \right]}{p \left( X_i \right)} - \frac{ \left( 1 - D_i \right) \left[ Y_i - \mu \left( 0, X_i \right) \right]}{1 - p \left( X_i \right)} \end{equation*}\]

with \(\mu \left(D_i, X_i \right) = \mathbb{E} \left[ Y_i | D_i, X_i \right]\) the conditional mean of \(Y_i\) and \(p \left( X_i \right) = \mathbb{P} \left( D_i = 1 | X_i \right)\) the propensity score. These scores are inherited from the `build_aggtree` call. As before, we can show that each \(\beta_l\) identifies the GATE in the \(l\)-th leaf, this time even in observational studies. Under honesty, the OLS estimator \(\hat{\beta}_l\) of \(\beta_l\) is root-\(n\) consistent and asymptotically normal, provided that the \(\Gamma_i\) are cross-fitted and that the product of the convergence rates of the estimators of the nuisance functions \(\mu \left( \cdot, \cdot \right)\) and \(p \left( \cdot \right)\) is faster than \(n^{1/2}\).
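A minimal sketch of the scores themselves, assuming a randomized experiment with a known propensity score of 0.5 and oracle conditional means for illustration (the names `p_hat`, `mu1_hat`, and `mu0_hat` are ours):

```r
set.seed(1986)
n <- 1000
x <- rnorm(n)
D <- rbinom(n, size = 1, prob = 0.5)
y <- D * (0.5 + x) + rnorm(n) # true CATE: 0.5 + x

p_hat   <- rep(0.5, n) # known propensity score
mu1_hat <- 0.5 + x     # oracle conditional means, for illustration
mu0_hat <- rep(0, n)

## Doubly-robust (AIPW) scores.
gamma <- mu1_hat - mu0_hat +
  D * (y - mu1_hat) / p_hat -
  (1 - D) * (y - mu0_hat) / (1 - p_hat)

mean(gamma) # averaging the scores estimates the ATE (0.5 here)
```

Averaging the scores within a leaf, rather than over the whole sample, yields the leaf's GATE estimate; in practice the nuisance functions must be estimated and cross-fitted as discussed above.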