# Causal Conditional Distance
Correlation

#### Eric W. Bridgeford

#### 2024-03-01

```
require(causalBatch)
require(ggplot2)
require(tidyr)
n = 200
```

To start, we will begin with a simulation example, similar to the
ones we were working in for the simulations, which you can access
from:

`vignette("cb.simulations", package="causalBatch")`

Letâ€™s regenerate our working example data with some plotting
code:

```
# a function for plotting a scatter plot of the data
plot.sim <- function(Ys, Ts, Xs, title="",
xlabel="Covariate",
ylabel="Outcome (1st dimension)") {
data = data.frame(Y1=Ys[,1], Y2=Ys[,2],
Group=factor(Ts, levels=c(0, 1), ordered=TRUE),
Covariates=Xs)
data %>%
ggplot(aes(x=Covariates, y=Y1, color=Group)) +
geom_point() +
labs(title=title, x=xlabel, y=ylabel) +
scale_x_continuous(limits = c(-1, 1)) +
scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"),
name="Group/Batch") +
theme_bw()
}
```

Next, we will generate a simulation:

```
sim = cb.sims.sim_sigmoid(n=n, eff_sz=1, unbalancedness=1.5)
plot.sim(sim$Ys, sim$Ts, sim$Xs, title="Sigmoidal Simulation")
```

Despite the fact that the covariate distributions for each
group/batch do not overlap perfectly (note that
`unbalancedness`

is not \(1\)), it looks like the two batches still
appear to be slightly different. We can test this using the causal
conditional distance correlation, like so:

`result <- cb.detect.caus_cdcorr(sim$Ys, sim$Ts, sim$Xs, R=100)`

Here, we set the number of null replicates `R`

to \(100\) to make the simulation run faster,
but in practice you should typically use at least \(1000\) null replicates. To make this
faster, we would suggest setting `num.threads`

to be close to
the maximum number of cores available on your machine. You can identify
the number of cores available on your machine using
`parallel::detectCores()`

.

With the \(\alpha\) of the test at
\(0.05\), we see that the \(p\)-value is:

```
print(sprintf("p-value: %.4f", result$Test$p.value))
#> [1] "p-value: 0.0099"
```

Since the \(p\)-value is \(< \alpha\), we reject the null
hypothesis in favor of the alternative; that is, that the group/batch
causes a difference in the outcome variable.

We could optionally have pre-computed a distance matrix for the
outcomes, like so:

```
# compute distance matrix for outcomes
DY = dist(sim$Ys)
```

In your use-cases, you could substitute this distance function for
any distance function of your choosing, and pass a distance matrix
directly to the detection algorithm, by specifying that
`distance=TRUE`

:

`result <- cb.detect.caus_cdcorr(DY, sim$Ts, sim$Xs, distance=TRUE, R=100)`