- How
are standardized mean differences computed in
`cobalt`

? - Why are mean differences not standardized for binary covariates?
- Why do you use the same standardization factor before and after adjustment?
- How
do I extract the balance tables from the
`bal.tab()`

object? - How are balance statistics computed when using subclassification?
- Why
don’t I get the same balance statistics when using
`cobalt`

as I do when using`tableone`

? - References

`cobalt`

?Most questions below are related to this one, so here I will try to explain in complete detail how standardized mean differences (SMDs) are computed.

First, it is important to know that by default, **mean
differences for binary covariates are not standardized**. That
means in the `Diff.Adj`

column, etc., what you are seeing the
is *raw difference in proportion* for binary variables. (By raw,
I mean unstandardized, but weights may be applied if relevant.) To
request SMDs for binary covariates, set `binary = "std"`

in
the call to `bal.tab()`

or `love.plot()`

. See also
the question below.

For continuous covariates, the standardized mean difference is
computed as \[
\text{SMD} = \frac{\bar{x}_1 - \bar{x}_0}{s^*}
\] where \(\bar{x}_1\) is the
mean of the covariate in the treated group, \(\bar{x}_0\) is the mean of the covariate in
the control group, and \(s^*\) is a
standardization factor (not necessarily a standard deviation!). After
matching or weighting, the weighted standardized mean difference is
computed as \[
\text{SMD}^w = \frac{\bar{x}_{1w} - \bar{x}_{0w}}{s^*}
\] where \(\bar{x}_{1w}\) is the
weighted mean of the covariate in the treated group, i.e., \(\bar{x}_{1w} = \frac{1}{\sum_{i:A_i =
1}{w_i}}\sum_{i:A_i=1}{w_ix_i}\), and similarly for the control
group. Critically, the standardization factor \(s^*\) is the same before and after
weighting. I will repeat, **the standardization factor**
\(s^*\) is the same before and after
weighting. I don’t mean it has the same formula, it mean it is literally
the same value. I explain in more detail in a question below why this is
the case.

How is the standardization factor computed? This depends on the
argument to `s.d.denom`

supplied to `bal.tab()`

or
`love.plot()`

. When `s.d.denom`

is not supplied,
this is determined by the argument supplied to `estimand`

,
and when that is not supplied, the estimand is guess based on the form
of the weights, if any. By default, with no weights supplied and no
argument to `s.d.denom`

or `estimand`

,
`s.d.denom`

is set to `"pooled"`

, and a note will
appear saying so. That note doesn’t appear if weights are supplied or
balance is assessed on the output of a another package, as the estimand,
and therefore `s.d.denom`

, can be determined
automatically.

Below are the formulas for the standardization factor corresponding
to each value of `s.d.denom`

:

`"pooled"`

: \(s^* = \sqrt{\frac{s_1^2 + s_0^2}{2}}\)`"treated"`

: \(s^* = s_1\)`"control"`

: \(s^* = s_0\)`"all"`

: \(s^* = s\)`"weighted"`

: \(s^* = s_w\)`"hedges"`

: \(s^* = \frac{1}{1 - \frac{3}{4(n - 2) - 1}}\sqrt{\frac{(n_1 - 1)s_1^2 + (n_0 - 1)s_0^2}{n - 2}}\)

where \(s_1\) is the standard
deviation of the treated group, \(s_0\)
is the standard deviation of the control group, \(s\) is the standard deviation of the whole
sample ignoring treatment group membership, \(s_w\) is the weighted standard deviation of
the whole sample, \(n_1\) and \(n_0\) are sizes of the treated and control
groups, respectively, and \(n = n_1 +
n_0\).^{1} For continuous covariates, the unweighted
standard deviation is computed as usual, i.e., as \[
s = \sqrt{\frac{1}{n-1}\sum_i{(x_i - \bar{x})^2}}
\] and the weighted standard deviation is computed as \[
s = \sqrt{\frac{\sum_{i} w_{i}}{(\sum_{i} w_{i})^2 - \sum_{i=1}^{n}
w^2_{i}}\sum_i{w_i(x_i - \bar{x}_w)^2}}
\]For binary covariates, the unweighted standard deviation is
computed as \(s =
\sqrt{\bar{x}(1-\bar{x})}\) and the weighted standard deviation
is computed as \(s =
\sqrt{\bar{x}_w(1-\bar{x}_w)}\).

When sampling weights are supplied, all standard deviations in the
standardization factor are computed incorporating the sampling weights.
When `s.d.denom = "weighted"`

, the standardization factor is
computed using the weights used to balance the sample (i.e., the
matching or weighting weights), even for the unadjusted sample.
Remember, the standardization is ALWAYS the same before and after
adjustment.

I know some of these formulas seem overly complicated for such simple statistics, but they are required to keep things consistent and not dependent on the scale of the weights.

Ultimately, bias in the treatment effect estimate is a function of imbalance. That bias is indifferent to whether you measure that imbalance using a standardized or unstandardized mean difference. The reason we use SMDs is that covariates naturally are on a variety of different scales, and when trying to quickly assess whether a sample is balanced, it is productive to unify the scales of the covariates. That way, balance on a covariate measured with large numbers (e.g., days in hospital or prior earnings in dollars) can be assessed alongside balance on a covariate measured with small numbers (e.g., number of comorbidities or years of education).

With binary covariates, though, they are already on a comprehensible scale, so there is no need to standardize. In addition, the scale is intuitive for people; a difference in proportion of .1 when both groups have 100 people means that there is an imbalance of 10 people on the covariate. Are 10 people being different enough to cause bias in the estimate? That can be assessed substantively without needing to take the additional step of translating the variable’s scale into something meaningful.

Another important reason why mean difference are not standardized is
that it is possible for two covariates with the same imbalance to have
vastly different mean differences. For example, consider the following
dataset. `X1`

and `X2`

both have a mean difference
of .1; if they both affected the outcome equally, then each would
contribute to the bias in the estimate to the same extent.

```
treat <- rep(1:0, each = 20)
X1 <- c(rep(0:1, c(1, 19)), rep(0:1, c(3, 17)))
X2 <- c(rep(0:1, c(9, 11)), rep(0:1, c(11, 9)))
bal.tab(treat ~ X1 + X2,
binary = "raw",
disp = "means",
s.d.denom = "treated")
#> Balance Measures
#> Type M.0.Un M.1.Un Diff.Un
#> X1 Binary 0.85 0.95 0.1
#> X2 Binary 0.45 0.55 0.1
#>
#> Sample sizes
#> Control Treated
#> All 20 20
```

But if we standardized the mean differences, not only do we move away
from an actually interpretable statistic (i.e., what does it mean to
divide by the standard deviation of a binary variable?), we see that the
standardized mean difference vary by a huge amount, with `X1`

having twice the imbalance of `X2`

.

```
bal.tab(treat ~ X1 + X2,
binary = "std",
s.d.denom = "treated")
#> Balance Measures
#> Type Diff.Un
#> X1 Binary 0.4588
#> X2 Binary 0.2010
#>
#> Sample sizes
#> Control Treated
#> All 20 20
```

Why does this happen? The standard deviation of a binary variable is a function of its mean (in particular, it is \(s = \sqrt{p(1-p)}\)) where \(p\) is the mean of the variable). That means information about the mean of the variable, which is unrelated to imbalance, contaminates the standardized mean difference, which is supposed to measure imbalance. In this case, standardizing the mean difference only adds confusion and reduces interpretability. That is why, by default, mean differences for binary variables are unstandardized by default.

You can always change this by setting `binary = "std"`

in
the call to `bal.tab()`

or setting
`set.cobalt.options(binary = "std")`

to change the option for
the whole session. One advantage of using standardized mean differences
for binary variables is that they are always larger than the raw mean
difference (because the standardization factor is always less than 1),
which means if you use the standardized mean difference as your balance
criterion, you will always seek better balance than using the raw mean
differences. The balance statistics computed by
`bal.compute()`

that involve the standardized mean difference
standardize all variables, including binary variables.

It is important to remember that bias is a function of the difference in means of a covariate and standardization is a tool to aid in balance assessment. As a tool, it should reflect imbalance accurately (i.e., without incorporating extraneous information), but there is no statistical “truth” about the nature of the standardization factor. I use the same standardization factor before and after adjustment as recommended by Stuart (2008). The rationale is that by isolating the SMD to reflect changes in the difference in means, why can more accurately assess improvement in balance rather than combining information about the difference in means with information about the variability of the covariate, which may change in a variety of ways after adjustment. I describe a specific example of how allowing the standardization factor to change can cause problems here.

`bal.tab()`

object?The output of a call to `bal.tab()`

is a
`bal.tab`

object, which has several components depending on
the features of the dataset (e.g., whether the data are multiply imputed
or clustered or whether the treatment is binary or multi-category,
etc.). In the most basic case, a binary or continuous treatment with no
clustering, no multiple imputations, a single time point, and
subclassification is not used, the balance table is stored in the
`Balance`

component of the output object. Let’s take a
look:

```
data("lalonde")
b <- bal.tab(treat ~ age + educ + race + married + re74,
data = lalonde, s.d.denom = "treated",
disp = "means", stats = c("m", "v"))
# View the structure of the object
str(b, give.attr = FALSE)
#> List of 2
#> $ Balance :'data.frame': 7 obs. of 9 variables:
#> ..$ Type : chr [1:7] "Contin." "Contin." "Binary" "Binary" ...
#> ..$ M.0.Un : num [1:7] 28.03 10.235 0.203 0.142 0.655 ...
#> ..$ M.1.Un : num [1:7] 25.8162 10.3459 0.8432 0.0595 0.0973 ...
#> ..$ Diff.Un : num [1:7] -0.3094 0.055 0.6404 -0.0827 -0.5577 ...
#> ..$ V.Ratio.Un : num [1:7] 0.44 0.496 NA NA NA ...
#> ..$ M.0.Adj : num [1:7] NA NA NA NA NA NA NA
#> ..$ M.1.Adj : num [1:7] NA NA NA NA NA NA NA
#> ..$ Diff.Adj : num [1:7] NA NA NA NA NA NA NA
#> ..$ V.Ratio.Adj: num [1:7] NA NA NA NA NA NA NA
#> $ Observations:'data.frame': 1 obs. of 2 variables:
#> ..$ Control: num 429
#> ..$ Treated: num 185
b$Balance
#> Type M.0.Un M.1.Un Diff.Un V.Ratio.Un M.0.Adj
#> age Contin. 28.0303030 2.581622e+01 -0.30944526 0.4399955 NA
#> educ Contin. 10.2354312 1.034595e+01 0.05496466 0.4958934 NA
#> race_black Binary 0.2027972 8.432432e-01 0.64044604 NA NA
#> race_hispan Binary 0.1421911 5.945946e-02 -0.08273168 NA NA
#> race_white Binary 0.6550117 9.729730e-02 -0.55771436 NA NA
#> married Binary 0.5128205 1.891892e-01 -0.32363132 NA NA
#> re74 Contin. 5619.2365064 2.095574e+03 -0.72108381 0.5181285 NA
#> M.1.Adj Diff.Adj V.Ratio.Adj
#> age NA NA NA
#> educ NA NA NA
#> race_black NA NA NA
#> race_hispan NA NA NA
#> race_white NA NA NA
#> married NA NA NA
#> re74 NA NA NA
```

It’s not a very pretty object, which is why the `print()`

method makes it look nicer. If you are willing to process this table
yourself, you can easily extract it from the `bal.tab()`

output and do what you want with it, e.g., saving it to a CSV file or
making a pretty table using another package. Although I have been
working on a way to do this more easily (i.e., to create a
publication-ready table from a `bal.tab`

object), it might be
a while because the main purpose of this package is balance assessment,
not formatting for publication (although I did put a lot of work into
`love.plot()`

to make it customizable for publication).

Subclassification involves creating strata (usually based on the
propensity score), within which covariates are ideally balanced.
`bal.tab()`

lets you assess balance both within and across
subclasses.

One must always remember that the standardized mean difference uses
the standardization factor computed in the original sample, i.e., prior
to subclassification. Let’s take a look below using
`MatchIt`

:

```
# PS Subclassification
msub <- MatchIt::matchit(treat ~ age + educ + race + married + re74,
data = lalonde, method = "subclass",
estimand = "ATE", min.n = 4)
# Balance in the first subclass
bal.tab(msub, which.sub = 1, binary = "std")
#> Balance by subclass
#> - - - Subclass 1 - - -
#> Type Diff.Adj
#> distance Distance 0.1574
#> age Contin. -1.0433
#> educ Contin. -0.2759
#> race_black Binary 0.0000
#> race_hispan Binary 0.0000
#> race_white Binary 0.0000
#> married Binary -1.1135
#> re74 Contin. -1.8353
```

Let’s see where the number `-1.0433`

came from (the
standardized mean difference for `age`

). We compute the mean
of `age`

in each treatment group in subclass 1, and then
divide it by the pooled standard deviation of age (because we requested
the ATE) in *the original sample*.

```
m0 <- mean(lalonde$age[lalonde$treat == 0 & msub$subclass == 1])
m1 <- mean(lalonde$age[lalonde$treat == 1 & msub$subclass == 1])
s0 <- sd(lalonde$age[lalonde$treat == 0])
s1 <- sd(lalonde$age[lalonde$treat == 1])
(m1 - m0) / sqrt((s1^2 + s0^2) / 2)
#> [1] -1.043294
```

A common mistake is to compute the standard deviation within each subclass. There are a few reasons why this is bad: 1) it suffers from the same problem that changing the standardization factor does with matching or weighting, i.e., that balance can appear to be worse because the standardization factor shrank even as the means got closer together; 2) when there is no or little variation of a covariate within a subclass, which is desirable, the standardization factor will be tiny, making the SMD potentially appear huge; and 3) the same variable will use different standardization factors across subclasses, which means the same difference in means, which contribute to bias equally, will have different balance statistics.

A related question is how the balance statistics are computed across
subclasses to compute an overall balance statistic for the sample. For
(standardized) mean differences, it is as easy as computing the average
of the statistic across subclasses, where the statistics are weighted
corresponding to the number of units in the subclass in the target group
(e.g., the treated units for the ATT, all units for the ATE, etc.).
Below I’ll demonstrate how to do that manually for the `age`

covariate:

```
# SMDs across subclasses for age
smds <- sapply(1:6, function(s) {
m0 <- mean(lalonde$age[lalonde$treat == 0 & msub$subclass == s])
m1 <- mean(lalonde$age[lalonde$treat == 1 & msub$subclass == s])
s0 <- sd(lalonde$age[lalonde$treat == 0])
s1 <- sd(lalonde$age[lalonde$treat == 1])
(m1 - m0) / sqrt((s1^2 + s0^2) / 2)
})
# Sample size in each subclass
ns <- table(msub$subclass)
# Summary SMD for age
weighted.mean(smds, ns)
#> [1] -0.2354095
bal.tab(msub)
#> Balance measures across subclasses
#> Type Diff.Adj
#> distance Distance 0.1081
#> age Contin. -0.2354
#> educ Contin. 0.0075
#> race_black Binary 0.0535
#> race_hispan Binary -0.0420
#> race_white Binary -0.0115
#> married Binary -0.1160
#> re74 Contin. -0.3200
#>
#> Sample sizes by subclass
#> 1 2 3 4 5 6 All
#> Control 102 100 88 72 39 28 429
#> Treated 4 4 9 30 62 76 185
#> Total 106 104 97 102 101 104 614
```

This works for mean differences but not other statistics. So the way
`cobalt`

actually does this is compute stratification
weights, and then compute the balance statistics using the
stratification weights in the full sample. Stratification weights are
first computed by computing the proportion of treated units in each
sample, and then using the formulas to compute propensity score weights
from propensity scores. Here’s how I do that manually for
`age`

:

```
# Compute proportion of treated units in each subclass
prop1 <- sapply(1:6, function(s) mean(lalonde$treat[msub$subclass == s]))
# Assign to each unit
ps <- prop1[msub$subclass]
# Compute ATE weights
w <- ifelse(lalonde$treat == 1, 1 / ps, 1 / (1 - ps))
# Compute weighted KS statistic
col_w_ks(lalonde$age, treat = lalonde$treat,
weights = w)
#> [1] 0.1658923
bal.tab(msub, stats = "ks")
#> Balance measures across subclasses
#> Type KS.Adj
#> distance Distance 0.2187
#> age Contin. 0.1659
#> educ Contin. 0.0627
#> race_black Binary 0.0535
#> race_hispan Binary 0.0420
#> race_white Binary 0.0115
#> married Binary 0.1160
#> re74 Contin. 0.3038
#>
#> Sample sizes by subclass
#> 1 2 3 4 5 6 All
#> Control 102 100 88 72 39 28 429
#> Treated 4 4 9 30 62 76 185
#> Total 106 104 97 102 101 104 614
```

`cobalt`

as I do when using `tableone`

?`tableone`

is another package that provides tools for
balance assessment. One strength that the package has is its beautiful,
publication-ready tables that include summary statistics for the
covariates, clean variable names, and clean headings. But it does not
incorporate best practices in balance assessment in favor of
transparency. This differs from the ethos of `cobalt`

, which
is to provide highly customizable balance statistics that reflect best
practices and use well-reasoned decisions. This is not an insult to
`tableone`

but is meant to reflect the different purposes
`cobalt`

and `tableone`

have. They should not be
used interchangeably or expect to yield identical results because they
use different formulas for computing certain statistics, most notably
the SMD.

Below are some of the reasons why SMDs might differ between
`tableone`

and `cobalt`

:

`tableone`

always uses the pooled standard deviation (i.e., the standardizaton factor setting`s.d.denom = "pooled"`

) as the standardization factor, while`cobalt`

determines the standardization factor based on the estimand (though by default or when the ATE is the estimand, the two should be aligned).`tableone`

uses the weighted standardization factor in the SMD, whereas`cobalt`

always uses the standardization factor computed in the unadjusted sample. For matching, this means`tableone`

computes the standardization factor in the matched sample, while`cobalt`

uses the original sample.`tableone`

uses`survey::svyvar()`

to compute weighted variances, whereas`cobalt`

uses the formula described previously (and implemented in`col_w_sd()`

). These values will differ by small amounts when the weights are not constant.- For multi-category covariates,
`tableone`

uses a single statistic described by Yang and Dalton (2012) to summarize balance, whereas`cobalt`

provides a balance statistic for each level of the covariate. There is no reason to prefer the statistic used by`tableone`

; it does not have any relationship to the bias of the estimate and can mask large differences in some categories when there are many categories. See here for a more detailed answer.

In practice, these differences will be small. Obviously, I recommend
using `cobalt`

instead for balance assessment, and I
recommend reporting the balance statistics `cobalt`

produces.
That said, if you understand what `tableone`

is doing and are
okay with the choices it makes, there is no denying that it can produce
beautiful tables.

Stuart, Elizabeth A. 2008. “Developing Practical Recommendations
for the Use of Propensity Scores: Discussion of ‘A
Critical Appraisal of Propensity Score Matching in the Medical
Literature Between 1996 and 2003’ by Peter Austin,
Statistics in Medicine.” *Statistics
in Medicine* 27 (12): 2062–65. https://doi.org/10.1002/sim.3207.

Yang, Dongsheng, and Jarrod E Dalton. 2012. “A Unified Approach to
Measuring the Effect Size Between Two Groups Using SAS®,” 6.

For multi-category treatments, all standardization factors are computed using the full data, not just the groups being compared. For example, the pooled standard deviation involves computing the mean of all the group-specific variances, not just the two being compared. Similarly, in the

`"hedges"`

formula, \(n-2\) is replaced with \(n-k\), where \(k\) is the number of treatment groups.↩︎