library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.2.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Thank you for using Cleanet, an unsupervised method for doublet identification and classification. This tutorial will walk you through a standard Cleanet workflow for mass cytometry data. Most of the steps would be the same for flow cytometry data, with the exception of debris depletion, which is more straightforward in that case, because of the scattering channels.
We start by loading some example data. These 10,000 events come from the whole blood of a healthy donor, profiled by CyTOF at the University of Pennsylvania, using the 30-parameter MDIPA panel. There are also two DNA intercalator channels and one channel providing cell type annotations. Channels have been renamed for convenience.
path <- system.file("extdata", "df_mdipa.csv", package="Cleanet")
df_mdipa <- read_csv(path, col_types=cols())
print(df_mdipa)
#> # A tibble: 10,000 × 33
#> CD45 CD196 CD123 CD19 CD4 CD8a CD11c CD16 CD45RO CD45RA CD161 CD194
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 74.9 14.4 4.50 1.99 1.36 11.0 16.8 700. 20.0 2.94 2.23 19.7
#> 2 94.1 25.4 2.19 3.29 27.6 3.88 15.0 878. 86.5 13.2 5.47 10.3
#> 3 60.7 24.0 2.73 1.93 9.22 9.34 18.2 661. 29.5 12.0 4.05 27.5
#> 4 648. 0 1.02 12.2 22.5 29.0 37.7 187. 7.67 284. 41.0 5.85
#> 5 30.7 13.3 0.191 6.01 7.94 21.3 16.0 492. 15.8 3.74 2.12 6.78
#> 6 135. 26.3 7.19 3.73 7.24 10.7 38.5 628. 77.6 4.18 0 25.5
#> 7 46.9 28.1 5.01 0 2.51 3.81 23.9 774. 36.6 0.272 5.15 21.6
#> 8 1508. 17.6 2.75 5.57 7.00 542. 28.8 157. 120. 226. 68.7 85.8
#> 9 99.2 31.5 7.65 3.54 5.29 15.6 28.4 753. 26.0 1.40 10.4 14.8
#> 10 64.7 45.0 7.31 5.79 7.88 25.4 21.8 760. 37.5 3.87 0 11.4
#> # ℹ 9,990 more rows
#> # ℹ 21 more variables: CD25 <dbl>, CD27 <dbl>, CD57 <dbl>, CD183 <dbl>,
#> # CD185 <dbl>, CD28 <dbl>, CD38 <dbl>, CD56 <dbl>, TCRgd <dbl>, CD294 <dbl>,
#> # CD197 <dbl>, CD14 <dbl>, CD3 <dbl>, CD20 <dbl>, CD66b <dbl>,
#> # `HLA-DR` <dbl>, IgD <dbl>, CD127 <dbl>, DNA1 <dbl>, DNA2 <dbl>, label <chr>
The minimal information necessary to run Cleanet is a data frame and a set of columns to use for doublet detection.
cols <- c("CD45", "CD123", "CD19", "CD11c", "CD16",
"CD56", "CD294", "CD14", "CD3", "CD20",
"CD66b", "CD38", "HLA-DR", "CD45RA",
"DNA1", "DNA2")
cleanet_res <- cleanet(df_mdipa, cols, cofactor=5)
#> Starting Cleanet...
#> Doublet simulation using all 10000 events...
#> Done! Found 700 doublets, 7% of total.
The output is a list containing, alongside other model information, a binary array specifying predictions for all events.
The sensitivity value is the fraction of simulated doublets that Cleanet correctly classifies as doublets. It is an internal measure of model confidence.
We can verify the predictions on a DNA/CD45 bivariate plot: events predicted to be doublets have higher values for DNA1, as expected.
ggplot(df_mdipa, aes(x=asinh(DNA1/5), y=asinh(CD45/5), color=cleanet_res$status)) +
geom_point(size=0.2) +
scale_color_discrete(name="Status") +
theme_bw()
If there is a lot of debris in the sample, the simulation performed by Cleanet can fail: a singlet plus a debris event fail to add up to a doublet. In our file, debris is only 10% of all events, which is acceptable. In general, results can be improved by flagging (some of) the debris events, so that Cleanet knows to exclude them in the simulation. There is a helper function designed for this, which gives visual feedback.
There is a scalar parameter with values between 0 and 1 that can be tuned to flag fewer or more events as debris. The default value of 0.3 works well for MDIPA, but a different one may be appropriate for your panel.
Including the information about debris in the input can help Cleanet make better predictions. The impact is greater for samples that contain large proportions of debris.
cleanet_res <- cleanet(df_mdipa, cols, cofactor=5, is_debris=is_debris)
#> Starting Cleanet...
#> Doublet simulation using 8566 events, 85.7% of total...
#> Done! Found 548 doublets, 5.5% of total.
ggplot(df_mdipa, aes(x=asinh(DNA1/5), y=asinh(CD45/5), color=cleanet_res$status)) +
geom_point(size=0.2) +
scale_color_discrete(name="Status") +
theme_bw()
The CD294 distribution in the sample shows two groups of CD294 positive cells: basophils to the left, eosinophils to the right. In particular, eosinophils have DNA1 values that are similar to those of doublets, making them challenging to distinguish from doublets in bivariate gating. As a multivariate method, Cleanet has no trouble classifying them as singlets.
ggplot(df_mdipa, aes(x=asinh(DNA1/5), y=asinh(CD45/5), color=asinh(CD294/5))) +
geom_point(size=0.2) +
scale_color_gradient(low="black", high="red") +
theme_bw()
Now assume that you have isolated the singlets and classified them,
using your favorite manual or automated method. For our example data,
this information is stored in the label
column. Here is the
breakdown among cell types.
print(table(df_mdipa$label))
#>
#> B cell Basophil Debris Eosinophil Monocyte NK cell Neutrophil
#> 291 34 1125 117 389 511 5512
#> T cell
#> 2021
Cleanet can use this information to extend the classification of
singlets into a classification of doublets (and higher multiplets). The
label information for singlets only must be extracted and passed to the
classify_doublets
function. We will tabulate the output by
doublet type.
singlet_clas <- df_mdipa$label[which(cleanet_res$status!="Doublet")]
doublet_clas <- classify_doublets(cleanet_res, singlet_clas)
sort(table(doublet_clas))
#> doublet_clas
#> B cell_Monocyte Eosinophil_NK cell
#> 1 1
#> Monocyte_Monocyte Monocyte_NK cell
#> 1 1
#> NK cell_NK cell Neutrophil_T cell_T cell
#> 1 1
#> Eosinophil_T cell Monocyte_Neutrophil_Neutrophil
#> 2 2
#> Monocyte_Neutrophil_T cell NK cell_Neutrophil_T cell
#> 2 2
#> B cell_NK cell B cell_Neutrophil_Neutrophil
#> 3 3
#> B cell_T cell Basophil_Neutrophil
#> 3 4
#> Eosinophil_Neutrophil NK cell_T cell
#> 6 7
#> Monocyte_T cell Neutrophil_Neutrophil_Neutrophil
#> 8 9
#> Neutrophil_Neutrophil_T cell B cell_Neutrophil
#> 11 23
#> T cell_T cell Monocyte_Neutrophil
#> 24 30
#> NK cell_Neutrophil Neutrophil_T cell
#> 40 132
#> Neutrophil_Neutrophil
#> 231
Neutrophils account for ~50% of singlets in whole blood.
Unsurprisingly, then, they are also involved in most of the multiplets.
To generalize this intuition, we can compute expected proportions of
doublet types, by multiplying the frequencies of the components. The
function compare_doublets_exp_obs
does this and returns the
proportions of expected and observed doublets as a data frame.
df_exp_obs <- compare_doublets_exp_obs(doublet_clas, singlet_clas, cleanet_res)
#> Joining with `by = join_by(Type)`
arrange(df_exp_obs, -Expected)
#> # A tibble: 43 × 3
#> Type Expected Observed
#> <chr> <dbl> <dbl>
#> 1 Neutrophil_Neutrophil 0.378 0.422
#> 2 Neutrophil_T cell 0.274 0.241
#> 3 NK cell_Neutrophil 0.0719 0.0730
#> 4 Monocyte_Neutrophil 0.0548 0.0547
#> 5 T cell_T cell 0.0497 0.0438
#> 6 B cell_Neutrophil 0.0396 0.0420
#> 7 NK cell_T cell 0.0261 0.0128
#> 8 Monocyte_T cell 0.0198 0.0146
#> 9 Eosinophil_Neutrophil 0.0166 0.0109
#> 10 B cell_T cell 0.0143 0.00547
#> # ℹ 33 more rows
A quick plot can help us compare expected and observed proportions.
ggplot(df_exp_obs, aes(x=Expected, y=Observed)) +
geom_point() +
geom_abline(slope=1, yintercept=0, linetype="dotted") +
theme_bw()
#> Warning in geom_abline(slope = 1, yintercept = 0, linetype = "dotted"):
#> Ignoring unknown parameters: `yintercept`
In this case, expected and observed proportions match very well. Large deviations from the diagonal could be caused by technical factors, or by biological ones such as interactions between specific cell types.