cytometree
cytometree
cytometree
is a package that implements a binary tree algorithm for the analysis of cytometry data.
Its core algorithm is based on the construction of a binary tree, whose nodes represents cell subpopulations.
At each node, observed cells (or “events”) and markers are modeled by both a normal distribution (so unimodal), and a mixture of 2 normal distributions (so bimodal).
Splitting of the events at each node is done according to a normalized difference of AIC between the two distributional fit (unimodal or bimodal), allowing to pick the marker that best splits those data.
When AIC normalized differences D are not significant anymore, the tree has been constructed, and the cells have been automatically gated (i.e. partitioned).
Given the unsupervised nature of the binary tree, some of the available markers may not be used to find the different cell populations present in a given sample. To recover a complete annotation, we defined, as a post processing procedure, an annotation method which allows the user to distinguish two (or three) expression levels per marker.
cytometree
In this example, we will use a diffuse large B-cell lymphoma dataset (from the FlowCAP-I challenge, with only 3 markers.
First, we need to load the package cytometree
and we can have a look at the data:
## [1] 5524 4
## FL1 FL2 FL4 label
## 1 416 251 293 2
## 2 210 93 184 2
## 3 387 300 233 2
## 4 403 438 288 2
## 5 399 128 116 2
## 6 538 288 189 2
We have 3 markers measured FL1
, FL2
, FL4
as well as the label
obtained from manual gating, and 5524 cells.
# getting the only the cell events with the 3 markers measured:
cellevents <- DLBCL[, c("FL1", "FL2", "FL4")]
# storing the maanual gating reference from FlowCAP-I:
manual_labels <- DLBCL[, "label"]
The function CytomeTree
builds the binary tree according to our algorithm, that gives the automatic gating (given in the resulting labels
attribute):
# Build the binary tree:
Tree <- CytomeTree(cellevents, minleaf = 1, t = 0.1)
# Retreive the resulting partition (i.e. automatic gating):
Tree_Partition <- Tree$labels
One can fiddle with the t
threshold to change the desired depth of the tree.
The function plot_graph
plots a graph representing this binary tree, showing which markers were used in which order to splits the events:
# Plot a graph of the tree (with specific graphical parameters):
plot_graph(Tree, edge.arrow.size = 0.3, Vcex = 0.8, vertex.size = 30)
The function plot_nodes
allows to graphically evaluate the fit of the gaussian family distribution at each node. It can also be used to investigate a particular node:
The function Annotation
annotates the subpopulations partitioned by the binary tree:
## FL1 FL2 FL4 leaves count prop
## 1 0 1 0 1 4905 0.8879
## 2 0 0 1 2 590 0.1068
## 3 0 1 1 3 25 0.0045
## 4 1 1 1 4 4 0.0007
The post-processing annotation can be compared to the incomplete annotation obtained from the tree alone:
## FL1 FL2 FL4 labels count prop
## 1 NA NA 0 1 4905 0.8879
## 2 NA 0 1 2 590 0.1068
## 3 0 1 1 3 25 0.0045
## 4 1 1 1 4 4 0.0007
This completed annotation eases the search for specific cell types of interest, where 1
represents a positive measurement of a marker, while 0
corresponds to its absence:
# seeked cell type: FL2+FL4+
pheno_ex1 <- RetrievePops(Annot, phenotypes = list(rbind(c("FL2", 1), c("FL4", 1))))
pheno_ex1$phenotypesinfo
## [[1]]
## [[1]]$phenotype
## [1] "FL2-1" "FL4-1"
##
## [[1]]$proportion
## [1] 0.0052
##
## [[1]]$Mergedlabels
## [1] 3 4
##
## [[1]]$Newlabel
## [1] 5
# One can look for several cell types at once:
phenotypes <- list()
# FL2+FL4-
phenotypes[[1]] <- rbind(c("FL2", 1), c("FL4", 0))
# FL2-FL4+
phenotypes[[2]] <- rbind(c("FL2", 0), c("FL4", 1))
# Retreive corresponding cell populations:
PhenoInfos <- RetrievePops(Annot, phenotypes)
PhenoInfos$phenotypesinfo
## [[1]]
## [[1]]$phenotype
## [1] "FL2-1" "FL4-0"
##
## [[1]]$proportion
## [1] 0.8879
##
## [[1]]$label
## [1] 1
##
##
## [[2]]
## [[2]]$phenotype
## [1] "FL2-0" "FL4-1"
##
## [[2]]$proportion
## [1] 0.1068
##
## [[2]]$label
## [1] 2
Finally, we can compare the automatic gating obtained using cytometree
to the original manual gating (which had 47 events inconsistently gated across different operators performing the manual gating on those very same data - so those 47 cells were identified as outliers in the reference label and were kept out of the evaluation criteria in the FlowCAP-I challenge):
In this example, we will use a dataset from the HIPC (Human Immunology Project Consortium program where a PBMC sample from patient 12828 was analyzed by Stanford.
First, we need to load the package cytometree
and we can have a look at the data:
## [1] 33992 7
## CCR7 CD4 CD45RA HLADR CD38 CD8 label
## [1,] 1561.2429 1052.6527 3169.944 889.99200 1326.2701 2733.810 1
## [2,] 1362.2820 978.1219 2826.469 1810.43335 1294.4667 3054.555 1
## [3,] 1321.3787 1180.9497 2780.128 -57.02759 211.0515 2450.657 1
## [4,] 779.8231 1097.8448 3054.914 -67.70583 579.6187 1898.014 1
## [5,] 956.2596 942.8704 2612.188 60.35059 1776.3430 3262.799 1
## [6,] 1100.1621 1080.4724 2560.610 702.39624 1179.5389 2836.238 1
We have 6 markers measured CCR7
, CD4
, CD45RA
, HLADR
, CD38
, CD8
as well as the label
obtained from manual gating, and 33992 cell.
# getting the only the cell events with the 3 markers measured:
cellevents <- HIPC[, c("CCR7", "CD4", "CD45RA", "HLADR", "CD38", "CD8")]
# storing the maanual gating reference from FlowCAP-I:
manual_labels <- HIPC[, "label"]
The function CytomeTree
builds the binary tree according to our algorithm, that gives the automatic gating (given in the resulting labels
attribute):
# Build the binary tree:
Tree <- CytomeTree(cellevents, minleaf = 1, t = 0.2)
# Retreive the resulting partition (i.e. automatic gating):
Tree_Partition <- Tree$labels
One can fiddle with the t
threshold to change the desired depth of the tree.
The function plot_graph
plots a graph representing this binary tree, showing which markers were used in which order to splits the events:
# Plot a graph of the tree (with specific graphical parameters):
plot_graph(Tree, edge.arrow.size = 0.4, Vcex = 0.45)
The function plot_nodes
allows to graphically evaluate the fit of the gaussian family distribution at each node. It can also be used to investigate a particular node:
The function Annotation
annotates the subpopulations partitioned by the binary tree:
## CCR7 CD4 CD45RA HLADR CD38 CD8 leaves count prop
## 5 1 1 1 0 1 0 5 12124 0.3567
## 4 1 1 0 0 0 0 4 8218 0.2418
## 3 1 0 1 0 1 1 3 4588 0.1350
## 8 0 1 0 0 0 0 8 4071 0.1198
## 1 0 0 0 0 0 1 1 3138 0.0923
## 2 0 0 1 0 0 1 2 830 0.0244
## 9 0 1 0 0 1 0 9 475 0.0140
## 6 0 0 0 1 1 1 6 193 0.0057
## 10 0 1 0 1 1 0 10 142 0.0042
## 7 1 0 0 1 1 1 7 102 0.0030
## 11 0 1 0 1 0 0 11 83 0.0024
## 12 0 1 1 1 0 0 12 28 0.0008
This completed annotation eases the search for specific cell types of interest, where 1
represents a positive measurement of a marker, while 0
corresponds to its absence:
# One can look for several cell types at once:
phenotypes <- list()
# CD8-CD4+CCR7-CD45RA+
CD4effector <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 0), c("CD45RA", 1))
phenotypes[[1]] <- CD4effector
# CD8-CD4+CCR7+CD45RA+
CD4naive <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 1), c("CD45RA", 1))
phenotypes[[2]] <- CD4naive
# CD8-CD4+CCR7+CD45RA-
CD4CM <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 1), c("CD45RA", 0))
phenotypes[[3]] <- CD4CM
# CD8-CD4+CCR7+CD45RA+
CD4effectorM <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 0), c("CD45RA", 0))
phenotypes[[4]] <- CD4effectorM
# Retreive corresponding cell populations:
PhenoInfos <- RetrievePops(Annot, phenotypes)
# One can find informations about the seeked phenotypes in
PhenoInfos$phenotypesinfo
## [[1]]
## [[1]]$phenotype
## [1] "CD8-0" "CD4-1" "CCR7-0" "CD45RA-1"
##
## [[1]]$proportion
## [1] 8e-04
##
## [[1]]$label
## [1] 12
##
##
## [[2]]
## [[2]]$phenotype
## [1] "CD8-0" "CD4-1" "CCR7-1" "CD45RA-1"
##
## [[2]]$proportion
## [1] 0.3567
##
## [[2]]$label
## [1] 5
##
##
## [[3]]
## [[3]]$phenotype
## [1] "CD8-0" "CD4-1" "CCR7-1" "CD45RA-0"
##
## [[3]]$proportion
## [1] 0.2418
##
## [[3]]$label
## [1] 4
##
##
## [[4]]
## [[4]]$phenotype
## [1] "CD8-0" "CD4-1" "CCR7-0" "CD45RA-0"
##
## [[4]]$proportion
## [1] 0.1404
##
## [[4]]$Mergedlabels
## [1] 8 9 10 11
##
## [[4]]$Newlabel
## [1] 13
# According to PhenoInfos$phenotypesinfo, CD8-CD4+CCR7-CD45RA+ population is
# labeled 12 According to PhenoInfos$phenotypesinfo, CD8-CD4+CCR7+CD45RA+
# population is labeled 5 According to PhenoInfos$phenotypesinfo,
# CD8-CD4+CCR7+CD45RA- population is labeled 4 According to
# PhenoInfos$phenotypesinfo, CD8-CD4+CCR7+CD45RA+ population is comprised of 4
# clusters labeled 8,9,10, 11. They were merged into a new cluster labeled 13.
# The new partition, with merged clusters is to be found in
# PhenoInfos$Mergedleaves.
auto_labels_merged <- PhenoInfos$Mergedleaves
# Get the indices of the subpopulation comprised of classes labeled 12, 5, 4,13.
subpopulation_indices <- auto_labels_merged %in% c(12, 5, 4, 13)
# Scatter plot the data.
CD45RA <- cellevents[subpopulation_indices, "CD45RA"]
CCR7 <- cellevents[subpopulation_indices, "CCR7"]
auto_lab <- factor(auto_labels_merged[subpopulation_indices])
levels(auto_lab) <- viridis::viridis(length(levels(auto_lab)))
auto_lab <- as.character(auto_lab)
plot(CD45RA, CCR7, col = auto_lab, main = "Automatic gating")
It is possible to force the markers which will be used to gate the data first, using the force_first_markers
argument of the CytomeTree()
function. Once those forced markers are used, automatic behavior of the algorithm is resumed for the remaining markers.