The package provides a tool to statistically assess presence of trajectory in data. The function, test.trajectory()
, implements the tree dimension test (TDT) (Tenha and Song 2022). It takes as input a matrix with rows as observations and columns as features. The output from the function is a list containing the tree dimension test measure (TDT), tree dimension test effect, \(S\) statistic and the p.value for TDT.
The example below illustrates the application of TDT to test presence of trajectory in simulated singlecell RNAseq gene expression data that has a trajectory. The dataset is stored in matrix input
(Cannoodt et al. 2018), where rows are cells and columns are genes. TDT is able to recognize the presence of trajectory as depicted by the significant TDT \(p\)value.
require(TreeDimensionTest)
=system.file('extdata', 'bifurcating_7.rds', package = 'TreeDimensionTest')
data_path
= readRDS(data_path) input
Next we run the test.trajectory function on matrix input
, with the other parameters set to default.
= test.trajectory(input) res
The test returns a list containing tree dimension measure (tdt_measure
), effect (tdt_effect
), test statistic, \(p\)value, number vertices that are leaves, and tree diameter. Here, the \(p\)value is significant and tdt_effect
is strong, depicting presence of trajectory.
Test results contained in res 


tdt_measure  2.6666667 
statistic  74.8848818 
tdt_effect  0.7488488 
leaves  26.0000000 
diameter  41.0000000 
p.value  0.0046336 
original_dimension  7848.0000000 
pca_components  3.0000000 
We visualize the MST and the scatterplot using principal components.
plot(res, node.size=12, node.col="mediumseagreen",
main = paste0("Trajectory presence\npvalue = ",
format.pval(res$p.value, digit=2)))
=prcomp(input)
pcplot(pc$x[,c(1:2)], xlab="PC1", ylab="PC2", cex=1,
sub="Principal components", col="mediumseagreen",
main = paste0("Trajectory presence\npvalue = ",
format.pval(res$p.value, digit=2),
"\n(n = ", nrow(input), ")")
)
Here we demonstrate using test.trajectory on the input
matrix with modified parameter values. dim.reduction="pca"
means dimensionality reduction will be performed first using principal component analysis. The number of principal components is selected using the Scree test. One can set dim.reduction="none"
if dimensionality reduction is unnecessary. MST="exact"
; the exact Euclidean MST (EMST) is used. The alternative is the approximate and fast dualtree Boruvka algorithm by setting MST="boruvka"
to obtain an approximate EMST.
= test.trajectory(input, dim.reduction = "pca", MST = "exact") res
Test results contained in res 


tdt_measure  2.6666667 
statistic  74.8848818 
tdt_effect  0.7488488 
leaves  26.0000000 
diameter  41.0000000 
p.value  0.0046336 
original_dimension  7848.0000000 
pca_components  3.0000000 
It can be seen from the results that the exact MST and boruvka MST are equivalent for small sample size.
We simulated spherical bivariate normal data, which are isotropic and contain no trajectory. The \(p\)value for the test is insignificant.
= 100
n = cbind(rnorm(n), rnorm(n))
mat = test.trajectory(mat, dim.reduction = "none") res
Test statistics and \(p\)value are contained in res
:
Test results contained in res 


tdt_measure  3.4242424 
statistic  68.4820747 
tdt_effect  0.6848207 
leaves  27.0000000 
diameter  32.0000000 
p.value  0.8482481 
We visualize the data which show no sign of trajectory presence.
plot(res, node.size=12,
main = paste0("Trajectory presence\npvalue = ",
format.pval(res$p.value, digit=2)))
plot(mat, col="orange", pch=2, cex=0.5, xlab="Dim 1", ylab="Dim 2",
main=paste0("Trajectory presence\npvalue = ",
format.pval(res$p.value, digit=2),
"\n(n = ", n, ")")
)
TDT statistics can be calculated relatively quickly to learn if there is any effect in the data before launching the more time consuming step of calculating the \(p\)value.
Function compute.stats()
computes tree dimension measure, tree dimension effect, number of leafs and diameter of EMST for a given input data without calculating the \(p\)value.
= compute.stats(mat, MST="boruvka", dim.reduction = "none") res
Results contained in res 


tdt_measure  3.4242424 
tdt_effect  0.6848207 
leaves  27.0000000 
diameter  32.0000000 
Function separability()
computes heterogeneity of observations of the same type in a given data (Tenha and Song 2022). Observations of the same type have the same label. The function takes a matrix as input with rows as observations and columns as features. The function also takes a vector of labels for the observations. The function returns separability values for each label type and the overall separability value. The separability values range from 0 to 1, with 1 being the highest separability.
In the examples below, there are three types of observations labeled L1, L2 and L3. An instance of real application is in singlecell data, where the labels could be cell types.
#Random data
= cbind(rnorm(200), rnorm(200))
mat
#Labels for the samples in the data
= c(rep("L1", 93), rep("L2",78), rep("L3",29))
labels
#Color vector of samples, each unique color correspods with unique label
= c(rep("orange", 93), rep("mediumseagreen", 78), rep("purple", 29))
cols
#Compute separability of samples in mat
= separability(mat, labels) res
The result is a list containing separability values for each label and, overall separability on the data. The overall separability is relatively low, implying samples with same labels are mixed.
::kable(res$label_separability, col.names = "Label separability") knitr
Label separability  

L1  0.5535714 
L2  0.4698795 
L3  0.2710280 
#Plots an MST of the data, with samples of the same label highlighted by same color
# plotTree(mat, labels, node.size = 12, node.col = cols,
# main = paste("Low seperability", format(res$overall_separability, digits = 3)),
# legend.cord=c(2.1,0.9)
# )
plot(res,node.size = 12, node.col = cols,
main = paste("Low seperability", format(res$overall_separability, digits = 3)), legend.cord=c(2.1,0.9))
An example where samples with the same label are close together.
= rbind(
mat cbind(rnorm(93,mean=20), rnorm(93, mean=20)),
cbind(rnorm(78,mean=5), rnorm(78,mean=5)),
cbind(rnorm(29, mean=50), rnorm(29, mean=50)))
= c(rep("L1", 93), rep("L2",78), rep("L3",29))
labels = separability(mat, labels) res
The overall separability is 1, implying samples of different labels are perfectly separated.
::kable(res$label_separability, col.names = " Label separability") knitr
Label separability  

L1  1 
L2  1 
L3  1 
#Color vector of samples corresponding to labels
= c(rep("orange", 93), rep("mediumseagreen", 78), rep("purple", 29))
cols
# plotTree(
# mat,labels, node.size=12, node.col = cols,
# main = paste("High seperability", format(res$overall_separability, digits = 3)),
# legend.cord=c(1.9,0.9))
plot(res,node.size = 12, node.col = cols,
main = paste("High seperability", format(res$overall_separability, digits = 3)), legend.cord=c(2.1,0.9))
We now illustrate the use of separability to compute tissue specificity for calcium signaling and ribosome pathways on developing mouse data (CardosoMoreira et al. 2019) with samples of different tissue types.
#Loading calcium signaling pathway data from Mouse
#This is Mouse development RNAseq data spanned by the geneset fo calcium signaling pathway
#Rows are genes and columns are samples.
=system.file('extdata', 'calcium_pathway_data.rdata', package = 'TreeDimensionTest')
file.rdataload(file=file.rdata)
#loading color vector of samples by label type; mouse_cols
=system.file('extdata', 'mouse_cols.rdata', package = 'TreeDimensionTest')
file.rdataload(file=file.rdata)
#Labels of the samples are the column names of the data, which are names of tissue types.
= colnames(calcium_pathway_data)
labels
= separability(t(calcium_pathway_data), labels)
res
#Separabiltiy for each tissue type as well as the overall separability. High separability depicts high tissue specificity.
# plotTree(
# t(calcium_pathway_data), labels, node.col=mouse_cols,node.size=12,
# main = paste("Calcium signaling pathway\nTissue specificity",
# format(res$overall_separability, digits = 3)),
# legend.cord=c(1.9,1.3))
plot(res,node.size = 12, node.col=mouse_cols,
main = paste("Calcium signaling pathway\nTissue specificity",
format(res$overall_separability, digits = 3)), legend.cord=c(1.9,1.3))
Calcium signaling pathway has high tissue specificity as depicted by the high separability value. Tissues of the same type are closer together as shown in the plot.
::kable(res$label_separability, col.names = "Calcium signaling pathway tissue specificity") knitr
Calcium signaling pathway tissue specificity  

Brain  0.7971014 
Cerebellum  1.0000000 
Heart  1.0000000 
Kidney  1.0000000 
Liver  0.9047619 
Ovary  0.9032258 
Testis  0.8709677 
#Loading ribosome pathway data from Mouse
=system.file('extdata', 'ribosome_pathway_data.rdata', package = 'TreeDimensionTest')
file.rdataload(file=file.rdata)
#loading color vector of samples by label type; mouse_cols
=system.file('extdata', 'mouse_cols.rdata', package = 'TreeDimensionTest')
file.rdataload(file=file.rdata)
# ribsome_pathway_data is RNAseq data with rows as genes and columns as samples
= colnames(ribosome_pathway_data)
labels = separability(t(ribosome_pathway_data), labels)
res plot(
node.col= mouse_cols,
res, node.size=12,
main = paste("Ribosome pathway\nTissue specificity",
format(res$overall_separability, digits = 3)),
legend.cord=c(1.9,1.3))
The ribosome pathway has relatively lower tissue specificity as depicted by the lower separability value. Tissues of the same type are mixed as shown in the plot.
::kable(res$label_separability, col.names = "Ribosome pathway tissue specificity") knitr
Ribosome pathway tissue specificity  

Brain  0.5140187 
Cerebellum  0.7818182 
Heart  0.6179775 
Kidney  0.6219512 
Liver  0.5428571 
Ovary  0.4057971 
Testis  0.3913043 