1 Introduction

Biomarkers are widely used in pharmaceutical industry for drug discovery and development at various stages, from preclinical animal study to phase I- III and post market clinical trials, and can be used for target identification, diseased diagnostics, patient stratification, treatment prediction and etc. High-throughput assay technology enables the collection of various types of biomarkers, such as gene expression and various omics biomarkers. In high-throughput assays, a large number of biomarkers are measured in a single experiment, but subject to substantial known or unknown variability. Hence biomarkers from high-throughput assays yield two major characteristics, including high dimensionality of the data and relative large data variability. Due to the two characteristics, it is critical and challenging to visualize the biomarker data and corresponding statistical results to ensure the data quality and the reliability of downstream statistical analysis results. Therefore we developed this R package as a visualization tool for generating the commonly used plots in analyzing biomarkers from high-throughput assays, including data quality control, individual biomarker analysis and multivariate analysis. The tool also included an analysis pipeline for analyzing biomarkers in the setting of two groups comparison with the flexibility to extend to customized or project specific analysis.

The R package statVisual provide novel solutions to the users by utilizing many powerful R base functions and R packages. For example, the function hist in the R package graphics can draw the histogram for a set of observations. However, to visualize histograms for two or more groups of observations in one figure, the users need to write their own code. The R package statVisual provides the function Hist to help the users to obtain such figure in one command. This vignette illustrates the usages of the functions provided by the R package statVisual.

2 System Requirements

R (>=3.5.0) is required to run the R package statVisual properly.

3 Load Packages

The following R packages are required to be installed before installing and running statVisual:

# packages in Bioconductor
library(Biobase)    # base package for Bioconductor
library(limma)      # linear models for continuous omics data
library(pvca)       # principal variance component analysis

# packages in CRAN
library(dplyr)      # data manipulation and pipe operation
library(factoextra) # extract and visualize results of multivariate data analysis
library(forestplot) # forest plot
library(gbm)        # generalized boosted regression models
library(GGally)     # extension to 'ggplot2'
library(ggdendro)   # dendrogram for data clustering
library(ggfortify)  # data visualization tools for statistical analysis results
library(ggplot2)    # create graphics based on "The Grammer of Graphics"
library(ggrepel)    # tidy text display in ggplot
library(glmnet)     # cross validation plot for glmnet
library(grDevices)  # R graphics devices and support for colors and fonts
library(gridExtra)  # Grid graphics
library(knitr)      # dynamic report generation
library(methods)    # formal methods and classes
library(pROC)       # display and analyze ROC curves
library(randomForest) # Random forest variable importance
library(reshape2)   # flexibly reshape data
library(rmarkdown)  # dynamic documents for R
library(rpart.plot) # plots for recursive partitioning for classification, regression and survival trees
library(tibble)     # simple data frames
library(stats)      # basic statistical functions

To load statVisual package, please type the following R statement:

library(statVisual)

To check the information about the statVisual package, please type the following R statement:

library(help = statVisual)

To find the usage of a function (e.g., Hist) in statVisual, please use the help function or use ?. For example,

help(Hist)
?Hist

4 Available functions and datasets

4.1 Available functions

Below is a list of currently available functions in statVisual for plotting.

  • Analysis focusing on one outcome variable:
    • Hist - Compare groups based on histograms
    • Den - Compare groups based on density plots
  • Analysis focusing on two outcome variables:
    • XYscatter - Compare groups based on scatter plots
    • stackedBarPlot - Compare groups based on bar plots
    • BiAxisErrBar - Compare groups based on bi-axis error bar plots
  • Analysis of longitudinal data:
    • LinePlot - Compare groups based on trajectory plots (commonly used to visualize tumor volume data)
    • Box - Compare groups based on boxplots across time
    • ErrBar - Compare groups based on dotplots across time
    • barPlot - Compare groups based on barplots across time
  • Analysis focusing on pattern detection:
    • Dendro - Compare groups based on dendrogram of hierarchical clustering
    • iprcomp - Calculate principal components (missing values allowed)
    • PCA_score - Scatter plot of the 2 specified PCs
    • Heat - Heatmap with row names colored by groups
    • PVCA - PVCA plot
    • Volcano - Volcano Plot with option to label significant results
  • Analysis focusing on prediction:
    • BoxROC - Compare boxplots with ROC curve
    • cv_glmnet_plot - Cross validation plot of glmnet
    • ImpPlot - Plot of variable importance
  • The overall wrapper function:
    • statVisual - the wrapper function

4.2 Available datasets

  • diffCorDat: A simulated dataset for illustrating differential correlations between cases and controls

The simulated data set diffCorDat contains expression levels of 2 gene probes for 50 cases and 50 controls. The expression levels of probe1 are generated from \(N(0, 1)\). The expression levels of probe2 for controls are also generated from \(N(0, 1)\). The expression levels of probe 2 for cases are generated from the formula \[\begin{equation} probe2_{i} = -probe1_{i} + e_i, i=1, \ldots, nCases, \end{equation}\] where \(e_i\sim N(0, 0.3^2)\).

That is, the expression levels of probe 1 and probe 2 are negatively correlated in cases, but not correlated in controls.

To load diffCorDat, we can use the following R code:

data(diffCorDat)

print(dim(diffCorDat))
#> [1] 100   3
print(diffCorDat[1:2,])
#>      probe1     probe2   grp
#> 1 0.1567038  0.2438042 cases
#> 2 1.3738112 -1.5249113 cases
  • esSim: A simulated gene expression dataset for differential expression analysis

The dataset esSim was generated based on the R code in the manual of the function of the R Bioconductor package . There are 100 probes and 20 samples (10 controls and 10 cases). The first 3 probes are over-expressed in cases. The 4-th and 5-th probes are under-expressed in cases. The remaining 95 probes are non-differentially expressed between cases and controls. Expression levels for 100 probes were first generated from normal distribution with mean 0 and standard deviation varying between probes (\(sd=0.3\sqrt{4/\chi^2_4}\)). For the 3 OE probes, we add 2 to the expression levels of the 10 cases. For the 2 UE probes, we subtract 2 from the expression levels of the 10 cases.

To load esSim, we can use the following R code:

data(esSim)

print(dim(esSim))
#> Features  Samples 
#>      100       20
print(esSim)
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 100 features, 20 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: subj1 subj2 ... subj20 (20 total)
#>   varLabels: sid grp
#>   varMetadata: labelDescription
#> featureData
#>   featureNames: probe1 probe2 ... probe100 (100 total)
#>   fvarLabels: probeid gene chr
#>   fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
  • genoSim: A simulated genotype dataset

    genoSim is an ExpressionSet object containing genotype data of 10 SNPs for 100 subjects (50 cases and 50 controls). Eight of SNPs have same minor allele frequency (\(MAF=0.2\)) between cases and controls. The other 2 SNPs have the different MAFs between cases and controls (\(MAF_{cases}=0.4\) and \(MAF_{controls}=0.2\)).

    We assume Hardy-Weinberg Equilibrium. That is, the genotype for wild-type homozygote is \((1-MAF)^2\); the genotype for heterozygote is \(2*MAF*(1-MAF)\); and the genotype for mutant homozygote is \(MAF^2\).

The phenotype of the ExpressionSet object contains two variables: subject id (\(sid\)) and case-control status (\(grp\)).

The feature data contains two variables: snp id (\(snp\)) and SNP significance status (\(memSNPs\)).

  • longDat: A simulated dataset for longitudinal data analysis

The dataset longDat is generated from the following mixed effects model for repeated measures: \[\begin{equation} y_{ij}=\beta_{0i}+\beta_1 t_{j} + \beta_2 grp_{2i} + \beta_3 grp_{3i} + \beta_4 \times\left(t_{j}\times grp_{2i}\right) + \beta_5 \times\left(t_{j}\times grp_{3i}\right) +\epsilon_{ij}, \end{equation}\] where \(y_{ij}\) is the outcome value for the \(i\)-th subject measured at \(j\)-th time point \(t_{j}\), \(grp_{2i}\) is a dummy variable indicating if the \(i\)-th subject is from group 2, \(grp_{3i}\) is a dummy variable indicating if the \(i\)-th subject is from group 3, \(\beta_{0i}\sim N\left(\beta_0, \sigma_b^2\right)\), \(\epsilon_{ij}\sim N\left(0, \sigma_e^2\right)\), \(i=1,\ldots, n, j=1, \ldots, m\), \(n\) is the number of subjects, and \(m\) is the number of time points.

When \(t_j=0\), the expected outcome value is \[\begin{equation} E\left(y_{ij}\right)=\beta_0+\beta_2 dose_{2i} + \beta_3 dose_{3i}. \end{equation}\] Hence, we have at baseline \[\begin{equation} E\left(y_{ij}\right)=\beta_0,\; \mbox{for dose 1 group},\\ E\left(y_{ij}\right)=\beta_0 + \beta_2,\; \mbox{for dose 2 group},\\ E\left(y_{ij}\right)=\beta_0 + \beta_3,\; \mbox{for dose 3 group}. \end{equation}\]

For dose 1 group, the expected outcome values across time is \[\begin{equation} E\left(y_{ij}\right)=\beta_0+\beta_1 t_{j}. \end{equation}\]

We also can get the expected difference of outcome values between dose 2 group and dose 1 group, between dose 3 group and dose 1 group, and between dose 3 group and dose 2 group: \[\begin{equation} E\left(y_{ij} - y_{i'j}\right) =\beta_2+\beta_4 t_{j},\;\mbox{for subject $i$ in dose 2 group and subject $i'$ in dose 1 group}, \end{equation}\]

\[\begin{equation} E\left(y_{kj} - y_{i'j}\right) =\beta_3+\beta_5 t_{j},\;\mbox{for subject $k$ in dose 3 group and subject $i'$ in dose 1 group}, \end{equation}\]

\[\begin{equation} E\left(y_{kj} - y_{ij}\right) =\left(\beta_3 -\beta_2\right)+\left(\beta_5-\beta_4\right) t_{j},\;\mbox{for subject $k$ in dose 3 group and subject $i$ in dose 2 group}. \end{equation}\]

We set \(n=90\), \(m=6\), \(\beta_0=5\), \(\beta_1=0\), \(\beta_2=0\), \(\beta_3=0\), \(\beta_4=2\), \(\beta_5=-2\), \(\sigma_e=1\), \(\sigma_b=0.5\), and \(t_{j}=j\), \(j=0, \ldots, m-1\).

That is, the trajectories for dose 1 group are horizontal with mean intercept at \(5\), the trajectories for dose 2 group are linearly increasing with slope \(2\) and mean intercept \(5\), and the trajectories for dose 3 group are linearly decreasing with slope \(-2\) and mean intercept \(5\).

To load longDat, we can use the following R code:

data(longDat)

print(dim(longDat))
#> [1] 540   4
print(longDat[1:2,])
#>   sid  time        y  grp
#> 1   1 time0 4.539033 grp1
#> 2   1 time1 5.738715 grp1

5 Analysis focusing on one outcome variable:

5.1 Compare Groups Based on Histograms

A common task in statistical comparison is to compare the mean values among groups. The reason to comparing the summary statistics (means) is to simplify the problem of comparing two distributions since it is hard to numerically compare two distributions. However, we can easily compare two distributions by visualizing the empirical distributions (e.g., histograms).

To compare histograms across groups, we can use the function Hist:

# expression data
dat = exprs(esSim)
print(dim(dat))
#> [1] 100  20
print(dat[1:2,])
#>           subj1    subj2    subj3    subj4    subj5    subj6    subj7    subj8
#> probe1 1.462929 1.596203 2.252267 1.974616 1.842606 1.677204 2.242665 1.705923
#> probe2 1.574342 2.232537 1.846260 1.980999 2.483777 2.048599 1.786081 1.934728
#>           subj9   subj10     subj11     subj12      subj13    subj14     subj15
#> probe1 1.758028 1.919597 0.07477602  0.3054811  0.12356526 0.3221119 -0.1346480
#> probe2 1.740329 1.942189 0.08308643 -0.1012774 -0.06669481 0.5413783  0.0245575
#>             subj16     subj17    subj18     subj19     subj20
#> probe1 -0.03845011  0.3504460 0.4173905 -0.1438801  0.3335083
#> probe2 -0.03849619 -0.2536194 0.1420066 -0.1072697 -0.1750033

# phenotype data
pDat = pData(esSim)
print(dim(pDat))
#> [1] 20  2
print(pDat[1:2,])
#>         sid grp
#> subj1 subj1   1
#> subj2 subj2   1

# feature data
fDat = fData(esSim)
print(dim(fDat))
#> [1] 100   3
print(fDat[1:2,])
#>        probeid      gene chr
#> probe1  probe1 fakeGene1   1
#> probe2  probe2 fakeGene2   1

# choose the first probe which is over-expressed in cases
pDat$probe1 = dat[1,]

# check histograms of probe 1 expression in cases and controls
pDat$grp=factor(pDat$grp)
print(table(pDat$grp, useNA = "ifany"))
#> 
#>  0  1 
#> 10 10
Hist(
     data = pDat, 
     y = 'probe1', 
     group = 'grp') 

We also can use the wrapper function statVisual:

statVisual(type = 'Hist', 
       data = pDat, 
       y = 'probe1', 
       group = 'grp') 

5.2 Compare Groups Based on Density Plots

We can compare two distribution by comparing the estimated density functions.

The function Den is used to visualize the differences of densities across groups.


Den( 
    data = pDat, 
    y = 'probe1', 
    group = 'grp') 

We can also use the wrapper function statVisual:

statVisual(type = 'Den',
    data = pDat, 
    y = 'probe1', 
    group = 'grp') 

6 Analysis focusing on two outcome variables:

6.1 Compare Groups Based on Scatter Plots

The correlation is an important statistic to evaluate the linear relationship between two continuous variables. To check the linearity and to visualize the strength of the linear relationship, we can draw scatter plot. Some time, it is of interest to compare the correlations among groups. The function XYscatter can help the comparison by display the scatter plots across groups in one figure:

For example, to check if the relationship between Sepal length vs. Sepal width is the same across different species, we can use the R code:

XYscatter( 
  data = diffCorDat, 
  x = 'probe1', 
  y = 'probe2', 
  group = 'grp', 
  title = 'Scatter Plot: probe1 vs probe2')

We can also use the wrapper function statVisual:

statVisual(type = 'XYscatter',
  data = diffCorDat, 
  x = 'probe1', 
  y = 'probe2', 
  group = 'grp', 
  title = 'Scatter Plot: probe1 vs probe2')

6.2 Compare Groups Based on Bar Plots

For two categorical variables, we can use the function stackedBarPlot to show their association.

For example, in the ExpressionSet object genoSim, there are simulated genotypes of 10 SNPs for 50 cases and 50 control. If we would like to know if the pattern of the genotypes of the SNP 1 in cases is the same as that in controls, we can draw bar plots.

data(genoSim)

pDat = pData(genoSim)
geno = exprs(genoSim)

pDat$snp1 = geno[1,]
print(table(pDat$snp1, pDat$grp, useNA="ifany"))
#>    
#>      0  1
#>   0 37 19
#>   1 11 22
#>   2  2  9
stackedBarPlot(dat = pDat, 
           catVar = "snp1", 
           group = "grp", 
               xlab = "snp1", 
           ylab = "Count", 
           group.lab = "grp",
               title = "Stacked barplots of counts for SNP1",
               catVarLevel = NULL)

We can also use the wrapper function statVisual:

statVisual(type = 'stackedBarPlot',
  dat = pDat, 
  catVar = "snp1", 
  group = "grp", 
  xlab = "snp1", 
  ylab = "Count", 
  group.lab = "grp",
  title = "Stacked barplots of counts for SNP1",
  catVarLevel = NULL)

Note that the input parameter catVarLevel can be used to change the order of the values of catVar shown in x-axis. For example,

statVisual(type = 'stackedBarPlot',
  dat = pDat, 
  catVar = "snp1", 
  group = "grp", 
  xlab = "snp1", 
  ylab = "Count", 
  group.lab = "grp",
  title = "Stacked barplots of counts for SNP1",
  catVarLevel = c(2, 0, 1))

6.3 Compare Groups Based on Bi-Axis Error Bar Plots

Some time, we would like to compare two outcomes with different scales across groups in one figure using error bar plots.

The function BiAxisErrBar can do this task. Each bar plot displays mean standard error.


library(tidyverse)
library(ggplot2)


print(head(mtcars))
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

print(table(mtcars$gear, useNA="ifany"))
#> 
#>  3  4  5 
#> 15 12  5

BiAxisErrBar(
  dat = mtcars,
  group = "gear",
  y.left = "mpg",
  y.right = "wt")

We can also use the wrapper function statVisual:

statVisual(type = "BiAxisErrBar",
  dat= mtcars,
  group = "gear",
  y.left = "mpg",
  y.right = "wt")

7 Analysis of Longitudinal Data

7.1 Compare Groups Based on Trajectory Plots

In clinical trial, it is common to collect data at multiple time points. Therefore, it is natural to compare groups based on the trajectories of individual subjects across time. The function LinePlot can do this task:


LinePlot(
  data = longDat,
  x = 'time',
  y = 'y',
  sid = 'sid',
  group = 'grp')

We also can use the wrapper function statVisual:

statVisual(type = "LinePlot",
  data = longDat,
  x = 'time',
  y = 'y',
  sid = 'sid',
  group = 'grp')

7.2 Compare Groups Based on Box Plots Across time

If there are many individuals in a longitudinal dataset, the trajectory plot might look messy. In this case, we develop the function Box to compare groups using boxplots at each time point. In addition, line segments are used to connect the mean/median of each boxplot of the same group across time to show the differences between the mean trajectories. Note that this function is suitable for the scenarios where observations of all subjects are measured at a few fixed time points.

library(dplyr)
Box( 
    data = longDat, 
    x = 'time', 
    y = 'y', 
    group = 'grp',
    title = "Boxplots across time") 

Or we can use the wrapper function statVisual:

statVisual(type = 'Box', 
           data = longDat, 
           x = 'time', 
           y = 'y', 
           group = 'grp',
       title = "Boxplots across time") 

7.3 Compare Groups Based on Dot Plots Across Time

Similarly, we can use dotplots to replace boxplots and add error bar (mean +/- se) for each dotplot. The reason why not add error bar to boxplot in Box function is to avoid confusing between the error bars and bars in boxplots. The function to compare groups based on dot plots across time is ErrBar. Note that this function is suitable for the scenarios where observations of all subjects are measured at a few fixed time points.

ErrBar(
  data = longDat, 
  x = 'time', 
  y = 'y', 
  group = 'grp',
  title = "Dot plots across time") 

We can also use the wrapper function statVisual:

statVisual(type = 'ErrBar', 
  data = longDat, 
  x = 'time', 
  y = 'y', 
  group = 'grp',
  title = "Dot plots across time") 

7.4 Compare Groups Based on Bar Plots Across Time

Similarly, we can use barplots to replace boxplots and add error bar (mean +/- se) for each barplot. The reason why not add error bar to boxplot in Box function is to avoid confusing between the error bars and bars in boxplots. The function to compare groups based on bar plots across time is barPlot. Note that this function is suitable for the scenarios where observations of all subjects are measured at a few fixed time points.

barPlot(
  data = longDat, 
  x = 'time', 
  y = 'y', 
  group = 'grp',
  title = "Bar plots across time") 

We can also use the wrapper function statVisual:

statVisual(type = 'barPlot', 
  data = longDat, 
  x = 'time', 
  y = 'y', 
  group = 'grp',
  title = "Bar plots across time") 

8 Analysis focusing on pattern detection:

8.1 Compare Groups Based on Dendrogram of Hierarchical Clustering

Hierarchical clustering is an exploratory tool to find patterns in data. Dendrogram can be used to visualize the hierarchical clustering results. By coloring the nodes of the dendrogram based on the group information, we can check if the clusters identified by the hierarchical clustering match with the known groups. The wrapper function Dendro can do this task:

library(ggdendro)
data(esSim)
dat = exprs(esSim)
print(dim(dat))
#> [1] 100  20
print(dat[1:2,])
#>           subj1    subj2    subj3    subj4    subj5    subj6    subj7    subj8
#> probe1 1.462929 1.596203 2.252267 1.974616 1.842606 1.677204 2.242665 1.705923
#> probe2 1.574342 2.232537 1.846260 1.980999 2.483777 2.048599 1.786081 1.934728
#>           subj9   subj10     subj11     subj12      subj13    subj14     subj15
#> probe1 1.758028 1.919597 0.07477602  0.3054811  0.12356526 0.3221119 -0.1346480
#> probe2 1.740329 1.942189 0.08308643 -0.1012774 -0.06669481 0.5413783  0.0245575
#>             subj16     subj17    subj18     subj19     subj20
#> probe1 -0.03845011  0.3504460 0.4173905 -0.1438801  0.3335083
#> probe2 -0.03849619 -0.2536194 0.1420066 -0.1072697 -0.1750033

# phenotype data
pDat = pData(esSim)
print(dim(pDat))
#> [1] 20  2
print(pDat[1:2,])
#>         sid grp
#> subj1 subj1   1
#> subj2 subj2   1

# choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe)
pDat$probe1 = dat[1,]
pDat$probe2 = dat[2,]
pDat$probe3 = dat[3,]
pDat$probe4 = dat[4,]
pDat$probe5 = dat[5,]
pDat$probe6 = dat[6,]

print(pDat[1:2, ])
#>         sid grp   probe1   probe2   probe3    probe4    probe5    probe6
#> subj1 subj1   1 1.462929 1.574342 2.015327 -2.137848 -1.992898 1.0472686
#> subj2 subj2   1 1.596203 2.232537 2.141727 -2.639981 -1.926786 0.3904537

# check histograms of probe 1 expression in cases and controls
pDat$grp=factor(pDat$grp)
print(table(pDat$grp, useNA = "ifany"))
#> 
#>  0  1 
#> 10 10

Dendro(
       x = pDat[, c(3:8)], 
       group = pDat$grp)

We also can use the wrapper function statVisual:

statVisual(type = 'Dendro', 
           x = pDat[, c(3:8)], 
           group = pDat$grp)

8.2 Scatter Plot of the First 2 PCs

The scatter plot of 2 specified principal components (PCs), e.g., the first 2 PCs, can also be used to detect patterns (e.g., batch effects) in data. By coloring the data points in the scatter plots based on the group information, we can check if the clusters identified by the PCA plot match with the known groups. The wrapper function PCA_Score can do this task.

Note that he R function prcomp could not handle data with missing values. We developed an improved function iprcomp so that it can handle missing values by replacing the missing values in the dataset by median of the corresponding variable. This is just a temporary solution. The user can use their own imputation method before calling R function prcomp.

We first check if iprcomp could capture the pattern in the original data without missing value.

# generate simulated data
set.seed(1234567)
dat.x = matrix(rnorm(500), nrow = 100, ncol = 5)
dat.y = matrix(rnorm(500, mean = 2), nrow = 100, ncol = 5)
dat = rbind(dat.x, dat.y)
grp = c(rep(0, 100), rep(1, 100))
print(dim(dat))
#> [1] 200   5

res = iprcomp(dat, center = TRUE, scale.  =  FALSE)

# for each row, set one artificial missing value
dat.na=dat
nr=nrow(dat.na)
nc=ncol(dat.na)
for(i in 1:nr)
{
  posi=sample(x=1:nc, size=1)
  dat.na[i,posi]=NA
}

res.na = iprcomp(dat.na, center = TRUE, scale.  =  FALSE)

##
# pca plot
##
par(mfrow = c(3,1))
# original data without missing values
plot(x = res$x[,1], y = res$x[,2], xlab = "PC1", ylab  =  "PC2")
# perturbed data with one NA per probe 
# the pattern of original data is captured
plot(x = res.na$x[,1], y = res.na$x[,2], xlab = "PC1", ylab  =  "PC2", main = "with missing values")
par(mfrow = c(1,1))

It looks like iprcomp captures the original pattern by replacing missing values with meidans of corresponding variables. More thorough investigations are warranted.

We next draw pca plot based on the esSim dataset.

data(esSim)
dat = exprs(esSim)
print(dim(dat))
#> [1] 100  20
print(dat[1:2,])
#>           subj1    subj2    subj3    subj4    subj5    subj6    subj7    subj8
#> probe1 1.462929 1.596203 2.252267 1.974616 1.842606 1.677204 2.242665 1.705923
#> probe2 1.574342 2.232537 1.846260 1.980999 2.483777 2.048599 1.786081 1.934728
#>           subj9   subj10     subj11     subj12      subj13    subj14     subj15
#> probe1 1.758028 1.919597 0.07477602  0.3054811  0.12356526 0.3221119 -0.1346480
#> probe2 1.740329 1.942189 0.08308643 -0.1012774 -0.06669481 0.5413783  0.0245575
#>             subj16     subj17    subj18     subj19     subj20
#> probe1 -0.03845011  0.3504460 0.4173905 -0.1438801  0.3335083
#> probe2 -0.03849619 -0.2536194 0.1420066 -0.1072697 -0.1750033

# phenotype data
pDat = pData(esSim)
print(dim(pDat))
#> [1] 20  2
print(pDat[1:2,])
#>         sid grp
#> subj1 subj1   1
#> subj2 subj2   1

# choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe)
pDat$probe1 = dat[1,]
pDat$probe2 = dat[2,]
pDat$probe3 = dat[3,]
pDat$probe4 = dat[4,]
pDat$probe5 = dat[5,]
pDat$probe6 = dat[6,]

print(pDat[1:2, ])
#>         sid grp   probe1   probe2   probe3    probe4    probe5    probe6
#> subj1 subj1   1 1.462929 1.574342 2.015327 -2.137848 -1.992898 1.0472686
#> subj2 subj2   1 1.596203 2.232537 2.141727 -2.639981 -1.926786 0.3904537

# check histograms of probe 1 expression in cases and controls
pDat$grp=factor(pDat$grp)
print(table(pDat$grp, useNA = "ifany"))
#> 
#>  0  1 
#> 10 10


library(factoextra)

pca.obj = iprcomp(pDat[, c(3:8)], scale. = TRUE)

# scree plot
factoextra::fviz_eig(pca.obj, addlabels = TRUE)

PCA_score(prcomp_obj = pca.obj, 
          dims = c(1, 3),
          data = pDat, 
          color = 'grp',
          loadings = FALSE)

We can also use the wrapper function statVisual:

statVisual(type = 'PCA_score',
           prcomp_obj = pca.obj, 
           dims = c(1, 2),
           data = pDat, 
           color = 'grp',
           loadings = FALSE)

8.3 Heatmap with Row Names Colored by Groups

Heatmap can be used to visualize the patterns among data. Usually results of bi-clustering will be superimposed to the heatmap. To check if the bi-clustering results match the known group information, we can color the nodes (i.e., rownames of the heatmap) by groups. The function Heat can do this task:

data(esSim)
dat = exprs(esSim)
print(dim(dat))
#> [1] 100  20
print(dat[1:2,])
#>           subj1    subj2    subj3    subj4    subj5    subj6    subj7    subj8
#> probe1 1.462929 1.596203 2.252267 1.974616 1.842606 1.677204 2.242665 1.705923
#> probe2 1.574342 2.232537 1.846260 1.980999 2.483777 2.048599 1.786081 1.934728
#>           subj9   subj10     subj11     subj12      subj13    subj14     subj15
#> probe1 1.758028 1.919597 0.07477602  0.3054811  0.12356526 0.3221119 -0.1346480
#> probe2 1.740329 1.942189 0.08308643 -0.1012774 -0.06669481 0.5413783  0.0245575
#>             subj16     subj17    subj18     subj19     subj20
#> probe1 -0.03845011  0.3504460 0.4173905 -0.1438801  0.3335083
#> probe2 -0.03849619 -0.2536194 0.1420066 -0.1072697 -0.1750033

# phenotype data
pDat = pData(esSim)
print(dim(pDat))
#> [1] 20  2
print(pDat[1:2,])
#>         sid grp
#> subj1 subj1   1
#> subj2 subj2   1

# choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe)
pDat$probe1 = dat[1,]
pDat$probe2 = dat[2,]
pDat$probe3 = dat[3,]
pDat$probe4 = dat[4,]
pDat$probe5 = dat[5,]
pDat$probe6 = dat[6,]

print(pDat[1:2, ])
#>         sid grp   probe1   probe2   probe3    probe4    probe5    probe6
#> subj1 subj1   1 1.462929 1.574342 2.015327 -2.137848 -1.992898 1.0472686
#> subj2 subj2   1 1.596203 2.232537 2.141727 -2.639981 -1.926786 0.3904537

pDat$grp=factor(pDat$grp)
Heat(
     data = pDat[, c(2:8)], 
     group = 'grp') 
     

We also can use the wrapper function statVisual:

statVisual(type = 'Heat', 
           data = pDat[, c(2:8)], 
           group = 'grp')

8.4 PVCA plot

Principal variance component analysis (PVCA) is proposed to estimate the variability of experimental effects including batch by hybridizing two popular data analysis methods: principal component analysis (PCA) and variance components analysis (VCA). It can be used as a screening tool to determine which sources of variability (biological, technical or other) are most prominent in a given microarray dataset (https://www.niehs.nih.gov/research/resources/software/biostatistics/pvca/index.cfm).

The function PVCA draws the plot of the weighted average proportion variance versus effects:

library(pvca)

# create a fake Batch variable
data(esSim)
esSim$Batch=c(rep("A", 4), rep("B", 6), rep("C", 10))
PVCA( 
     clin_data = pData(esSim), 
     clin_subjid = "sid", 
     gene_data = exprs(esSim), 
     batch.factors = c("grp", "Batch"))

We can also use the wrapper function statVisual:


statVisual(type = 'PVCA',
           clin_data = pData(esSim), 
           clin_subjid = "sid", 
           gene_data = exprs(esSim), 
           batch.factors = c("grp", "Batch"))

8.5 Volcano Plot

Volcano plot can be used to check if the significant results are reasonable or not. Intuitively, the significant results (with low p-value) should have large absolute values of regression coefficients (in linear regression) or \(log2\)(odds ratios) (in logistic regression). To draw the plot of the relationship between fold change (odds ratio) vs. \(-log10\)(p value) with the option to label significant results, we can use the function Volcano:

library(ggrepel)
library(limma)

library(ggrepel)
library(limma)

# load the simulated dataset
data(esSim)
print(esSim)
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 100 features, 20 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: subj1 subj2 ... subj20 (20 total)
#>   varLabels: sid grp
#>   varMetadata: labelDescription
#> featureData
#>   featureNames: probe1 probe2 ... probe100 (100 total)
#>   fvarLabels: probeid gene chr
#>   fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:

# expression levels
y = exprs(esSim)
print(dim(y))
#> [1] 100  20
print(y[1:2,])
#>           subj1    subj2    subj3    subj4    subj5    subj6    subj7    subj8
#> probe1 1.462929 1.596203 2.252267 1.974616 1.842606 1.677204 2.242665 1.705923
#> probe2 1.574342 2.232537 1.846260 1.980999 2.483777 2.048599 1.786081 1.934728
#>           subj9   subj10     subj11     subj12      subj13    subj14     subj15
#> probe1 1.758028 1.919597 0.07477602  0.3054811  0.12356526 0.3221119 -0.1346480
#> probe2 1.740329 1.942189 0.08308643 -0.1012774 -0.06669481 0.5413783  0.0245575
#>             subj16     subj17    subj18     subj19     subj20
#> probe1 -0.03845011  0.3504460 0.4173905 -0.1438801  0.3335083
#> probe2 -0.03849619 -0.2536194 0.1420066 -0.1072697 -0.1750033

# phenotype data
pDat = pData(esSim)
print(dim(pDat))
#> [1] 20  2
print(pDat)
#>           sid grp
#> subj1   subj1   1
#> subj2   subj2   1
#> subj3   subj3   1
#> subj4   subj4   1
#> subj5   subj5   1
#> subj6   subj6   1
#> subj7   subj7   1
#> subj8   subj8   1
#> subj9   subj9   1
#> subj10 subj10   1
#> subj11 subj11   0
#> subj12 subj12   0
#> subj13 subj13   0
#> subj14 subj14   0
#> subj15 subj15   0
#> subj16 subj16   0
#> subj17 subj17   0
#> subj18 subj18   0
#> subj19 subj19   0
#> subj20 subj20   0

# design matrix
design = model.matrix(~grp, data = pDat)
print(design)
#>        (Intercept) grp
#> subj1            1   1
#> subj2            1   1
#> subj3            1   1
#> subj4            1   1
#> subj5            1   1
#> subj6            1   1
#> subj7            1   1
#> subj8            1   1
#> subj9            1   1
#> subj10           1   1
#> subj11           1   0
#> subj12           1   0
#> subj13           1   0
#> subj14           1   0
#> subj15           1   0
#> subj16           1   0
#> subj17           1   0
#> subj18           1   0
#> subj19           1   0
#> subj20           1   0
#> attr(,"assign")
#> [1] 0 1

options(digits = 3)

# Ordinary fit
fit <- lmFit(y, design)
fit2 <- eBayes(fit)

# get result data frame
resFrame = topTable(fit2,coef = 2, number = nrow(esSim))
print(dim(resFrame))
#> [1] 100   6
print(resFrame[1:2,])
#>        logFC AveExpr     t  P.Value adj.P.Val    B
#> probe5 -1.98  -1.082 -19.7 3.04e-15  1.66e-13 25.0
#> probe3  2.05   0.996  19.6 3.31e-15  1.66e-13 24.9
resFrame$sigFlag  =  resFrame$adj.P.Val < 0.05

resFrame$probe  =  rownames(resFrame)
# make sure set NA to genes non-differentially expressed
resFrame$probe[which(resFrame$sigFlag == FALSE)] = NA

print(resFrame[1:2,])
#>        logFC AveExpr     t  P.Value adj.P.Val    B sigFlag  probe
#> probe5 -1.98  -1.082 -19.7 3.04e-15  1.66e-13 25.0    TRUE probe5
#> probe3  2.05   0.996  19.6 3.31e-15  1.66e-13 24.9    TRUE probe3
print(table(resFrame$sigFlag, useNA = "ifany"))
#> 
#> FALSE  TRUE 
#>    95     5
Volcano(
  resFrame = resFrame, 
  stats = 'logFC', 
  p.value = 'P.Value', 
  group = 'sigFlag', 
  rowname.var = 'probe', 
  point.size = 1)

We also can use the wrapper function statVisual:

statVisual(type = 'Volcano',
           resFrame = resFrame, 
           stats = 'logFC', 
           p.value = 'P.Value', 
           group = 'sigFlag', 
           rowname.var = 'probe', 
           point.size = 1)

9 Analysis focusing on prediction:

9.1 Compare Box Plots with ROC Curve

To compare two distributions, we also can use parallel boxplots. Usually, the ultimate purpose of comparing two distribution is to evaluate if the variable can be used to predict the groups with high accuracy. Hence, we develop the function BoxROC to put parallel boxplots and ROC curve in the same figure. The area under ROC curve is also shown in the figure.

library(dplyr)
library(gridExtra)

data(esSim)
print(esSim)
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 100 features, 20 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: subj1 subj2 ... subj20 (20 total)
#>   varLabels: sid grp
#>   varMetadata: labelDescription
#> featureData
#>   featureNames: probe1 probe2 ... probe100 (100 total)
#>   fvarLabels: probeid gene chr
#>   fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:

# expression data
dat = exprs(esSim)
print(dim(dat))
#> [1] 100  20
print(dat[1:2,])
#>        subj1 subj2 subj3 subj4 subj5 subj6 subj7 subj8 subj9 subj10 subj11
#> probe1  1.46  1.60  2.25  1.97  1.84  1.68  2.24  1.71  1.76   1.92 0.0748
#> probe2  1.57  2.23  1.85  1.98  2.48  2.05  1.79  1.93  1.74   1.94 0.0831
#>        subj12  subj13 subj14  subj15  subj16 subj17 subj18 subj19 subj20
#> probe1  0.305  0.1236  0.322 -0.1346 -0.0385  0.350  0.417 -0.144  0.334
#> probe2 -0.101 -0.0667  0.541  0.0246 -0.0385 -0.254  0.142 -0.107 -0.175

# phenotype data
pDat = pData(esSim)
print(dim(pDat))
#> [1] 20  2
print(pDat[1:2,])
#>         sid grp
#> subj1 subj1   1
#> subj2 subj2   1
pDat$grp = factor(pDat$grp)

# choose the first probe which is over-expressed in cases
pDat$probe1 = dat[1,]

# check histograms of probe 1 expression in cases and controls
print(table(pDat$grp, useNA = "ifany"))
#> 
#>  0  1 
#> 10 10
BoxROC(
  data = pDat,
  group = 'grp', 
  y = 'probe1', 
  point.size = 1)

We also can use the wrapper function statVisual:

statVisual(type = 'BoxROC', 
           data = pDat, 
           group = 'grp', 
           y = 'probe1', 
           point.size = 1)

#> TableGrob (1 x 1) "arrange": 1 grobs
#>   z     cells    name            grob
#> 1 1 (1-1,1-1) arrange gtable[arrange]

9.2 Cross validation plot for glmnet

Cross validation plot can be used to visualize the estimated performance as a function of Lagrange multiplier. For continuous endpoint, mean square error (MSE) is used as performance metric. The function cv_glmnet_plot can do this task:

library(dplyr)
library(tibble)
library(glmnet)


data(esSim)
print(esSim)
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 100 features, 20 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: subj1 subj2 ... subj20 (20 total)
#>   varLabels: sid grp
#>   varMetadata: labelDescription
#> featureData
#>   featureNames: probe1 probe2 ... probe100 (100 total)
#>   fvarLabels: probeid gene chr
#>   fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:

# expression data
dat = exprs(esSim)
print(dim(dat))
#> [1] 100  20
print(dat[1:2,])
#>        subj1 subj2 subj3 subj4 subj5 subj6 subj7 subj8 subj9 subj10 subj11
#> probe1  1.46  1.60  2.25  1.97  1.84  1.68  2.24  1.71  1.76   1.92 0.0748
#> probe2  1.57  2.23  1.85  1.98  2.48  2.05  1.79  1.93  1.74   1.94 0.0831
#>        subj12  subj13 subj14  subj15  subj16 subj17 subj18 subj19 subj20
#> probe1  0.305  0.1236  0.322 -0.1346 -0.0385  0.350  0.417 -0.144  0.334
#> probe2 -0.101 -0.0667  0.541  0.0246 -0.0385 -0.254  0.142 -0.107 -0.175

# phenotype data
pDat = pData(esSim)
print(dim(pDat))
#> [1] 20  2
print(pDat[1:2,])
#>         sid grp
#> subj1 subj1   1
#> subj2 subj2   1

# feature data
fDat = fData(esSim)
print(dim(fDat))
#> [1] 100   3
print(fDat[1:2,])
#>        probeid      gene chr
#> probe1  probe1 fakeGene1   1
#> probe2  probe2 fakeGene2   1

# choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe)
pDat$probe1 = dat[1,]
pDat$probe2 = dat[2,]
pDat$probe3 = dat[3,]
pDat$probe4 = dat[4,]
pDat$probe5 = dat[5,]
pDat$probe6 = dat[6,]

print(pDat[1:2, ])
#>         sid grp probe1 probe2 probe3 probe4 probe5 probe6
#> subj1 subj1   1   1.46   1.57   2.02  -2.14  -1.99   1.05
#> subj2 subj2   1   1.60   2.23   2.14  -2.64  -1.93   0.39

# check histograms of probe 1 expression in cases and controls
print(table(pDat$grp, useNA = "ifany"))
#> 
#>  0  1 
#> 10 10

pDat$grp = factor(pDat$grp)
cv_glmnet_plot(x = as.matrix(pDat[, c(3:8)]), 
               y = pDat$grp, 
               family = "binomial")

We can also use the wrapper function statVisual:

statVisual(type = "cv_glmnet_plot",
           x = as.matrix(pDat[, c(3:8)]), 
           y = pDat$grp, 
           family = "binomial")

9.3 Variable Importance plot

The function ImpPlot can be used to visualize the relative importance of different variables in predicting outcome for random forest model.

library(dplyr)
library(randomForest)
library(tibble)

data(esSim)
dat = exprs(esSim)
print(dim(dat))
#> [1] 100  20
print(dat[1:2,])
#>        subj1 subj2 subj3 subj4 subj5 subj6 subj7 subj8 subj9 subj10 subj11
#> probe1  1.46  1.60  2.25  1.97  1.84  1.68  2.24  1.71  1.76   1.92 0.0748
#> probe2  1.57  2.23  1.85  1.98  2.48  2.05  1.79  1.93  1.74   1.94 0.0831
#>        subj12  subj13 subj14  subj15  subj16 subj17 subj18 subj19 subj20
#> probe1  0.305  0.1236  0.322 -0.1346 -0.0385  0.350  0.417 -0.144  0.334
#> probe2 -0.101 -0.0667  0.541  0.0246 -0.0385 -0.254  0.142 -0.107 -0.175

# phenotype data
pDat = pData(esSim)
print(dim(pDat))
#> [1] 20  2
print(pDat[1:2,])
#>         sid grp
#> subj1 subj1   1
#> subj2 subj2   1

# choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe)
pDat$probe1 = dat[1,]
pDat$probe2 = dat[2,]
pDat$probe3 = dat[3,]
pDat$probe4 = dat[4,]
pDat$probe5 = dat[5,]
pDat$probe6 = dat[6,]

print(pDat[1:2, ])
#>         sid grp probe1 probe2 probe3 probe4 probe5 probe6
#> subj1 subj1   1   1.46   1.57   2.02  -2.14  -1.99   1.05
#> subj2 subj2   1   1.60   2.23   2.14  -2.64  -1.93   0.39

pDat$grp=factor(pDat$grp)


rf_m = randomForest(
  x = pDat[, c(3:8)], 
  y = pDat$grp, 
  importance = TRUE, proximity = TRUE
)
ImpPlot(rf_m)

We can also use the wrapper function statVisual:

statVisual(type = 'ImpPlot', rf_m)

10 Discussion

The R package statVisual that we developed extends existing R plot functions to help visualizing differences among groups for TM/BM applications.

We notice that R package GGally provides a powerful visualization function ggpairs that can put pairwise histograms, estimated densities, scatter plots, boxplots, correlation coefficients in one figure. For example,

library(GGally)

data(esSim)
dat = exprs(esSim)
print(dim(dat))
#> [1] 100  20
print(dat[1:2,])
#>        subj1 subj2 subj3 subj4 subj5 subj6 subj7 subj8 subj9 subj10 subj11
#> probe1  1.46  1.60  2.25  1.97  1.84  1.68  2.24  1.71  1.76   1.92 0.0748
#> probe2  1.57  2.23  1.85  1.98  2.48  2.05  1.79  1.93  1.74   1.94 0.0831
#>        subj12  subj13 subj14  subj15  subj16 subj17 subj18 subj19 subj20
#> probe1  0.305  0.1236  0.322 -0.1346 -0.0385  0.350  0.417 -0.144  0.334
#> probe2 -0.101 -0.0667  0.541  0.0246 -0.0385 -0.254  0.142 -0.107 -0.175

# phenotype data
pDat = pData(esSim)
print(dim(pDat))
#> [1] 20  2
print(pDat[1:2,])
#>         sid grp
#> subj1 subj1   1
#> subj2 subj2   1

# choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe)
pDat$probe1 = dat[1,]
pDat$probe2 = dat[2,]
pDat$probe3 = dat[3,]
pDat$probe4 = dat[4,]
pDat$probe5 = dat[5,]
pDat$probe6 = dat[6,]

print(pDat[1:2, ])
#>         sid grp probe1 probe2 probe3 probe4 probe5 probe6
#> subj1 subj1   1   1.46   1.57   2.02  -2.14  -1.99   1.05
#> subj2 subj2   1   1.60   2.23   2.14  -2.64  -1.93   0.39

pDat$grp=factor(pDat$grp)


ggpairs(data = pDat, 
    mapping = ggplot2::aes_string(color = 'grp'), 
        columns = c('probe1', 'probe5', 'probe6'), 
        upper = list(continuous = "cor", 
                     combo = "box_no_facet", 
                     discrete = "facetbar", 
                     na = "na"), 
        lower = list(continuous = "points", 
                     combo = "facethist", 
                     discrete = "facetbar", 
                     na = "na"), 
        diag = list(continuous = "densityDiag", 
                    discrete = "barDiag", 
                    na = "naDiag"), 
        xlab = 'X', 
    ylab = 'Y', 
    title = 'Title')

GGally also provides a function ggcorr to superimpose correlations onto the heatmap. For instance,


ggcorr(data = pDat[, c(3:8)], 
       method = 'pairwise', 
       label = TRUE, 
       label_round = 2, 
       label_size = 4)

Our R package statVisual is a useful complement to GGally since statVisual provides many functions (e.g., BoxROC, LinePlot, Box, ErrBar) that GGally does not provide.

We welcome comments and suggestions to improve statVisual.

In future, after appropriate improvements (e.g., replacing the internal datasets with public available datasets or simulated datasets) we will submit statVisual to CRAN (https://cran.r-project.org/) and submit a manuscript to a peer-reviewed journal so that other researchers can use this R package to facilitate their data analyses for TM/BM applications.