Introduction to Partition

Malcolm Barrett

2024-01-24

Introduction to the partition package

partition is a fast and flexible data reduction framework for R (Millstein et al. 2020). There are many approaches to data reduction, such as principal components analysis (PCA) and hierarchical clustering (both supported in base R). In contrast, partition attempts to create a reduced data set that is both interpretable (each raw feature maps to one and only one reduced feature) and information-rich (reduced features must meet an information constraint). Reducing the data this way often results in a data set that has a mix of raw features from the original data and reduced features.
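
To make the interpretability point concrete, here is a minimal base R sketch on toy data. It is not partition's algorithm: the data, the `mapping` vector, and the use of row means as reduced features are all assumptions made for illustration. It contrasts PCA, where every component draws on every raw feature, with a reduction in which each raw feature maps to exactly one reduced feature.

```r
set.seed(1)

# toy data (hypothetical): two pairs of correlated features
n <- 50
a <- rnorm(n)
b <- rnorm(n)
toy <- data.frame(
  x1 = a + rnorm(n, sd = 0.3),
  x2 = a + rnorm(n, sd = 0.3),
  x3 = b + rnorm(n, sd = 0.3),
  x4 = b + rnorm(n, sd = 0.3)
)

# PCA: each component is a weighted combination of all four raw features
pca <- prcomp(toy, scale. = TRUE)
loadings <- pca$rotation

# one-to-one mapping (assumed by hand here): each raw feature belongs to
# exactly one reduced feature, summarized as the mean of its members
mapping <- c(x1 = "reduced_1", x2 = "reduced_1", x3 = "reduced_2", x4 = "reduced_2")
reduced <- sapply(
  split(names(mapping), mapping),
  function(cols) rowMeans(toy[cols])
)
```

In the PCA loadings, every row (raw feature) contributes to every column (component), whereas `mapping` can be read off directly: x1 and x2 feed only reduced_1, and x3 and x4 feed only reduced_2.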

partition is particularly useful for highly correlated data, such as genomic data, where there is a lot of redundancy. A simple model of, say, gene expression data is a set of block-correlated Gaussian variables. simulate_block_data() simulates data like this: blocks of correlated features, with each block independent of the other blocks in the data.

library(partition)
library(ggplot2)
set.seed(1234)
# create a 100 x 15 data set with 3 blocks
df <- simulate_block_data(
  # create 3 correlated blocks of 5 features each
  block_sizes = rep(5, 3),
  lower_corr = .4,
  upper_corr = .6,
  n = 100
)
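
For intuition about the block-correlated Gaussian model, the following base R sketch builds similar data by hand. It is not the implementation of simulate_block_data(): it assumes a single within-block correlation `rho` rather than a range between lower_corr and upper_corr, and the object names are hypothetical.

```r
set.seed(1234)

n <- 100
block_sizes <- rep(5, 3)
p <- sum(block_sizes)
rho <- 0.5 # assumed common within-block correlation

# block membership of each feature: 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
block_id <- rep(seq_along(block_sizes), times = block_sizes)

# target correlation matrix: rho within a block, 0 between blocks
same_block <- outer(block_id, block_id, "==")
sigma <- same_block * rho
diag(sigma) <- 1

# draw independent standard normals, then induce the correlation
# structure with the Cholesky factor of sigma
z <- matrix(rnorm(n * p), nrow = n, ncol = p)
x <- z %*% chol(sigma)
colnames(x) <- paste0("block", block_id, "_x", sequence(block_sizes))

# average sample correlation within blocks versus between blocks
r <- cor(x)
off_diag <- row(r) != col(r)
within_mean <- mean(r[same_block & off_diag])
between_mean <- mean(r[!same_block])
```

Comparing `within_mean` with `between_mean` shows the same pattern the simulated data are designed to have: features correlate with members of their own block and are roughly uncorrelated with everything else.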

In a heatmap showing the correlations between the simulated features, blocks of correlated features are visible:

ggcorrplot::ggcorrplot(corr(df))

Many types of data follow a pattern like this. Closely related to the block correlation structure found in genetic data is that found in microbiome data. The data set baxter_otu has microbiome data on 172 healthy patients. Each row represents a patient, and each column represents an Operational Taxonomic Unit (OTU). OTUs are species-like relationships between bacteria determined by analyzing their RNA. Each cell in the data set holds the log-transformed count of an OTU found in a patient’s stool sample, with 1,234 OTUs in all.

baxter_otu
#> # A tibble: 172 × 1,234
#>    otu_1 otu_2 otu_3 otu_4 otu_5 otu_6 otu_7 otu_8 otu_9 otu_10 otu_11 otu_12
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
#>  1  0     0        0  1.39     0  0        0     0     0  2.08   0       0   
#>  2  2.48  0        0  0        0  0        0     0     0  0      0       0   
#>  3  1.10  0        0  3.26     0  0        0     0     0  0      0       0   
#>  4  1.39  0        0  1.10     0  1.10     0     0     0  0.693  0.693   0   
#>  5  0     0        0  3.00     0  0        0     0     0  0      0       2.40
#>  6  0     0        0  0        0  0        0     0     0  0      0       0   
#>  7  0     0        0  0        0  0        0     0     0  0      0       0   
#>  8  0     0        0  0        0  0        0     0     0  0      0       0   
#>  9  0     0        0  0        0  0        0     0     0  0      0       1.79
#> 10  1.39  1.79     0  0        0  0        0     0     0  0      0       0   
#> # ℹ 162 more rows
#> # ℹ 1,222 more variables: otu_13 <dbl>, otu_14 <dbl>, otu_15 <dbl>,
#> #   otu_16 <dbl>, otu_17 <dbl>, otu_18 <dbl>, otu_19 <dbl>, otu_20 <dbl>,
#> #   otu_21 <dbl>, otu_22 <dbl>, …

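While not as apparent as in the simulated data, correlated blocks also appear in these data; bacteria tend to group together into communities or cliques in the microbiomes of participants. Here are the correlations among the first 200 OTUs, with the features reordered by hierarchical clustering so that related OTUs sit next to each other: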

correlation_subset <- corr(baxter_otu[, 1:200])
ggcorrplot::ggcorrplot(correlation_subset, hc.order = TRUE) + ggplot2::theme_void()