---
title: "Using CliPS to Identify a Dynamic Mixture of Finite Mixtures of Multivariate Gaussian Distributions -- Case Study: Diabetes Data"
author: Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, Bettina Grün
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using CliPS to Identify a Dynamic Mixture of Finite Mixtures of Multivariate Gaussian Distributions -- Case Study: Diabetes Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: telescope.bib 
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

This vignette reproduces the results of the case study discussed in
@Malsiner+Fruehwirth+Gruen:2026 for the `diabetes` data set from
package **mclust**. In particular, we illustrate the CliPS approach
proposed in @Malsiner+Fruehwirth+Gruen:2026 for a Bayesian
multivariate Gaussian mixture with a prior on the number of components
$K$. First, we fit the dynamic mixture of multivariate Gaussian
mixtures to the data using the telescoping sampler. Then, based on the
posterior draws, we identify the mixture by clustering the component
means in the point process representation. The numbering of the
following steps is aligned with the numbering of the CliPS procedure
in @Malsiner+Fruehwirth+Gruen:2026.

We start by loading the package and the data, and visualize the data
in a pairwise scatter plot.

```{r fig.width=7, fig.height=7}
library("telescope")
data("diabetes", package = "mclust")
y <- diabetes[, c("glucose", "insulin", "sspg")]
z <- diabetes[, "class"]

pairs(y, col = c("darkred", "blue", "darkgreen")[z], pch = 19)
```

# Step 1: Define a mixture model

We use the prior specification and the telescoping sampler for MCMC
sampling as proposed in
@Fruehwirth-Schnatter+Malsiner-Walli+Gruen:2021. In particular, the
prior on the number of components $K$ is a beta-negative-binomial
distribution with parameters $(1, 4, 3)$. A priori, the mixture
weights follow a symmetric Dirichlet distribution with parameter
$\alpha/K$, where $\alpha = 0.5$ is held fixed.

```{r}
priorOnK <- priorOnK_spec("BNB_143")
priorOnWeights <- priorOnAlpha_spec("alpha_const", alpha = 0.5)
```
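
To get a feeling for this prior on $K$, its probability mass function
can be evaluated directly. The following minimal sketch (not needed
for the workflow) assumes the parameterization in
@Fruehwirth-Schnatter+Malsiner-Walli+Gruen:2021, where $K - 1$ follows
a beta-negative-binomial distribution with parameters $(1, 4, 3)$; the
helper `dbnb()` is defined here for illustration only.

```{r}
## pmf of the beta-negative-binomial distribution (illustrative sketch)
dbnb <- function(x, a, b, c) {
    exp(lgamma(a + x) - lgamma(a) - lgamma(x + 1) +
        lbeta(a + b, x + c) - lbeta(b, c))
}
## implied prior probabilities of K = 1, ..., 10 if K - 1 ~ BNB(1, 4, 3)
round(setNames(dbnb(0:9, 1, 4, 3), paste0("K=", 1:10)), 3)
```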

We specify the prior on the component parameters as in
@Fruehwirth-Schnatter+Malsiner-Walli+Gruen:2021.
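
For orientation, the hyperparameters defined in the following chunk
correspond to the hierarchical prior used there which, up to the exact
parameterization conventions for the (inverse) Wishart distribution,
has the form

$$\mu_k \sim \mathcal{N}(b_0, B_0), \qquad
\Sigma_k \sim \mathcal{W}^{-1}(c_0, C_0), \qquad
C_0 \sim \mathcal{W}(g_0, G_0).$$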

```{r}
r <- ncol(y)                                  # number of variables
R <- apply(y, 2, function(x) diff(range(x)))  # range of each variable
b0 <- apply(y, 2, median)                     # prior mean of the component means
B_0 <- rep(1, r)
B0 <- diag((R^2) * B_0)                       # prior covariance of the component means
c0 <- 2.5 + (r - 1)/2                         # shape parameter for Sigma_k
g0 <- 0.5 + (r - 1)/2                         # shape parameter for C0
G0 <- 100 * g0/c0 * diag((1/R^2), nrow = r)   # scale matrix of the prior on C0
C0 <- g0 * chol2inv(chol(G0))                 # scale matrix of the prior on Sigma_k
```

For the sampling, we specify the number of burn-in iterations `burnin`
to be discarded, the number of iterations `M` to be recorded and the
thinning parameter `thin`.

```{r}
M <- 1000
burnin <- 1000
thin <- 1
```

An initial partition of the observations into three groups is obtained
by $k$-means clustering. Based on this partition, initial values for
the component-specific means and covariance matrices are determined.
The component weights are initialized with equal weights.

```{r}
set.seed(4711)
z0 <- kmeans(y, centers = 3, nstart = 100)$cluster  # initial partition
mu <- sapply(split(y, z0), colMeans)                # initial component means
Sigma <- array(sapply(split(y, z0), var),
               c(r, r, ncol(mu)))                   # initial covariances
eta <- rep(1/ncol(mu), ncol(mu))                    # equal initial weights
```

# Step 2: MCMC sampling from the posterior

Using this prior specification together with the initialization and
the MCMC settings, we draw samples from the posterior using the
telescoping sampler. For computational reasons, the maximum number of
components considered is limited by `Kmax`.

```{r}
Kmax <- 50
samples <-
    sampleMultNormMixture(
        y, z0, mu, Sigma, eta,
        c0, g0, G0, C0, b0, B0,
        M, burnin, thin, 
        Kmax = Kmax, G = "MixDynamic",
        priorOnK = priorOnK, priorOnWeights = priorOnWeights)
```

The sampling function returns a named list containing the sampled
parameters and latent variables: the component means `Mu`, the weights
`Eta`, the allocations `S`, the number of observations `Nk` assigned
to each component, the number of components `K`, the number of filled
components `Kplus`, and the parameter values corresponding to the mode
of the nonnormalized posterior `nonnormpost_mode_list`. These values
are extracted for further post-processing.

```{r}
Mu <- samples$Mu
Eta <- samples$Eta
S <- samples$S
Nk <- samples$Nk
K <- samples$K
Kplus <- samples$Kplus
nonnormpost_mode_list <- samples$nonnormpost_mode_list
```
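
The structure of the returned list may be verified, e.g., using
`str()`:

```{r}
str(samples, max.level = 1)
```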

Diagnostic trace plots of the run show the sampled values of $K$ and
$K_+$ as well as the sampled weights $\eta_k$; see Figure 5 in
@Malsiner+Fruehwirth+Gruen:2026.

```{r fig.width=7, fig.height=5}
par(mfrow = c(1, 2))
with(samples,
     matplot(burnin + 1:M, cbind(K, Kplus), type = "l", lty = 1,
             col = c("grey", "black"),
             xlab = "iteration", 
             ylab = expression(K/K["+"]),
             ylim = c(0, Kmax)))
matplot(burnin + 1:M, samples$Eta, type = "l", lty = 1, col = 1,
        xlab = "iteration", ylim = 0:1,
        ylab = expression(eta["k"]))
```

The following plot shows pairwise scatter plots of all sampled
component means (restricted to suitable ranges); see Figure 6 in
@Malsiner+Fruehwirth+Gruen:2026.

```{r fig.width=7, fig.height=7}
Mu_ <- do.call("rbind", lapply(1:Kmax, function(k) samples$Mu[,,k])) |>
    as.data.frame() |> na.omit()
colnames(Mu_) <- colnames(y)
Mu_ <- subset(Mu_,
              (glucose >= 50 & glucose <= 320) &
              (sspg >= 0 & sspg <= 500) &
              (insulin >= 300 & insulin <= 1300))
pairs(Mu_, col = rgb(0, 0, 0, 0.2), pch = 19)
```

# Step 2b: Post-processing the MCMC draws

We determine the posterior distribution of the number of filled
components $K_+$.

```{r}
(p_Kplus <- tabulate(Kplus, nbins = max(Kplus))/M)
```
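
This posterior may also be inspected graphically, e.g., using a simple
bar plot of the estimated probabilities:

```{r fig.width=5, fig.height=4}
barplot(p_Kplus, names.arg = seq_along(p_Kplus),
        xlab = expression(K["+"]), ylab = "estimated posterior probability")
```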

The number of clusters $\hat{K}_+$ is estimated by taking the mode of
the posterior of $K_+$.

```{r}
Kplus_hat <- which.max(p_Kplus)
Kplus_hat
```

The number of draws $M_0$ where $K_+ = \hat{K}_+$ is determined.

```{r}
M0 <- sum(Kplus == Kplus_hat)
M0 
```

We determine the indices of those iterations where exactly
$\hat{K}_+$ components are filled. For each parameter, we extract the
corresponding draws and eliminate the draws of empty components.

```{r}
index <- Kplus == Kplus_hat    # draws with exactly Kplus_hat filled components

Nk[is.na(Nk)] <- 0
Nk_Kplus <- (Nk[index, ] > 0)  # indicates the filled components per draw

## extract the means of the filled components
Mu_inter <- Mu[index, , , drop = FALSE]
Mu_Kplus <- array(0, dim = c(M0, r, Kplus_hat))
for (j in 1:r) {
  Mu_Kplus[, j, ] <- Mu_inter[, j, ][Nk_Kplus]
}

## extract the weights of the filled components
Eta_inter <- Eta[index, ]
Eta_Kplus <- matrix(Eta_inter[Nk_Kplus], ncol = Kplus_hat)

## relabel the allocations to range over the filled components only
w <- which(index)
S_Kplus <- matrix(0, M0, nrow(y))
for (i in seq_along(w)) {
  m <- w[i]
  perm_S <- rep(0, Kmax)
  perm_S[Nk[m, ] != 0] <- 1:Kplus_hat
  S_Kplus[i, ] <- perm_S[S[m, ]]
}
```
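
A few sanity checks, making the shapes we expect after this extraction
explicit, guard against indexing mistakes: each selected draw should
contribute exactly $\hat{K}_+$ filled components and all relabeled
allocations should lie in $\{1, \ldots, \hat{K}_+\}$.

```{r}
## sanity checks on the expected dimensions of the extracted draws
stopifnot(
    dim(Mu_Kplus) == c(M0, r, Kplus_hat),
    dim(Eta_Kplus) == c(M0, Kplus_hat),
    all(S_Kplus %in% 1:Kplus_hat)
)
```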

# Steps 3-4-5: Clustering of the draws 

We call the function `identifyMixture()` of the package **telescope**
to cluster the draws. The argument `Func` contains the array of
functional values $\phi(\theta_k)$ used for clustering, with dimension
$M_0 \times d \times K_+$, where $d = \dim(\phi(\theta_k))$.
`Func_init` contains the centers of the clusters used for initializing
$k$-means. The draws in `Mu_Kplus`, `Eta_Kplus` and `S_Kplus` are
reordered according to the classification obtained with the $k$-means
algorithm.

```{r}
Func_init <- t(nonnormpost_mode_list[[Kplus_hat]]$mu)
identified_Kplus <- identifyMixture(
    Func = Mu_Kplus, Mu_Kplus, Eta_Kplus, S_Kplus, Func_init)
```

`identifyMixture()` returns a named list where `S`, `Mu`, and `Eta`
contain the relabeled draws after discarding the draws which are not
permutations, `non_perm_rate` gives the non-permutation rate, and
`class` contains the labels of the functionals obtained with the
$k$-means algorithm.

```{r}
identified_Kplus$non_perm_rate
```

The non-permutation rate is $`r identified_Kplus$non_perm_rate`$.
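
The number of draws retained for identification can be read off the
relabeled arrays; under our reading of the output, the first dimension
of `Mu` equals $M_0$ minus the number of discarded draws.

```{r}
dim(identified_Kplus$Mu)
```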

We visualize the relabeled draws of the component means.

```{r fig.width=7, fig.height=7}
Mu_ <- do.call("rbind",
               lapply(1:Kplus_hat,
                      function(k) identified_Kplus$Mu[,,k])) |>
    as.data.frame()
colnames(Mu_) <- colnames(y)
z_ <- identified_Kplus$class
COLS <- apply(rbind(col2rgb(c("darkred", "blue", "darkgreen")),
                    alpha = 0.2 * 255) / 255,
              2, function(x) do.call("rgb", as.list(x)))
pairs(Mu_, col = COLS[z_], pch = 19)
```

# Step 6: Characterization of the cluster distributions

We use the relabeled draws to characterize the cluster
distribution. We estimate the cluster specific parameters (e.g.,
posterior means and cluster sizes) and determine the final partition
by assigning each observation to the cluster where it was assigned
most frequently. The final partition is stored in `z_sp`. Finally, the
estimated clustering solution is visualized.

```{r fig.width=7, fig.height=7}
colMeans(identified_Kplus$Mu)   # posterior means of the cluster means
colMeans(identified_Kplus$Eta)  # posterior means of the cluster sizes

## final partition: most frequent cluster assignment per observation
z_sp <- apply(identified_Kplus$S, 2,
              function(x) which.max(tabulate(x, Kplus_hat)))
table(z_sp)
table(z, z_sp)

## compare the partition to the known class memberships
library("mclust")
1 - classError(z_sp, z)$errorRate  # proportion correctly classified
adjustedRandIndex(z, z_sp)

pairs(y, col = c("darkred", "blue", "darkgreen")[z_sp], pch = 19)
```
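
Beyond these point estimates, the relabeled draws also allow us to
quantify posterior uncertainty of the cluster-specific parameters. As
a sketch (assuming the columns of `Eta` correspond to the identified
clusters), 95% credible intervals for the cluster sizes $\eta_k$ can
be obtained as follows:

```{r}
## elementwise quantiles over the retained draws, one column per cluster
apply(identified_Kplus$Eta, 2, quantile, probs = c(0.025, 0.975))
```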

# References

