---
title: "Choosing the NMF rank on data with a known true rank"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Choosing the NMF rank on data with a known true rank}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

Choosing the number of basis vectors (the *rank* `Q`) is the central tuning
problem in NMF.  `nmfkc` offers several rank-selection tools, each based on a
different principle:

| function | principle | a good rank... |
|---|---|---|
| `nmfkc.rank` | integrated diagnostics: R-squared, effective rank, element-wise CV | minimises CV error / R-squared elbow |
| `nmfkc.ecv` | element-wise cross-validation (Wold) | minimises held-out error |
| `nmfkc.bicv` | block bi-cross-validation (Owen & Perry) | minimises held-out error |
| `nmfkc.consensus` | clustering stability (Brunet) | maximises stability (or minimises PAC) |
| `nmfkc.ard` | Bayesian automatic relevance determination (Tan & Fevotte) | surviving components after pruning |

Since there is no single "best" criterion, this vignette applies all of them
to a **synthetic dataset whose true rank is known**, so we can see which ones
recover it.

```{r lib}
library(nmfkc)
```

## A synthetic dataset (true rank = 3)

We build a non-negative `24 x 60` matrix from **three** feature blocks and
**three** sample clusters (20 samples each), then add noise.  By construction
the true rank is `3`.

```{r data, fig.width = 6, fig.height = 4}
set.seed(123)
r_true <- 3
P <- 24                       # features
n_each <- 20                  # samples per cluster
N <- r_true * n_each          # 60 samples

# basis: each latent factor is a distinct block of features
X <- matrix(0, P, r_true)
X[1:8, 1] <- 1; X[9:16, 2] <- 1; X[17:24, 3] <- 1
X <- X + matrix(runif(P * r_true, 0, 0.1), P, r_true)   # small background

# coefficients: each sample loads mainly on its own cluster's factor
cl_true <- rep(1:r_true, each = n_each)
B <- matrix(runif(r_true * N, 0, 0.2), r_true, N)
for (j in seq_len(N)) B[cl_true[j], j] <- runif(1, 1, 2)

Y <- X %*% B
Y <- Y + matrix(rnorm(P * N, 0, 0.15 * mean(X %*% B)), P, N)  # noise
Y[Y < 0] <- 0

dim(Y)
image(1:N, 1:P, t(Y), xlab = "sample", ylab = "feature",
      main = "Synthetic data (true rank = 3)")
```

## 1. Integrated diagnostics: `nmfkc.rank`

`nmfkc.rank()` fits each rank and draws three curves on one plot:
**R-squared** (red, with a "Best (Elbow)" marker), the broken-stick-corrected
**effective rank** index (green, shown for context), and the element-wise
**ECV error** (blue, with a "Best (Min)" marker).  The recommended rank is the
ECV minimum (or the R-squared elbow).

```{r rank, fig.width = 7, fig.height = 6}
rk <- nmfkc.rank(Y, rank = 1:6)
rk$rank.best                       # recommended rank (ECV minimum)
round(rk$criteria, 3)
```

## 2. Cross-validation: `nmfkc.ecv` and `nmfkc.bicv`

Both are predictive engines that return the held-out error per rank; the
recommended rank minimises it.  `nmfkc.ecv` holds out scattered *entries*
(Wold's CV); `nmfkc.bicv` holds out a *block* of rows and columns at once
(Owen & Perry).

```{r cv}
ev <- nmfkc.ecv (Y, rank = 1:6)
bv <- nmfkc.bicv(Y, rank = 1:6)   # nfolds = 2 (Owen & Perry) by default
data.frame(rank = 1:6,
           sigma.ecv  = round(ev$sigma, 3),
           sigma.bicv = round(bv$sigma, 3))
cat("ECV  picks rank", which.min(ev$sigma),
    "| bi-CV picks rank", bv$rank[which.min(bv$sigma)], "\n")
```

## 3. Clustering stability: `nmfkc.consensus`

For each rank, NMF is run many times from random initialisations and the
reproducibility of the sample clustering is summarised by `cophenetic`
(Brunet), `dispersion` (Kim & Park; higher = crisper) and `pac` (Senbabaoglu;
lower = less ambiguous).

```{r consensus, fig.width = 7, fig.height = 5}
cs <- nmfkc.consensus(Y, rank = 2:6, nrun = 20, keep.consensus = TRUE)
cs
plot(cs)                               # stability curves
```

Each panel is the consensus matrix at one rank, reordered by hierarchical
clustering (blue = 0 / different cluster, red = 1 / same cluster):

```{r consensus-heatmap, fig.width = 8, fig.height = 6}
plot(cs, type = "heatmap")   # all ranks
```

A striking point: the consensus stays essentially **three crisp blocks even at
ranks 4--6**.  The surplus basis vectors do **not** create new *reproducible*
clusters --- at ranks 4--5 they stay near-empty (the hard clustering still uses
only three factors), and at rank 6 a cluster splits only *erratically* across
restarts, which averages out (just a few intermediate, pink entries, and a small
drop in `dispersion` / rise in `PAC`).  So consensus reveals the genuine stable
structure (`3`) robustly, **even when the rank is over-specified** --- a useful
property.  (Use `plot(cs, type = "heatmap", rank = 3)` to enlarge a single rank
with sample labels.)

## 4. Bayesian ARD: `nmfkc.ard`

ARD fits NMF **once** at an over-complete rank and prunes unneeded components
automatically; the relevance bar plot is read like a scree plot (look for the
knee).  Because it is a sensitive point estimate, we aggregate over several
random restarts (`nrun`).

```{r ard, fig.width = 6.5, fig.height = 4.5}
ar <- nmfkc.ard(Y, rank = 8, nrun = 20)   # >=10-20 restarts for a stable mode
ar                                        # see "rank over runs" for confidence
plot(ar)
```

## Summary: did they recover the true rank?

```{r summary}
data.frame(
  method = c("nmfkc.rank (ECV)", "nmfkc.ecv", "nmfkc.bicv",
             "nmfkc.consensus (dispersion)", "nmfkc.consensus (PAC)",
             "nmfkc.ard"),
  estimate = c(rk$rank.best,
               which.min(ev$sigma),
               bv$rank[which.min(bv$sigma)],
               cs$rank[which.max(cs$dispersion)],
               cs$rank[which.min(cs$pac)],
               ar$rank),
  true = r_true
)
```

## How the clustering evolves with rank

A complementary view: `nmf.cluster.flow()` draws a Sankey / alluvial diagram of
how the hard sample clustering changes as the rank grows.  We fit ranks `1:6`
(so the list index equals the rank) and colour the flows by the clustering at
the **true rank**, `reference = 3`.  The adjacent-rank ARI (agreement) is printed
along the top.

```{r flow, fig.width = 8, fig.height = 6}
fits <- lapply(1:6, function(q) nmfkc(Y, Q = q, print.dims = FALSE))
fl <- nmf.cluster.flow(fits, reference = 3)
head(fl$clusters)      # rows = individuals, columns = rank, entries = cluster id
```

At `rank = 1` all 60 samples form a single group; at `rank = 3` they split
cleanly into the three true clusters (20 each); beyond that the groups
over-split (the ARI to the next rank starts to fall) --- visual confirmation
that `3` is the right number of basis vectors here.

## Remarks

On this clean dataset with a clear rank-3 structure, **almost every criterion
recovers the true rank**: the predictive cross-validations (`nmfkc.ecv`,
`nmfkc.bicv`), the integrated `nmfkc.rank`, the consensus `dispersion` (with a
crisp three-block consensus matrix) and ARD (every `nrun` restart agrees on
`3`, relevance dropping to zero after the third component) all give `3`.  Only
`PAC` leans to the coarser rank `2` --- a small hint of the low-rank bias of
stability measures.

On **real** data the criteria often *disagree* much more: stability/consensus
tends to favour low (coarse) ranks, and ARD's count can track the starting rank
when the energy spectrum decays smoothly (no clear knee).  A robust practical
recipe is to **triangulate**:
use ECV for the predictive upper bound, the effective-rank / relevance *knee*
for the lower bound, and the consensus heatmap to confirm an interpretable block
structure --- then choose one rank by purpose and domain knowledge.

See `?nmfkc.rank`, `?nmfkc.ecv`, `?nmfkc.bicv`, `?nmfkc.consensus` and
`?nmfkc.ard` (all cross-referenced) for details.
