---
title: "diagFDR: MaxQuant diagnostics from msms.txt"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{diagFDR: MaxQuant diagnostics from msms.txt}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4.5
)
```

This vignette demonstrates how to compute **diagFDR** diagnostics from a **MaxQuant**
`msms.txt` file.

MaxQuant reports a PSM-level score (Andromeda `Score`) and a decoy indicator (`Reverse`).
`diagFDR` can reconstruct **target–decoy counting (TDC)** q-values from the score and then
compute stability and calibration diagnostics on the **competed PSM universe**.

The typical workflow is:

1. Export `msms.txt` from MaxQuant.
2. Read `msms.txt` into a `dfdr_tbl` (PSM-level) using `read_dfdr_maxquant_msms()`.
3. Run `dfdr_run_all()` and inspect headline/stability/calibration diagnostics.
4. Optionally export a report folder (CSV+PNG+manifest+README+summary) and/or render an HTML report.

## Runnable toy example (no external files required)

We start with a simulated dataset that is similar to the `dfdr_tbl` returned by
`read_dfdr_maxquant_msms()`. Any software producing outputs that can be mapped to
columns `id`, `is_decoy`, `q`, `pep`, `run`, and `score` can similarly be used.

```{r toy}
library(diagFDR)

set.seed(10)

n <- 8000
toy <- data.frame(
  id = as.character(seq_len(n)),
  is_decoy = sample(c(FALSE, TRUE), n, replace = TRUE, prob = c(0.98, 0.02)),
  run = sample(paste0("run", 1:3), n, replace = TRUE),
  score = c(rnorm(n * 0.98, mean = 7, sd = 1), rnorm(n * 0.02, mean = 5, sd = 1))[seq_len(n)],
  pep = NA_real_
)

# Reconstruct q-values by simple TDC from score (for toy data only):
# Here we mimic a typical "higher score is better" setting.
toy <- toy[order(toy$score, decreasing = TRUE), ]
toy$D_cum <- cumsum(toy$is_decoy)
toy$T_cum <- cumsum(!toy$is_decoy)
toy$FDR_hat <- (toy$D_cum + 1) / pmax(toy$T_cum, 1)
toy$q <- rev(cummin(rev(toy$FDR_hat)))
toy <- toy[, c("id","is_decoy","q","pep","run","score")]

x_toy <- as_dfdr_tbl(
  toy,
  unit = "psm",
  scope = "global",
  q_source = "toy TDC from score",
  q_max_export = 1,
  provenance = list(tool = "toy")
)

diag <- dfdr_run_all(xs = list(MaxQuant_PSM = x_toy), alpha_main = 0.01)
```

### Headline stability at 1%

```{r headline}
diag$tables$headline
```

### Decoy tail support and stability proxy

```{r stability-plots}
diag$plots$dalpha
diag$plots$cv
```

### Local boundary support and elasticity

```{r boundary-stability}
diag$plots$dwin
diag$plots$elasticity
```

### Equal-chance plausibility by q-band

```{r equal-chance}
diag$tables$equal_chance_pooled
diag$plots$equal_chance__MaxQuant_PSM
```

## Real MaxQuant workflow (msms.txt)

The code below shows how to run **diagFDR** directly from a MaxQuant `msms.txt` file.

```{r real-maxquant, eval=FALSE}
library(diagFDR)

mq_path <- "path/to/msms.txt"

# Read msms.txt and reconstruct TDC q-values using MaxQuant Score and Reverse indicator.
# - Reverse == "+" is treated as a decoy indicator.
# - Score is assumed "higher is better".
# - q-values are computed using FDR(i) = (D(i)+add_decoy)/T(i) and q(i)=min_{j>=i} FDR(j).
x_mq <- read_dfdr_maxquant_msms(
  path = mq_path,
  pep_mode = "sanitize",          # or "drop" if PEP contains values >1
  exclude_contaminants = TRUE,
  add_decoy = 1L,
  unit = "psm",
  scope = "global",
  provenance = list(tool = "MaxQuant", file = basename(mq_path))
)

# Run diagnostics
diag <- dfdr_run_all(
  xs = list(MaxQuant_PSM = x_mq),
  alpha_main = 0.01,
  alphas = c(1e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2, 1e-1, 2e-1),
  eps = 0.2,
  win_rel = 0.2,
  truncation = "warn_drop",
  low_conf = c(0.2, 0.5)
)

# Inspect headline diagnostics
diag$tables$headline

# Export tables/plots + human-readable summary
dfdr_write_report(
  diag,
  out_dir = "diagFDR_maxquant_out",
  formats = c("csv", "png", "manifest", "readme", "summary")
)

# Optional: render a single HTML report (requires rmarkdown in Suggests)
dfdr_render_report(diag, out_dir = "diagFDR_maxquant_out")


library(diagFDR)

mq_path <- "path/to/msms.txt"

# Read msms.txt and reconstruct TDC q-values
x_mq <- read_dfdr_maxquant_msms(
  path = mq_path,
  pep_mode = "sanitize",
  exclude_contaminants = TRUE,
  add_decoy = 1L,
  unit = "psm",
  scope = "global",
  provenance = list(tool = "MaxQuant", file = basename(mq_path))
)

# Run diagnostics with automatic pseudo p-value computation
# This will use score_to_pvalue(method="decoy_ecdf") internally
diag <- dfdr_run_all(
  xs = list(MaxQuant_PSM = x_mq),
  alpha_main = 0.01,
  compute_pseudo_pvalues = TRUE,   # generates pseudo p-values from scores
  pseudo_pvalue_method = "decoy_ecdf",  # Most defensible method for arbitrary scores
  p_stratify = "run"  # Optional: stratify p-value diagnostics by run
)

# diag will contain:
# - All standard target-decoy diagnostics
# - PLUS p-value calibration plots and tables
# - PLUS Storey pi0 diagnostics
# - PLUS BH comparison diagnostics

# Inspect headline + p-value calibration
diag$tables$headline
diag$tables$p_calibration_summary

# Export everything
dfdr_write_report(
  diag,
  out_dir = "diagFDR_maxquant_out",
  formats = c("csv", "png", "manifest", "readme", "summary")
)

# Render HTML report (will include p-value diagnostics section)
dfdr_render_report(diag, out_dir = "diagFDR_maxquant_out")
```
##Interpretation of pseudo p-value diagnostics

When compute_pseudo_pvalues = TRUE with method = "decoy_ecdf":

    Pseudo p-values are computed as right-tail probabilities under the decoy score distribution
    π₀ estimate: Estimates the proportion of true nulls; should be stable across λ
    BH comparison: Compares BH-FDR to TDA-FDR; agreement supports consistent FDR control

Important: These are diagnostic pseudo p-values, not guaranteed null p-values. They provide:

    ✓ Additional calibration perspectives
    ✓ Cross-validation between BH and TDA procedures
    ✓ Stratified stability checks (e.g., by run)

Caveat: Pseudo p-value calibration cannot detect scoring pathologies that affect both targets and decoys equally.


## Optional: other way to get p-value / pseudo-p-value diagnostics from MaxQuant scores

MaxQuant `Score` is not a p-value. However, you can construct **pseudo-p-values**
to run p-value-based calibration and BH-stability diagnostics. The most defensible option
for arbitrary scores is to use the empirical decoy null tail (`method = "decoy_ecdf"`),
which maps scores to right-tail probabilities under the decoy distribution.

```{r pseudo-p, eval=FALSE}
# Create pseudo-p-values from the decoy score tail and rerun diagnostics with p-value checks.
x_mq$p <- score_to_pvalue(
  score = x_mq$score,
  method = "decoy_ecdf",
  is_decoy = x_mq$is_decoy
)
attr(x_mq, "meta")$p_source <- "score_to_pvalue(method='decoy_ecdf' on MaxQuant Score)"

diag_p <- dfdr_run_all(
  xs = list(MaxQuant_PSM = x_mq),
  alpha_main = 0.01,
  p_stratify = "run"   # optional stratification if run column is meaningful
)

dfdr_write_report(diag_p, out_dir = "diagFDR_maxquant_out_with_p", formats = c("csv","png","manifest","readme","summary"))
dfdr_render_report(diag_p, out_dir = "diagFDR_maxquant_out_with_p")
```

## Interpretation notes

- **Universe / scope**: `read_dfdr_maxquant_msms()` produces a PSM-level universe; q-values
  are reconstructed by score-based TDC. Protein-level claims require additional inference.

- **Stability**: `D_alpha`, `CV_hat`, and `D_alpha_win` quantify whether the operating cutoff
  is well supported by decoys. Very strict cutoffs can yield sparse decoy tails.

- **Calibration**: equal-chance diagnostics are internal plausibility checks under the target–decoy model.
  Agreement with expectations supports (but does not prove) correct FDR control.

- **Pseudo-p-values** derived from the decoy tail can be used for additional calibration/stability diagnostics,
  but should be interpreted as diagnostic tools unless null calibration is theoretically guaranteed.
