---
title: "The idionomics Pipeline: keeping individual-level information to find functionally relevant principles"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The idionomics Pipeline: keeping individual-level information to find functionally relevant principles}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  message  = TRUE,
  warning  = FALSE
)
```

## The idionomic science principle

Classical longitudinal methods (e.g., multilevel models) estimate one set of parameters that is shared, or partially shared, across all units of an ensemble. If the trajectories of individuals are the focus of interest, this is sensible only under hard-to-meet assumptions such as exchangeability and/or ergodicity. When these assumptions fail, ensemble averages may systematically obscure individual differences: an average positive effect may coexist with a substantial subset of individuals for whom the effect is negative, nonsignificant, or zero.

Idionomic science inverts the order of operations:

1. **Unit first.** Fit a model to each person's time series independently, capturing that person's unique dynamics, residual structure, and effect sizes.
2. **Group later.** Aggregate the individual estimates with meta-analytic methods that explicitly represent and quantify heterogeneity across people.
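In miniature, the two steps can be sketched in base R: fit one regression per simulated subject, then pool the slopes with inverse-variance weights (a fixed-effect stand-in for the random-effects pooling via `metafor::rma()` used later; all data here are placeholders):

```r
set.seed(1)
# Unit first: one model per subject, each with their own true effect
per_subject <- lapply(1:5, function(i) {
  x <- rnorm(60)
  beta_i <- rnorm(1, mean = 0.5, sd = 0.3)      # person-specific true effect
  y <- beta_i * x + rnorm(60)
  summary(lm(y ~ x))$coefficients["x", c("Estimate", "Std. Error")]
})
est <- sapply(per_subject, `[`, 1)
se  <- sapply(per_subject, `[`, 2)

# Group later: inverse-variance pooled effect across subjects
w <- 1 / se^2
pooled <- sum(w * est) / sum(w)
round(pooled, 2)
```

The per-subject spread of `est` around `pooled` is exactly the heterogeneity the meta-analytic step quantifies.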

The recommended pipeline is:

```
i_screener()  →  pmstandardize()  →  i_detrender()  →  iarimax()/looping_machine()  →  i_pval() / sden_test()
```

---

## Synthetic data

This vignette walks through each step with a synthetic panel of 12 subjects, each contributing 50 time points. The data include four variables (`a`, `b`, `c`, `y`) with known relationships: nine subjects are generated from a common process, and three hand-crafted subjects (10, 11, 12) demonstrate specific pipeline behaviors.

```{r load}
library(idionomics)

set.seed(42)
panel <- do.call(rbind, lapply(1:9, function(id) {
  a <- rnorm(50)
  b <- 0.4 * a + rnorm(50)
  c <- 0.4 * b + rnorm(50)
  data.frame(id = as.character(id), time = seq_len(50),
             a = a, b = b, c = c,
             y = 0.5 * a + rnorm(50),
             stringsAsFactors = FALSE)
}))

# Subject 10: near-constant "a" — will be caught by i_screener(min_sd = 0.5)
s10 <- data.frame(id = "10", time = seq_len(50),
                  a = rep(3, 50), b = rnorm(50), c = rnorm(50),
                  y = rnorm(50), stringsAsFactors = FALSE)

# Subject 11: full positive loop (a -> b -> c -> a)
a11 <- rnorm(50)
b11 <- 0.6 * a11 + rnorm(50, sd = 0.5)
c11 <- 0.6 * b11 + rnorm(50, sd = 0.5)
s11 <- data.frame(id = "11", time = seq_len(50),
                  a = a11 + 0.4 * c11, b = b11, c = c11,
                  y = 0.5 * a11 + rnorm(50), stringsAsFactors = FALSE)

# Subject 12: negative a -> y effect
a12 <- rnorm(50)
s12 <- data.frame(id = "12", time = seq_len(50),
                  a = a12, b = 0.4 * a12 + rnorm(50),
                  c = rnorm(50), y = -0.5 * a12 + rnorm(50),
                  stringsAsFactors = FALSE)

panel <- rbind(panel, s10, s11, s12)
```

The data-generating process:

- Subjects 1--9: `a` is independent noise, `b = 0.4*a + noise`, `c = 0.4*b + noise`, `y = 0.5*a + noise`. The chain `a -> b -> c` provides signal for `looping_machine()`, and `a -> y` provides signal for `iarimax()`. The `c -> a` leg has no true signal — the loop should not close for most subjects.
- Subject 10: `a` is constant (`rep(3, 50)`), so it has zero within-person variance. `i_screener()` will catch this.
- Subject 11: all three loop legs have true positive signal, including `c -> a`. This subject should show a positive directed loop.
- Subject 12: the `a -> y` effect is negative (`-0.5`), creating heterogeneity in the meta-analysis.

---

## Step 1: Data quality screening — `i_screener()`

Before standardizing, screen subjects on their raw data. After `pmstandardize()`, all non-constant series have within-person variance = 1 by construction, making variance-based filters ineffective. `i_screener()` catches low-quality subjects at the right stage.

```{r screener}
panel_clean <- i_screener(panel, cols = c("a", "b", "c", "y"), id_var = "id",
                          min_n_subject = 20, min_sd = 0.5, verbose = TRUE)
```

Subject 10 is removed because `a` has SD = 0, which falls below `min_sd = 0.5`.

Three criteria are available (all optional except `min_n_subject`):

| Criterion | What it catches |
|---|---|
| `min_n_subject` | Too few observations |
| `min_sd` | Near-constant series (floor/ceiling responders) |
| `max_mode_pct` | "Stuck" responders (e.g. >= 80% of responses identical) |
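The intuition behind the filters can be reproduced in base R (conceptual metrics, not the package's internals):

```r
set.seed(10)
x_stuck <- c(rep(5, 45), rnorm(5, mean = 5))   # a "stuck" responder with brief wobble
sd(x_stuck)                                     # nonzero, so an SD filter alone may miss it
mode_pct <- max(table(x_stuck)) / length(x_stuck)
mode_pct                                        # 0.9: caught by max_mode_pct = 0.80
```

This is why the criteria are complementary: `min_sd` flags flat series, while `max_mode_pct` flags series that vary occasionally but are otherwise pinned to one value.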

Use `mode = "report"` to inspect quality metrics without committing to exclusion:

```{r screener-report}
report <- i_screener(panel, cols = c("a", "b", "c", "y"), id_var = "id",
                     min_n_subject = 20, min_sd = 0.5, max_mode_pct = 0.80,
                     mode = "report", verbose = TRUE)
print(report)
```

---

## Step 2: Within-person standardization — `pmstandardize()`

Person-mean standardization computes `(x - person_mean) / person_sd` for each subject and column. This removes between-person differences in level and scale, making coefficients comparable across subjects. Output columns are named `<col>_psd`.

```{r pmstandardize}
panel_std <- pmstandardize(panel_clean, cols = c("a", "b", "c", "y"), id_var = "id",
                           verbose = TRUE)
head(panel_std[, c("id", "time", "a_psd", "b_psd", "c_psd", "y_psd")])
```

Edge cases are handled automatically: all-NA series remain NA; single-observation series return 0; constant series (zero SD) return 0 and are filtered downstream by `iarimax()`'s `minvar` guard.
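A base-R sketch of the transformation and the edge-case rules described above (a conceptual re-implementation, not the package's code):

```r
pm_std <- function(x) {
  if (all(is.na(x))) return(x)                               # all-NA stays NA
  if (sum(!is.na(x)) == 1) return(replace(x, !is.na(x), 0))  # single observation -> 0
  s <- sd(x, na.rm = TRUE)
  if (s == 0) return(replace(x, !is.na(x), 0))               # constant series -> 0
  (x - mean(x, na.rm = TRUE)) / s
}

x <- c(2, 4, 6, 8)
round(c(mean = mean(pm_std(x)), sd = sd(pm_std(x))), 10)  # mean 0, sd 1
pm_std(rep(3, 5))                                         # constant -> all zeros
```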

---

## Step 3: Linear detrending — `i_detrender()`

`i_detrender()` removes the within-person linear time trend via `lm(col ~ time)` and appends the residuals as new columns (`<col>_dt`). This reduces the integration order `auto.arima()` selects, improving comparability across subjects' models.

Filtering is per-column and independent: a subject can produce valid detrended residuals for one variable while receiving `NA` in another if that column fails the variance or observation-count thresholds.
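For one subject and one column, the core operation is just the residuals of a linear fit on time (a minimal base-R sketch):

```r
set.seed(1)
time <- 1:50
x <- 0.03 * time + rnorm(50)        # series with a linear time trend
x_dt <- resid(lm(x ~ time))         # detrended residuals, as in `<col>_dt`
coef(lm(x_dt ~ time))["time"]       # residual slope is zero (up to rounding)
```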

```{r detrender}
panel_dt <- i_detrender(panel_std, cols = c("a_psd", "b_psd", "c_psd", "y_psd"),
                        id_var = "id", timevar = "time", verbose = TRUE)
head(panel_dt[, c("id", "time", "a_psd_dt", "b_psd_dt", "c_psd_dt", "y_psd_dt")])
```

> **Pipeline order matters.** Detrending after standardization is preferred. Reversing the order (`i_detrender()` then `pmstandardize()`) can amplify near-zero residuals to unit variance, bypassing `iarimax()`'s `minvar` filter.
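The hazard described in the note can be made concrete: an almost pure trend leaves near-zero residual variance, and standardizing afterwards would inflate that noise to unit variance (placeholder numbers):

```r
set.seed(2)
time <- 1:50
x <- 0.1 * time + rnorm(50, sd = 1e-4)   # nearly deterministic trend
r <- resid(lm(x ~ time))                  # residual SD is tiny...
sd(r)
r_std <- as.numeric(scale(r))             # ...but standardizing rescales it to 1
sd(r_std)
```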

---

## Step 4a: Per-subject ARIMAX and meta-analysis — `iarimax()`

`iarimax()` fits one `forecast::auto.arima()` model per subject, extracts coefficients via `broom::tidy()`, and pools the focal predictor's coefficients with `metafor::rma()` (REML).
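Conceptually, each per-subject fit is an ARIMA model with an exogenous regressor. A minimal stand-in using `stats::arima()` from base R (not the package's `forecast::auto.arima()` call; the order is fixed by hand here rather than selected automatically):

```r
set.seed(3)
x <- rnorm(200)
y <- 0.5 * x + arima.sim(model = list(ar = 0.4), n = 200)  # AR(1) errors
fit <- arima(y, order = c(1, 0, 0), xreg = cbind(x = x))
coef(fit)["x"]   # recovers the true exogenous effect (0.5) up to sampling error
```

`iarimax()` repeats this per subject, then feeds the per-subject `x` coefficients and standard errors into the meta-analysis.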

```{r iarimax}
result <- iarimax(panel_dt,
                  y_series  = "y_psd_dt",
                  x_series  = "a_psd_dt",
                  id_var    = "id",
                  timevar   = "time",
                  verbose   = TRUE)
```

By default, `auto.arima()` selects the differencing order `d` independently per subject. When `d` varies across subjects, some coefficients describe effects on levels (`d = 0`) while others describe effects on changes (`d = 1`), making them less comparable in the meta-analysis. Use `fixed_d` to impose a common differencing order:

```{r iarimax-fixed-d, eval = FALSE}
# Fix d = 0 for all subjects (coefficients describe effects on levels)
result_d0 <- iarimax(panel_dt, y_series = "y_psd_dt", x_series = "a_psd_dt",
                     id_var = "id", timevar = "time", fixed_d = 0)
```

### Summary

```{r iarimax-summary}
summary(result)
```

### Caterpillar plot

```{r iarimax-plot, fig.width = 6, fig.height = 5}
plot(result,
     y_series_name = "Outcome (y)",
     x_series_name = "Predictor (a)")
```

Each bar is a per-subject confidence interval, colored green (significantly positive), red (significantly negative), or black (crosses zero). The blue band is the 95% CI of the pooled effect from the random-effects meta-analysis (REMA). Subject 12 (negative `a -> y` effect) should appear on the negative side.

### Per-subject results

The full individual-level output is in `$results_df`:

```{r results-df}
head(result$results_df[, c("id", "nAR", "nI", "nMA",
                           "estimate_a_psd_dt", "std.error_a_psd_dt",
                           "n_valid", "n_params", "raw_cor")])
```

---

## Step 4b: Directed loop detection — `looping_machine()`

`looping_machine()` tests whether three variables form a directed positive loop (a -> b -> c -> a). It fits three I-ARIMAX legs, applies `i_pval()` to each, and computes `Loop_positive_directed`: a 0/1 indicator that is 1 only when all three focal betas are positive *and* significant at `alpha`.
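The decision rule is a simple conjunction, which can be mirrored in base R with hypothetical per-leg results (placeholder betas and p-values):

```r
alpha <- 0.05
# Hypothetical focal beta and p-value for each leg: a->b, b->c, c->a
legs <- data.frame(leg  = c("a->b", "b->c", "c->a"),
                   beta = c(0.6, 0.5, 0.4),
                   pval = c(0.010, 0.030, 0.002))
Loop_positive_directed <- as.integer(all(legs$beta > 0 & legs$pval < alpha))
Loop_positive_directed   # 1: all three legs positive and significant
```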

```{r looping-machine}
loop_result <- looping_machine(panel_dt,
                               a_series = "a_psd_dt",
                               b_series = "b_psd_dt",
                               c_series = "c_psd_dt",
                               id_var   = "id",
                               timevar  = "time",
                               verbose  = TRUE)
```

```{r looping-inspect}
# Proportion of subjects with a detected positive directed loop
mean(loop_result$loop_df$Loop_positive_directed, na.rm = TRUE)

# Per-leg I-ARIMAX results are also accessible
summary(loop_result$iarimax_a_to_b)
```

Because the positive directed loop is a conjunction of three one-sided tests, it is very conservative: under the global null and independence, the false positive rate is approximately $(\alpha/2)^3$.
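The arithmetic behind that rate, for the default `alpha = 0.05`:

```r
alpha <- 0.05
(alpha / 2)^3   # 1.5625e-05: chance that three independent null legs
                # all come out positive and significant at once
```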

Optional arguments include `covariates` (shared extra predictors for all three legs), `include_third_as_covariate` (adds the third loop variable as a covariate in each leg), and `fixed_d` (fixes the differencing order across all legs for comparability).

---

## Step 5: Per-subject p-values — `i_pval()`

`i_pval()` attaches a `pval_<feature>` column to `results_df` using a two-tailed t-distribution with degrees of freedom taken from the ML fit as `n_valid - n_params` (valid observations minus estimated parameters).
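The computation for a single subject, in base R (placeholder numbers):

```r
est <- 0.45; se <- 0.15          # hypothetical coefficient and standard error
n_valid <- 48; n_params <- 4     # valid observations and estimated parameters
df <- n_valid - n_params
pval <- 2 * pt(-abs(est / se), df = df)   # two-tailed t-distribution p-value
pval
```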

```{r i-pval}
result_pval <- i_pval(result)
result_pval$results_df[, c("id", "estimate_a_psd_dt", "pval_a_psd_dt")]
```

---

## Step 6: Sign Divergence / Equisyncratic Null test — `sden_test()`

`sden_test()` runs a binomial test on the count of individually significant effects. Two variants are available and selected automatically based on the REMA pooled effect:

- **ENT** (Equisyncratic Null Test): tests whether any-direction significant effects exceed the rate expected by chance (`alpha_arimax`). Used when the pooled effect is non-significant.
- **SDT** (Sign Divergence Test): tests whether effects in the counter-pooled direction exceed chance (`alpha_arimax / 2`). Used when the pooled effect is significant.

The auto-selection pivot is fixed at p = 0.05, independent of `alpha_arimax`.
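Both variants reduce to a one-sided binomial test on a count of individually significant subjects; a base-R sketch with hypothetical counts (conceptual, not the package's implementation):

```r
alpha_arimax <- 0.05
n_subjects <- 11

# ENT: 5 of 11 subjects significant in either direction vs. chance rate alpha
ent <- binom.test(5, n_subjects, p = alpha_arimax, alternative = "greater")
ent$p.value

# SDT: 2 of 11 subjects significant against the pooled sign vs. chance rate alpha/2
sdt <- binom.test(2, n_subjects, p = alpha_arimax / 2, alternative = "greater")
sdt$p.value
```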

```{r sden}
sden <- sden_test(result_pval)
summary(sden)
```

You can also force a specific test:

```{r sden-force}
sden_ent <- sden_test(result_pval, test = "ENT")
summary(sden_ent)
```
