---
title: "Hypothesis Testing on Fitted Models"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Hypothesis Testing on Fitted Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4
)
```

## Why a Separate Vignette?

`vignette("maskedhaz")` establishes that `fit(model)(df, par)` returns a `fisher_mle` object, which realises both the `mle_fit` concept (from `algebraic.mle`) and the `algebraic.dist` distribution interface. That gives us `coef`, `vcov`, `confint`, `se`, `bias`, `observed_fim`, `score_val`, and a `logLik` method, all through generics the `maskedhaz` package itself never implemented.

This vignette takes the next step: **everything `hypothesize` does works on these fits directly.** No bridge code, no wrappers, just generics flowing through generics. We'll work through six questions a statistician would actually ask about a real analysis, using different test machinery for each.

```{r libs}
library(maskedhaz)
library(hypothesize)
library(algebraic.mle)
```

## The Scenario

We generate data from a 2-component Weibull series system (true shapes are 2, not 1) and pretend we don't know the true family. We'll fit both an exponential-series model (simpler, 2 params) and a Weibull-series model (fuller, 4 params) to the same data, then use hypothesis tests to decide which is justified.

```{r data}
set.seed(42)

true_model <- dfr_series_md(components = list(
  dfr_weibull(shape = 2, scale = 5),
  dfr_weibull(shape = 2, scale = 8)
))
df <- rdata(true_model)(theta = c(2, 5, 2, 8), n = 300, tau = 10, p = 0.3)
table(df$omega)
```

```{r fit-both}
null_model <- dfr_series_md(
  components = list(dfr_exponential(), dfr_exponential()),
  n_par = c(1L, 1L)
)
alt_model <- dfr_series_md(
  components = list(dfr_weibull(), dfr_weibull()),
  n_par = c(2L, 2L)
)

fit_null <- suppressWarnings(fit(null_model)(df, par = c(0.1, 0.1)))
fit_alt  <- suppressWarnings(fit(alt_model)(df, par = c(1.5, 4, 1.5, 6)))

coef(fit_null)
coef(fit_alt)
```

## Question 1: Is the Exponential Family Adequate?

The exponential-series model is nested inside the Weibull-series model (it's the special case where both shapes equal 1). A likelihood ratio test compares the two.

```{r lrt}
# lrt() accepts logLik objects directly and derives the dof difference
# from the `df` attribute attached by likelihood.model::logLik.fisher_mle
lrt_result <- lrt(logLik(fit_null), logLik(fit_alt))
lrt_result
```

The test statistic is $2(\ell_{\text{alt}} - \ell_{\text{null}})$, asymptotically $\chi^2$ with dof equal to the parameter-count difference (here, 2 extra Weibull shape parameters). The p-value is vanishingly small — the exponential family is rejected decisively.
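To make the statistic concrete, here is the same computation by hand in base R. The log-likelihood values below are made up for the sketch, not taken from the fits above:

```{r lrt-by-hand}
# Illustrative log-likelihoods (hypothetical values, not from the fits above)
ll_null <- -612.4
ll_alt  <- -583.1

lrt_stat <- 2 * (ll_alt - ll_null)                        # 2 * (ell_alt - ell_null)
lrt_p    <- pchisq(lrt_stat, df = 2, lower.tail = FALSE)  # 2 extra parameters
c(statistic = lrt_stat, p.value = lrt_p)
```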

Notice that we never wrote any maskedhaz-specific glue code. `logLik()` is a standard R generic, `hypothesize::lrt()` consumes its output. The `maskedhaz` package didn't have to know anything about `hypothesize`, and vice versa.

## Question 2: Is Each Individual Shape Significantly Different from 1?

The LRT rejects the global hypothesis "both shapes equal 1 simultaneously," but says nothing about each shape separately. For that we use Wald tests — one per shape parameter — feeding them the estimate and standard error directly.

```{r wald-per-param}
est <- coef(fit_alt)
se_vec <- se(fit_alt)         # from algebraic.mle

# Parameter layout: (shape1, scale1, shape2, scale2) = positions 1, 2, 3, 4
w_shape1 <- wald_test(est[1], se = se_vec[1], null_value = 1)
w_shape2 <- wald_test(est[3], se = se_vec[3], null_value = 1)

c(shape1_pval = pval(w_shape1),
  shape2_pval = pval(w_shape2))
```

Both individual tests reject `shape = 1` at any conventional level. The MLEs are about 1.8 and 2.3, and the standard errors are small enough to distinguish both from 1 cleanly.
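Under the hood, each Wald test reduces to a z-statistic and a two-sided normal p-value. A base-R sketch with hypothetical numbers (not the estimates above):

```{r wald-by-hand}
est_shape <- 1.9    # hypothetical shape estimate
se_shape  <- 0.12   # hypothetical standard error
z <- (est_shape - 1) / se_shape   # Wald z against the null value 1
p <- 2 * pnorm(-abs(z))           # two-sided normal p-value
c(z = z, p.value = p)
```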

## Question 3: Controlling Family-Wise Error

If we plan to report these two tests as a family (making decisions only when the family-wise error rate is bounded), we should adjust. `adjust_pval()` accepts a list of tests and returns a list of tests with the corrected p-values stored.

```{r adjust}
adjusted <- adjust_pval(list(w_shape1, w_shape2), method = "bonferroni")
c(shape1_adj = pval(adjusted[[1]]),
  shape2_adj = pval(adjusted[[2]]))
```

With only two tests, Bonferroni simply doubles each p-value; both remain far below 0.05. The point is that `adjust_pval()` returns `hypothesis_test` objects — they continue to compose with the rest of the algebra. You could hand these adjusted tests into `is_significant_at()`, `invert_test()`, or anything else that expects a test.
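The same correction is available in base R as `stats::p.adjust()`, which makes a handy cross-check on `adjust_pval()`. The raw p-values below are made up for illustration:

```{r padjust-by-hand}
p_raw <- c(0.004, 0.011)                # hypothetical raw p-values
p.adjust(p_raw, method = "bonferroni")  # each doubled, capped at 1
```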

## Question 4: Composite Hypotheses

A different but related question: is the *conjunction* "shape1 = 1 AND shape2 = 1" supported by the data? Two natural ways to build a test:

**Intersection-union via max p-value**. `intersection_test()` takes the max of the individual p-values. If the max is small, both individual tests have rejected, so the conjunction is rejected too.

```{r intersection}
comp_intersection <- intersection_test(w_shape1, w_shape2)
pval(comp_intersection)
```

**Fisher's method for combining independent p-values**. Under the conjunction null, $-2 \sum_i \log p_i$ follows $\chi^2_{2k}$, where $k$ is the number of tests (here $k = 2$). `fisher_combine()` implements this.

```{r fisher-combine}
comp_fisher <- fisher_combine(w_shape1, w_shape2)
comp_fisher
```

Both agree that the data contradict the composite exponential null, but with different sensitivity: Fisher's method uses more of the information in moderate p-values, while max-p is conservative when some tests are strong and others weak.
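Fisher's combination is simple enough to verify by hand. With two hypothetical p-values:

```{r fisher-by-hand}
p_vals <- c(0.03, 0.04)                  # hypothetical independent p-values
stat   <- -2 * sum(log(p_vals))          # -2 * sum(log p_i)
p_comb <- pchisq(stat, df = 2 * length(p_vals), lower.tail = FALSE)
c(statistic = stat, p.value = p_comb)
```

The combined p-value (roughly 0.009) is smaller than either input, whereas max-p would report 0.04.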

## Question 5: A Score Test Without Refitting

The LRT and Wald tests both required fitting the full Weibull model. A **score test** can evaluate a null hypothesis using only the null model and the full-model score/Hessian machinery; no alternative fit is needed. This is useful when the alternative is expensive to fit or not globally identified.

For the composite null "all shapes = 1, scales from the null MLE," we evaluate the score and observed Fisher information of the Weibull-series model at the null parameter vector.

```{r score-test}
# Map the exp-series MLE into the Weibull parameter space:
# (shape = 1, scale = 1/rate) for each component
null_in_alt <- c(1, 1/coef(fit_null)[1],
                 1, 1/coef(fit_null)[2])

# Score and observed Fisher info of the ALT model at the null point
s_at_null <- score(alt_model)(df, par = null_in_alt)
I_at_null <- -hess_loglik(alt_model)(df, par = null_in_alt)

s_at_null

sc <- score_test(s_at_null, I_at_null, null_value = null_in_alt)
sc
```

Two things are worth pointing out. First, the test statistic $s^\top I^{-1} s$ is large relative to its $\chi^2_4$ reference distribution: the null is rejected strongly, agreeing with the LRT and Wald results. Second, look at the *components* of the score vector: positions 1 and 3 (the shape entries) are far from zero, while positions 2 and 4 (the scales) are nearly zero. The likelihood gradient at the null "wants" to move shape up, not scale; the null MLE already got the scales roughly right for a shape-1 fit. That's diagnostic information the LRT doesn't give you for free.
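The quadratic form itself is a one-liner. Here is a reduced two-parameter sketch with a hypothetical score vector and observed information matrix (with these toy numbers the statistic is small; in the analysis above the four-dimensional analogue comes out large):

```{r score-by-hand}
s <- c(4.2, -0.3)                       # hypothetical score at the null
I <- matrix(c(25, 3, 3, 16), nrow = 2)  # hypothetical observed information
stat <- drop(t(s) %*% solve(I, s))      # s' I^{-1} s
p    <- pchisq(stat, df = length(s), lower.tail = FALSE)
c(statistic = stat, p.value = p)
```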

## Question 6: Confidence Intervals by Test Inversion

A $100(1-\alpha)\%$ confidence set for a parameter is, by definition, the set of null values the test *fails* to reject at level $\alpha$. `invert_test()` automates this inversion over a user-supplied grid.

```{r invert-test}
# Build a Wald test parameterised by shape1's null value
wald_for_shape1 <- function(theta0) {
  wald_test(est[1], se = se_vec[1], null_value = theta0)
}

ci_shape1 <- invert_test(wald_for_shape1, grid = seq(1.0, 3.0, by = 0.01), alpha = 0.05)
c(lower = lower(ci_shape1), upper = upper(ci_shape1))
```

Compare with the direct Wald CI:

```{r direct-ci}
confint(wald_test(est[1], se = se_vec[1], null_value = est[1]))
```

They agree to within the grid resolution. For a Wald test this redundancy is by design (the two are algebraically identical). The real power of `invert_test()` shows when you plug in a *different* test function, e.g. a likelihood-ratio test with a profile likelihood, to get a **non-Wald** confidence region whose shape is driven by the actual likelihood surface rather than the quadratic approximation. That's expensive (it requires refitting under each grid point) but fully composable: the test function just has to return a `hypothesis_test`.
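The inversion loop itself is easy to reproduce in plain base R using the Wald p-value formula directly (hypothetical estimate and standard error, not the fit above):

```{r invert-by-hand}
est0 <- 1.9; se0 <- 0.12                       # hypothetical estimate and SE
grid  <- seq(1.0, 3.0, by = 0.01)
pvals <- 2 * pnorm(-abs((est0 - grid) / se0))  # Wald p-value at each null value
kept  <- grid[pvals >= 0.05]                   # null values we fail to reject
range(kept)                                    # grid-resolution CI endpoints
```

The endpoints land within one grid step of the closed-form Wald interval `est0 +/- 1.96 * se0`.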

## Cross-Implementation Check

Because `maskedhaz` and `maskedcauses` both implement the `series_md` protocol and both return `fisher_mle` objects, the hypothesis-testing code is literally *identical* across them. We can verify by fitting the same data with `maskedcauses`' closed-form exponential-series likelihood and running the same LRT.

```{r cross-check, eval = requireNamespace("maskedcauses", quietly = TRUE)}
library(maskedcauses)

fit_null_mc <- suppressWarnings(
  likelihood.model::fit(exp_series_md_c1_c2_c3())(df, par = c(0.1, 0.1))
)

# LRT using maskedcauses null fit + maskedhaz alt fit, across two packages
lrt_cross <- lrt(logLik(fit_null_mc), logLik(fit_alt))
c(
  lrt_same_package = pval(lrt_result),
  lrt_cross_package = pval(lrt_cross)
)
```

The p-values agree. The test statistic, dof, and conclusion are the same because both packages compute the same likelihood (just via different numerical strategies). `hypothesize` doesn't know or care which package the logLik came from — it only needs the `logLik` generic with the `df` attribute. That's the protocol in action.

## Summary

We ran six distinct hypothesis-testing analyses on the same fitted models, each using different test machinery:

| Question | Test | `hypothesize` function | Input from `maskedhaz` fit |
|----------|------|-----------------------|---------------------------|
| Is the simpler model enough? | LRT | `lrt()` | `logLik(fit_null)`, `logLik(fit_alt)` |
| Is each parameter different from the null value? | Wald | `wald_test()` | `coef(fit)`, `se(fit)` |
| Family-wise error control | Multiple testing | `adjust_pval()` | list of Wald tests |
| Is the composite null rejected? | Intersection-union, Fisher | `intersection_test()`, `fisher_combine()` | individual tests |
| Alt-model-free null rejection | Score | `score_test()` | `score()`, `hess_loglik()` from the alt model at the null point |
| Confidence interval as rejection set | Test inversion | `invert_test()` | a test-fn parameterised by the null value |

None of this is `maskedhaz`-specific. Everything works through the `logLik` / `coef` / `vcov` / `se` / `score_val` / `observed_fim` generics that `algebraic.mle` and `likelihood.model` wired up long before `hypothesize` existed. The layered design lets each package do its one job well; capabilities compose transparently.
