---
title: "Designing a Magenta Book evaluation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Designing a Magenta Book evaluation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(magentabook)
```

This vignette walks through the four canonical Magenta Book stages with a worked
example: a hypothetical GBP 50m skills programme aimed at increasing employment
among long-term unemployed claimants. We move from theory of change to evaluation
plan to power calculation to confidence rating, all in a single R session.

## Stage 1: theory of change

The theory of change links inputs through to long-run impact. `mb_theory_of_change()`
captures the five canonical Magenta Book levels plus assumptions and external
factors.

```{r}
toc <- mb_theory_of_change(
  inputs     = c("GBP 50m grant", "12 FTE programme team",
                 "Partnership with Jobcentre Plus"),
  activities = c("Design training curriculum",
                 "Deliver workshops in 50 sites",
                 "Provide ongoing mentoring"),
  outputs    = c("500 workshops delivered",
                 "8000 attendees",
                 "5000 completed mentoring blocks"),
  outcomes   = c("Improved employability skills",
                 "Increased job-search confidence",
                 "Higher application rates"),
  impact     = "Higher 12-month employment among long-term unemployed",
  assumptions = c(
    "Workshops cause skills uplift (not just selection of motivated attendees)",
    "Skills uplift translates into application behaviour",
    "Local labour markets absorb the additional applicants"
  ),
  external_factors = c(
    "Macro labour market remains broadly stable",
    "No competing employability programme launches in same areas"
  ),
  name = "Skills uplift programme"
)
toc
```

Pivot the theory of change into a logframe with indicators, means of
verification, and risks:

```{r}
mb_logframe(
  toc,
  indicators = list(
    outputs  = c("Workshops delivered", "Attendees per workshop"),
    outcomes = c("Skills score (post)", "Application count"),
    impact   = "Employment rate at 12 months"
  ),
  mov = list(
    outputs  = "Programme delivery log",
    outcomes = c("Pre/post survey", "DWP admin data"),
    impact   = "Linked HMRC PAYE records"
  ),
  risks = list(
    outputs  = "Attendance below planned levels",
    outcomes = "Self-report bias in skills score",
    impact   = "Macro shock confounds the estimate"
  )
)
```

Key assumptions belong in a separate register, graded by criticality:

```{r}
mb_assumptions(
  level = c("activities", "outcomes", "impact"),
  description = c(
    "Workshops are well-attended",
    "Skills uplift translates into job entry",
    "Employment rise persists at 12 months"
  ),
  evidence = c(
    "Pilot attendance was 80%",
    "Indirect: similar programmes show 0.3 SD effect",
    "Limited evidence on longer-run persistence"
  ),
  criticality = c("medium", "high", "high")
)
```

## Stage 2: evaluation plan

Tag the evaluation questions by Magenta Book type:

```{r}
qs <- mb_questions(
  text = c(
    "Did the programme cause higher 12-month employment",
    "How large is the effect, and for whom",
    "Was delivery faithful to the design",
    "What was the cost per additional job"
  ),
  type     = c("impact", "impact", "process", "economic"),
  priority = c("primary", "secondary", "secondary", "primary")
)
qs
```

Pin down the counterfactual:

```{r}
cf <- mb_counterfactual(
  definition  = "Eligible non-applicants matched on age, prior unemployment duration, and region",
  source      = "quasi-experimental",
  credibility = "Moderate; selection on observables only, but rich admin covariates available"
)
cf
```

Map stakeholders for governance:

```{r}
mb_stakeholders(
  name = c("HM Treasury", "DWP", "Local authorities", "What Works Centre"),
  role = c("Funder", "Policy lead", "Delivery", "Synthesis"),
  raci = c("A", "R", "C", "I"),
  interest  = c(5, 5, 4, 3),
  influence = c(5, 5, 3, 2)
)
```

Bundle into a plan:

```{r}
plan <- mb_evaluation_plan(
  scope = "GBP 50m programme, 50 sites, 2026-2029",
  questions = qs,
  methods = c(
    impact   = "Difference-in-differences with matched comparison group",
    process  = "Mixed-methods implementation review",
    economic = "Cost per job, with QALY-adjusted variant"
  ),
  timing = c(baseline = "2026-Q1", midline = "2027-Q4", endline = "2029-Q2"),
  governance = "Joint HMT / DWP steering group; peer review by What Works Centre",
  budget = 1.5e6
)
plan
```

## Stage 3: power and sample size

The Magenta Book stresses that an evaluation is only worth running if it can
detect effects of policy-relevant size. We size the study for a target
detectable effect of 5 percentage points on the employment rate, assuming
baseline employment of 30 percent, 80 percent power, and a two-sided
significance level of 5 percent.

Naive (individual-level) sample size:

```{r}
mb_sample_size(
  type = "proportion", p1 = 0.30, p2 = 0.35,
  power = 0.8, alpha = 0.05
)
```
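
As a sanity check, the naive figure should agree (up to rounding) with base
R's `power.prop.test()`, assuming `mb_sample_size()` uses the standard
two-sample normal approximation:

```{r}
# Per-arm n to detect a move from 30% to 35% with 80% power at alpha = 0.05
power.prop.test(p1 = 0.30, p2 = 0.35, power = 0.8, sig.level = 0.05)
```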

But the programme is delivered in clusters (sites), so we need to inflate by
the design effect. Jobcentre-level outcomes have an ICC around 0.04 (per the
bundled DWP reference values):

```{r}
mb_icc_reference("employment")
mb_cluster_design(individuals_per_cluster = 50, icc = 0.04, n_clusters = 25)
```
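
For intuition, the clustering penalty follows the standard design-effect
formula `1 + (m - 1) * ICC` for clusters of size `m`. A quick hand-check,
assuming `mb_cluster_design()` applies this standard formula:

```{r}
# Variance inflation from clustering: 50 per cluster at ICC = 0.04
m   <- 50
icc <- 0.04
1 + (m - 1) * icc
```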

A design effect close to 3 is a meaningful uplift: we would need roughly
three times the naive N per arm. Alternatively, a stepped-wedge design could
trade a larger total N for a staggered rollout that fits programme delivery:

```{r}
mb_stepped_wedge(
  steps = 5, clusters_per_step = 5,
  individuals_per_cluster = 50, icc = 0.04
)
```

What is the smallest effect we can detect with the planned design?

```{r}
mb_mde(
  n_per_group = 600, type = "proportion",
  baseline = 0.30, power = 0.8
)
```
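
The same check works in reverse: hold `n` fixed and let `power.prop.test()`
solve for `p2`. Under the same approximation, the detectable difference
should land near what `mb_mde()` reports:

```{r}
# Smallest detectable p2 given 600 per arm, baseline 30%, 80% power
power.prop.test(n = 600, p1 = 0.30, power = 0.8, sig.level = 0.05)
```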

## Stage 4: rate the evidence

Once the evaluation has run, score it on the Maryland Scientific Methods
Scale (SMS), which runs from 1 to 5 with 5 reserved for well-conducted
randomised designs:

```{r}
sms <- mb_sms_rate(
  level  = 4,
  study  = "Smith et al. (2029) Skills uplift evaluation",
  design = "Difference-in-differences with matched comparison",
  notes  = "Parallel trends supported by 4 pre-period observations; cluster-robust SEs"
)
sms
```

Record a structured confidence rating:

```{r}
conf_main <- mb_confidence(
  rating                 = "medium",
  question               = "Did the programme raise 12-month employment?",
  evidence_strength      = "One Level 4 DiD (n = 12000); supportive Level 3 cohort study",
  methodological_quality = "Adequate; parallel trends plausible; some attrition concerns",
  generalisability       = "Established across 50 sites in two regions",
  rationale              = "Effect direction consistent across two studies but limited replication outside the programme footprint"
)
conf_main

conf_process <- mb_confidence(
  rating                 = "high",
  question               = "Was the programme implemented faithfully?",
  evidence_strength      = "Mixed-methods process evaluation; 50-site fidelity audit",
  methodological_quality = "Strong; documented fidelity protocol with inter-rater reliability",
  generalisability       = "All sites covered",
  rationale              = "Comprehensive coverage; consistent fidelity scores"
)

mb_confidence_summary(conf_main, conf_process)
```

## Bringing it together

A single `mb_report` object aggregates everything:

```{r}
report <- mb_evaluation_report(
  plan       = plan,
  toc        = toc,
  sms        = sms,
  confidence = list(conf_main, conf_process),
  name       = "Skills uplift evaluation"
)
report
```

Export to LaTeX for a one-pager:

```{r}
cat(mb_to_latex(report, caption = "Skills uplift evaluation summary"))
```

Word and Excel exports are available via `mb_to_word()` and `mb_to_excel()`.
Both rely on optional packages: `officer` and `flextable` for Word,
`openxlsx` for Excel.
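
A minimal guard for those optional dependencies might look like this (the
`path` argument is an assumption for illustration, not a documented
signature; see `?mb_to_word`):

```{r, eval = FALSE}
# Attempt the Word export only when the optional packages are installed
if (requireNamespace("officer", quietly = TRUE) &&
    requireNamespace("flextable", quietly = TRUE)) {
  mb_to_word(report, path = "skills-uplift-summary.docx")  # path is assumed
}
```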

## Reproducibility

Every result object is stamped with the version of the package that produced
it. The bundled rubric and reference tables expose their sources via
`mb_data_versions()`:

```{r}
mb_data_versions()
```
