---
title: "Data-Parallel Training"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data-Parallel Training}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = TRUE)
```

ggmlR provides `dp_train()` for data-parallel training across multiple GPUs
(or CPU cores).  Each replica processes its own sample per step; the resulting
gradients are averaged across replicas and applied in a single optimizer step.

```{r}
library(ggmlR)
```

---

## 1. Concept

`dp_train()` takes a **model factory** (`make_model`) instead of a model
instance.  It creates `n_gpu` identical replicas, synchronises their initial
weights, and runs a gradient-accumulation loop:

```
for each iteration:
  each replica  →  forward(sample_i)  →  loss  →  backward
  average gradients across replicas
  optimizer step on replica 0
  broadcast updated weights to all replicas
```

The effective batch size equals `n_gpu` (one sample per replica per step).
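
The averaging step is plain elementwise arithmetic over the per-parameter
gradients.  Below is a minimal base-R sketch of that step (an illustration
only, not the actual `dp_train()` internals), assuming each replica returns
its gradients as a named list of arrays:

```{r}
# Hypothetical gradients from two replicas, one entry per parameter
replica_grads <- list(
  list(w1 = matrix(1, 2, 2), b1 = c(1, 1)),
  list(w1 = matrix(3, 2, 2), b1 = c(3, 3))
)

# Elementwise mean across replicas, parameter by parameter
avg_grads <- lapply(names(replica_grads[[1]]), function(p)
  Reduce(`+`, lapply(replica_grads, `[[`, p)) / length(replica_grads))
names(avg_grads) <- names(replica_grads[[1]])

avg_grads$w1   # every entry is 2, the mean of 1 and 3
```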

---

## 2. Minimal example

```{r}
data(iris)
set.seed(42)

x_cm <- t(scale(as.matrix(iris[, 1:4])))    # [4, 150]
y_oh <- t(model.matrix(~ Species - 1, iris)) # [3, 150]

# Dataset as list of (x, y) pairs — one sample each
dp_data <- lapply(seq_len(ncol(x_cm)), function(i)
  list(x = x_cm[, i, drop = FALSE],
       y = y_oh[, i, drop = FALSE]))

# Model factory — called once per replica
make_model <- function() {
  ag_sequential(
    ag_linear(4L, 32L, activation = "relu"),
    ag_linear(32L, 3L)
  )
}

result <- dp_train(
  make_model = make_model,
  data       = dp_data,
  loss_fn    = function(out, tgt) ag_softmax_cross_entropy_loss(out, tgt),
  forward_fn = function(model, s)  model$forward(ag_tensor(s$x)),
  target_fn  = function(s)         s$y,
  n_gpu      = 1L,         # set to ggml_vulkan_device_count() for multi-GPU
  n_iter     = 2000L,
  lr         = 1e-3,
  verbose    = TRUE
)

cat("Final loss:", result$loss, "\n")
model <- result$model
```

---

## 3. Multi-GPU

```{r}
n_gpu <- max(1L, ggml_vulkan_device_count())
cat(sprintf("Training on %d GPU(s)\n", n_gpu))

result_mg <- dp_train(
  make_model = make_model,
  data       = dp_data,
  loss_fn    = function(out, tgt) ag_softmax_cross_entropy_loss(out, tgt),
  forward_fn = function(model, s)  model$forward(ag_tensor(s$x)),
  target_fn  = function(s)         s$y,
  n_gpu      = n_gpu,
  n_iter     = 2000L,
  lr         = 1e-3,
  max_norm   = 5.0,      # gradient clipping
  verbose    = FALSE
)
```

With `n_gpu = 2` the effective batch size is 2: each iteration processes two
samples instead of one, so sample throughput roughly doubles (ignoring the
overhead of gradient averaging and weight broadcasts).
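
To check the scaling on your own hardware, time a short run at each replica
count.  This is a rough sketch reusing the arguments from above; absolute
timings depend on the device and on the cost of gradient averaging and weight
broadcasts:

```{r}
for (g in unique(c(1L, n_gpu))) {
  elapsed <- system.time(
    dp_train(
      make_model = make_model,
      data       = dp_data,
      loss_fn    = function(out, tgt) ag_softmax_cross_entropy_loss(out, tgt),
      forward_fn = function(model, s)  model$forward(ag_tensor(s$x)),
      target_fn  = function(s)         s$y,
      n_gpu      = g,
      n_iter     = 200L,
      lr         = 1e-3,
      verbose    = FALSE
    )
  )[["elapsed"]]
  cat(sprintf("n_gpu = %d: %.1f s for 200 iterations\n", g, elapsed))
}
```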

---

## 4. Gradient clipping

Pass `max_norm` to clip the global gradient norm before each optimizer step:

```{r}
result <- dp_train(
  make_model = make_model,
  data       = dp_data,
  loss_fn    = function(out, tgt) ag_softmax_cross_entropy_loss(out, tgt),
  forward_fn = function(model, s)  model$forward(ag_tensor(s$x)),
  target_fn  = function(s)         s$y,
  n_gpu      = 1L,
  n_iter     = 2000L,
  lr         = 1e-3,
  max_norm   = 1.0       # clip to unit norm
)
```
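
Global-norm clipping rescales every gradient by `min(1, max_norm / norm)`,
where `norm` is the L2 norm over all parameter gradients taken together.  A
small base-R sketch of the computation (for illustration; not the code
`dp_train()` runs):

```{r}
clip_global_norm <- function(grads, max_norm) {
  # L2 norm over all parameter gradients combined
  total_norm <- sqrt(sum(vapply(grads, function(g) sum(g^2), numeric(1))))
  scale <- min(1, max_norm / total_norm)
  lapply(grads, function(g) g * scale)
}

g <- list(w = matrix(3, 2, 2), b = c(4, 0))
clipped <- clip_global_norm(g, max_norm = 1.0)
sqrt(sum(vapply(clipped, function(x) sum(x^2), numeric(1))))  # ~1 after clipping
```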

---

## 5. `ag_dataloader` — batched training loop

For standard single-process batched training, `ag_dataloader` is simpler than
`dp_train`:

```{r}
x_tr <- x_cm[, 1:120];  y_tr <- y_oh[, 1:120]

dl <- ag_dataloader(x_tr, y_tr, batch_size = 32L, shuffle = TRUE)

model2  <- make_model()
params2 <- model2$parameters()
opt2    <- optimizer_adam(params2, lr = 1e-3)

ag_train(model2)                          # training mode
for (ep in seq_len(100L)) {
  for (batch in dl$epoch()) {             # one pass over the shuffled mini-batches
    with_grad_tape({                      # record ops for automatic differentiation
      loss <- ag_softmax_cross_entropy_loss(
        model2$forward(batch$x), batch$y$data)
    })
    grads <- backward(loss)               # gradients for all tracked parameters
    opt2$step(grads);  opt2$zero_grad()   # update weights, then clear gradients
  }
}
```
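
Columns 121:150 were held out above, so they can serve as a quick check of
generalisation.  The sketch below assumes the output of `$forward()` exposes
its values through a `$data` field, as the dataloader batches do; adjust the
accessor if your version differs.  Note that `iris` is ordered by species, so
this held-out block contains only *virginica*:

```{r}
x_te <- x_cm[, 121:150];  y_te <- y_oh[, 121:150]

out_te <- model2$forward(ag_tensor(x_te))   # batched forward, as in the loop above
pred   <- max.col(t(out_te$data))           # predicted class per sample (assumes $data)
truth  <- max.col(t(y_te))                  # true class per sample
cat("Held-out accuracy:", mean(pred == truth), "\n")
```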

---

## 6. Full example

A detailed example with synthetic regression, multiple replica counts, and
correctness checks:

```{r}
# inst/examples/dp_train_demo.R
```

Multi-GPU scheduler usage (low level):

```{r}
# inst/examples/multi_gpu_example.R
```
