Abstract

When performing cross-validation, we tend to go with the common 10 folds (`k = 10`). In this vignette, we try different numbers of folds and assess the differences in performance. To make our results robust to this choice, we average the results of the different settings. The functions of interest are `cross_validate_fn()` and `groupdata2::fold()`.

Contact the author at r-pkgs@ludvigolsen.dk

When performing cross-validation, it is common to use 10 folds. Why? It is the common thing to do of course! Not 9 or 11, but 10, and sometimes 5, and sometimes n folds (i.e. leave-one-out cross-validation).

While having a standard setting means one less thing to worry about, let’s spend a few minutes discussing this choice. Whether it is reasonable comes down to the context of course, but these are some general thoughts on the topic:

A higher `k` (number of folds) means that each model is trained on a larger training set and tested on a smaller test fold. In theory, this should lead to a *lower* prediction error, as the models see more of the available data.

A lower `k` means that each model is trained on a smaller training set and tested on a larger test fold. Here, the potential for the data distribution in the test fold to differ from the training set is bigger, and we should thus expect a *higher* prediction error *on average*.
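To make the arithmetic concrete, here is a small illustrative sketch (not part of the vignette's analysis) of how the training/test split sizes change with `k` for a dataset of 150 rows:

```
# Training and test fold sizes for a dataset with 150 rows
n <- 150
for (k in c(3, 10)) {
  test_size <- n / k
  train_size <- n - test_size
  cat("k =", k, "-> train:", train_size, "rows, test:", test_size, "rows\n")
}
#> k = 3 -> train: 100 rows, test: 50 rows
#> k = 10 -> train: 135 rows, test: 15 rows
```

With `k = 10`, each model is trained on 90% of the data; with `k = 3`, only two thirds.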

In the plot below, some generated data has been split into 3 (left)
and 10 (right) folds. Each line represents the best linear model for one
of the folds (i.e. the model that would have the lowest prediction error
when testing on that fold). When `k = 3`, a single fold with a highly different distribution from the other two folds can have a big impact on the cross-validated prediction error. When `k = 10`, a few of the folds may differ greatly as well, but on average, the model will be closer to the model that overall reduces the prediction error the most:

Note that this picture changes with different random seeds. To check whether a lower number of folds does indeed tend to give higher prediction errors, we run this 100 times and average the results. That is, we randomly generate 100 datasets and cross-validate a linear model (`y ~ x`) on each of them. We then average the RMSE (Root Mean Square Error) and MAE (Mean Absolute Error) to get the following results:

```
#> # A tibble: 2 × 3
#> `Fold Column` RMSE MAE
#> <fct> <dbl> <dbl>
#> 1 k = 3 0.976 0.766
#> 2 k = 10 0.902 0.758
```

Both the RMSE and MAE are higher in the `k = 3` setting. As a matter of fact, this was the case in 91% of the runs. This supports that (*on average*) the prediction error should be lower with a larger `k`. Let's see a violin plot of the simulations as well:
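The simulation code itself is not shown here, but a minimal sketch of a single run (one generated dataset, `y ~ x` cross-validated with 3 and 10 folds; base R only, with hypothetical names) could look like this:

```
set.seed(1)

# Simulate one dataset where y is a noisy linear function of x
n <- 100
sim_data <- data.frame(x = rnorm(n))
sim_data$y <- 2 * sim_data$x + rnorm(n)

# Cross-validate `y ~ x` with `k` folds and return the RMSE
cv_rmse <- function(data, k) {
  # Assign each row to one of `k` folds at random
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  # Collect the test-fold prediction errors from each fold
  errors <- unlist(lapply(seq_len(k), function(fold) {
    fit <- lm(y ~ x, data = data[folds != fold, ])
    test <- data[folds == fold, ]
    test$y - predict(fit, newdata = test)
  }))
  sqrt(mean(errors^2))
}

cv_rmse(sim_data, k = 3)
cv_rmse(sim_data, k = 10)
```

Repeating this over 100 simulated datasets and averaging the two error metrics gives results like those above.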

So… Why not just always use the highest possible number of folds?

A higher number of folds means training a lot more models, which can be computationally heavy and time-consuming. So finding a lower `k` that yields a similar prediction error most of the time can be very useful. For the rest of this vignette, we won't go in-depth with such limited-resources scenarios though.

We might ask whether this even matters. If the goal of our cross-validation is to compare a set of models and then choose the best one, what matters the most is whether the same model would be picked with different settings of `k`. But how can we make sure that the same model is selected without trying multiple settings? And if it is not, which result do we choose (without cherry-picking)?

An approach to minimizing the effect of `k` on our model selection could be to run the cross-validation with multiple `k`s and then average the results. In general, repeated cross-validation (where we average over results from multiple fold splits) is a great choice when possible, as it is more robust to the random fold splits. In the next section, we will run repeated cross-validation with different `k` settings and plot the results.

Our goal here is two-fold:

1. Try multiple values of `k` (different numbers of folds) and see the effect on the prediction error.
2. Repeat each scenario multiple times to get more robust results.

Whereas the previous section used a regression example (continuous y-variable), we will now perform multiclass classification on the `iris` dataset. This fairly well-known dataset has three species of iris flowers, with 50 flowers from each species. The predictors are length and width measurements of the sepals and petals.

First, we attach the needed packages and set a random seed:

```
library(cvms) # version >= 1.2.2
library(groupdata2) # version >= 1.4.1
library(dplyr)
library(ggplot2)
xpectr::set_test_seed(1)
```

As the fold creation and cross-validation will take some time to run, we can enable parallelization to speed up the processes:

```
# Enable parallelization
# NOTE: Uncomment to run
# library(doParallel)
# doParallel::registerDoParallel(6)
```

Now, we load the data and convert it to a `tibble`:

```
# Load iris
data("iris")
# Convert iris to a tibble
iris <- dplyr::as_tibble(iris)
iris
#> # A tibble: 150 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ℹ 140 more rows
```

We count the rows per species to ensure everything is in order:

```
iris %>%
  dplyr::count(Species)
#> # A tibble: 3 × 2
#> Species n
#> <fct> <int>
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
```

When creating the folds, we would like to balance them such that the distribution of the species is similar in all the folds. In `groupdata2::fold()`, this is possible with the `cat_col` argument. This also ensures that there's at least 1 of each species in each fold. With this approach, our maximum number of folds becomes 50. The lowest number of folds we can *meaningfully* generate is technically 2, but that seems unlikely in practice, so we will set our lower limit at 3 (arbitrary yes, but I get to choose here!). We thus pick 10 `k`s in that range using the `seq()` function.
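Since `cat_col = "Species"` requires at least one row per species in every fold, the smallest class count caps `k`. A quick sketch of that check:

```
# The smallest class count caps `k` when stratifying with `cat_col`
max_k <- min(table(iris$Species))
max_k
#> [1] 50
```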

As we are interested in comparing the results at each `k` setting, we repeat each of the settings 3 times to have more robustness towards the randomness of the splits. You might want to increase this to 10 repetitions, but that would increase the running time too much for this tutorial. If you are only interested in the average results, you might not need to repeat each setting, as the multiple settings of `k` become a kind of repeated cross-validation in themselves.

```
# Generate sequence of `k` settings in the 3-50 range
fold_counts <- round(seq(from = 3, to = 50, length.out = 10))
# Repeat each 3 times
fold_counts <- rep(fold_counts, each = 3)
fold_counts
#> [1] 3 3 3 8 8 8 13 13 13 19 19 19 24 24 24 29 29 29 34 34 34 40 40 40 45
#> [26] 45 45 50 50 50
```

We pass this sequence of counts to the `k` argument in `groupdata2::fold()`. We must also set the `num_fold_cols` argument to match the length of our sequence. As explained previously, we set `cat_col = "Species"` to ensure a balanced distribution of the species in all folds. Finally, we enable parallelization to speed things up:

```
data <- iris %>%
  groupdata2::fold(
    k = fold_counts,
    cat_col = "Species",
    num_fold_cols = length(fold_counts), # Must match the length of `k`
    parallel = TRUE
  )
```

```
data
#> # A tibble: 150 × 35
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species .folds_1 .folds_2
#> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa 1 2
#> 2 4.9 3 1.4 0.2 setosa 2 2
#> 3 4.7 3.2 1.3 0.2 setosa 2 1
#> 4 4.6 3.1 1.5 0.2 setosa 3 2
#> 5 5 3.6 1.4 0.2 setosa 1 2
#> 6 5.4 3.9 1.7 0.4 setosa 3 1
#> 7 4.6 3.4 1.4 0.3 setosa 3 3
#> 8 5 3.4 1.5 0.2 setosa 2 1
#> 9 4.4 2.9 1.4 0.2 setosa 2 1
#> 10 4.9 3.1 1.5 0.1 setosa 1 3
#> # ℹ 140 more rows
#> # ℹ 28 more variables: .folds_3 <fct>, .folds_4 <fct>, .folds_5 <fct>,
#> # .folds_6 <fct>, .folds_7 <fct>, .folds_8 <fct>, .folds_9 <fct>,
#> # .folds_10 <fct>, .folds_11 <fct>, .folds_12 <fct>, .folds_13 <fct>,
#> # .folds_14 <fct>, .folds_15 <fct>, .folds_16 <fct>, .folds_17 <fct>,
#> # .folds_18 <fct>, .folds_19 <fct>, .folds_20 <fct>, .folds_21 <fct>,
#> # .folds_22 <fct>, .folds_23 <fct>, .folds_24 <fct>, .folds_25 <fct>, …
```

We see that the `.folds_*` columns have been added with the fold identifiers. We will need the names of the generated fold columns in a second, so here's my favorite approach to generating them with `paste0()`:

```
# Quick way to generate the names of the fold columns
# Note: `seq_along()` is equal to `1:length(fold_counts)`
fold_columns <- paste0(".folds_", seq_along(fold_counts))
fold_columns
#> [1] ".folds_1" ".folds_2" ".folds_3" ".folds_4" ".folds_5" ".folds_6"
#> [7] ".folds_7" ".folds_8" ".folds_9" ".folds_10" ".folds_11" ".folds_12"
#> [13] ".folds_13" ".folds_14" ".folds_15" ".folds_16" ".folds_17" ".folds_18"
#> [19] ".folds_19" ".folds_20" ".folds_21" ".folds_22" ".folds_23" ".folds_24"
#> [25] ".folds_25" ".folds_26" ".folds_27" ".folds_28" ".folds_29" ".folds_30"
```

`groupdata2` has the tool `summarize_group_cols()` for inspecting the generated fold columns (and factors in general). We can use this to assure ourselves that the right number of folds were created in each of the fold columns:

```
fold_stats <- groupdata2::summarize_group_cols(
  data = data,
  group_cols = fold_columns
) %>%
  dplyr::rename(`Fold Column` = `Group Column`)

# View fold column statistics
# We only look at one fold column per `k` setting
fold_stats %>%
  dplyr::group_by(`Num Groups`) %>%
  dplyr::filter(dplyr::row_number() == 1) %>%
  knitr::kable()
```

| Fold Column | Num Groups | Mean Rows | Median Rows | Std Rows | IQR Rows | Min Rows | Max Rows |
|---|---|---|---|---|---|---|---|
| .folds_1 | 3 | 50.000000 | 51 | 1.7320508 | 1.50 | 48 | 51 |
| .folds_4 | 8 | 18.750000 | 18 | 1.3887301 | 0.75 | 18 | 21 |
| .folds_7 | 13 | 11.538462 | 12 | 1.1266014 | 0.00 | 9 | 12 |
| .folds_10 | 19 | 7.894737 | 9 | 1.4867839 | 3.00 | 6 | 9 |
| .folds_13 | 24 | 6.250000 | 6 | 0.8469896 | 0.00 | 6 | 9 |
| .folds_16 | 29 | 5.172414 | 6 | 1.3645765 | 3.00 | 3 | 6 |
| .folds_19 | 34 | 4.411765 | 3 | 1.5199212 | 3.00 | 3 | 6 |
| .folds_22 | 40 | 3.750000 | 3 | 1.3155870 | 0.75 | 3 | 6 |
| .folds_25 | 45 | 3.333333 | 3 | 0.9534626 | 0.00 | 3 | 6 |
| .folds_28 | 50 | 3.000000 | 3 | 0.0000000 | 0.00 | 3 | 3 |

Now we are ready to cross-validate a model on our data. We will use the `e1071::svm()` Support Vector Machine model function. To use this with `cross_validate_fn()`, we can use the included `model_fn` and `predict_fn` functions. We further need to specify the `kernel` and `cost` hyperparameters.

For more elaborate examples of `cross_validate_fn()`, see here.

```
# To use the e1071::svm() model function
# we specify the model and predict functions
model_fn <- cvms::model_functions("svm_multinomial")
predict_fn <- cvms::predict_functions("svm_multinomial")
# Specify hyperparameters
hyperparameters <- list('kernel' = 'radial', 'cost' = 10)
```

We define an (arbitrary) set of formulas to cross-validate:

```
formulas <- c(
  "Species ~ Sepal.Length + Sepal.Width",
  "Species ~ Petal.Length + Petal.Width",
  "Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width",
  "Species ~ Sepal.Length * Sepal.Width + Petal.Length * Petal.Width"
)
```

Now, we are ready to run the cross-validation! We pass our data, formulas, functions, hyperparameters, and fold column names to `cross_validate_fn()` and specify that the type of task is multiclass classification (i.e. `multinomial`). We also enable parallelization.

NOTE: This number of fold columns and formulas requires fitting *3180* model instances. That can take a few minutes to run, depending on your computer. Unfortunately, I have not yet found a way to include a progress bar when running in parallel.

```
cv <- cross_validate_fn(
  data = data,
  formulas = formulas,
  type = "multinomial",
  model_fn = model_fn,
  predict_fn = predict_fn,
  hyperparameters = hyperparameters,
  fold_cols = fold_columns,
  parallel = TRUE
)
```

```
cv
#> # A tibble: 4 × 26
#> Fixed `Overall Accuracy` `Balanced Accuracy` F1 Sensitivity Specificity
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Sepal.Le… 0.789 0.842 0.789 0.789 0.894
#> 2 Petal.Le… 0.958 0.968 0.958 0.958 0.979
#> 3 Sepal.Le… 0.960 0.970 0.960 0.960 0.980
#> 4 Sepal.Le… 0.974 0.981 0.974 0.974 0.987
#> # ℹ 20 more variables: `Pos Pred Value` <dbl>, `Neg Pred Value` <dbl>,
#> # Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>,
#> # `Detection Prevalence` <dbl>, Prevalence <dbl>, Predictions <chr>,
#> # `Confusion Matrix` <list>, Results <list>, `Class Level Results` <list>,
#> # Coefficients <list>, Folds <int>, `Fold Columns` <int>,
#> # `Convergence Warnings` <int>, `Other Warnings` <int>,
#> # `Warnings and Messages` <list>, Process <list>, …
```

Above, we see the averaged results from all the fold columns. The last model formula seems to have performed the best, as it has the highest `Overall Accuracy`, `Balanced Accuracy`, `F1`, and so on. If our objective was to average the results with different settings of `k` to increase robustness to that choice, we could stop now. In the following section, we will have a look at the results from the different settings of `k`.

Results per setting of `k`

There is a list of nested tibbles (data frames) in the cross-validation output called `Results`. This has the results from each fold column. Let's extract it and format it a bit.

Note: In regression tasks, the `Results` tibbles would have the results from *each fold*, from each fold column, but in classification we gather the predictions from all folds within a fold column before evaluation.

```
# Extract the fold column results
# This is a list of data frames (one per formula)
fold_column_results <- cv$Results
```

As this is currently a list of tibbles (one for each formula), we first name it by the model formulas and then combine the tibbles to a single data frame with `dplyr::bind_rows()`.

```
# Set the names of the data frames to their respective formula
names(fold_column_results) <- cv$Fixed

# Combine the data frames
# Create a 'Formula' column from the names
fold_column_results <- fold_column_results %>%
  dplyr::bind_rows(.id = "Formula")
```

We now have a single tibble where a new column (`Formula`) specifies which model the results came from. When plotting, the full model formula strings are a bit long though, so let's convert them to something shorter:

```
# Make the formula string prettier for plotting
# Sepal -> S; Petal -> P; Width -> W; Length -> L
fold_column_results <- fold_column_results %>%
  dplyr::mutate(
    Formula = gsub(
      x = Formula,
      pattern = "[^SPWL+*]",
      replacement = ""
    )
  )
```
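To see what the pattern does, we can apply it to one of the formula strings by hand; only the capital letters S, P, W, L and the `+`/`*` operators survive:

```
# Everything except S, P, W, L, + and * is removed
gsub(pattern = "[^SPWL+*]", replacement = "", x = "Sepal.Length + Sepal.Width")
#> [1] "SL+SW"
```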

This leaves us with the following data frame:

```
fold_column_results
#> # A tibble: 120 × 14
#> Formula `Fold Column` `Overall Accuracy` `Balanced Accuracy` F1
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 SL+SW .folds_1 0.773 0.83 0.774
#> 2 SL+SW .folds_2 0.767 0.825 0.768
#> 3 SL+SW .folds_3 0.76 0.82 0.761
#> 4 SL+SW .folds_4 0.78 0.835 0.780
#> 5 SL+SW .folds_5 0.8 0.85 0.801
#> 6 SL+SW .folds_6 0.787 0.84 0.787
#> 7 SL+SW .folds_7 0.78 0.835 0.781
#> 8 SL+SW .folds_8 0.807 0.855 0.806
#> 9 SL+SW .folds_9 0.787 0.84 0.788
#> 10 SL+SW .folds_10 0.787 0.84 0.787
#> # ℹ 110 more rows
#> # ℹ 9 more variables: Sensitivity <dbl>, Specificity <dbl>,
#> # `Pos Pred Value` <dbl>, `Neg Pred Value` <dbl>, Kappa <dbl>, MCC <dbl>,
#> # `Detection Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>
```

Currently, we lack the number of folds for each of the fold columns. We have stored those in the fold column statistics we looked at in the beginning (`fold_stats`), so let's add them with `dplyr::left_join()`:

```
# Select the `Num Groups` column and rename to `Num Folds`
fold_counts_df <- fold_stats %>%
  dplyr::select(`Fold Column`, `Num Groups`) %>%
  dplyr::rename(`Num Folds` = `Num Groups`)
fold_counts_df
#> # A tibble: 30 × 2
#> `Fold Column` `Num Folds`
#> <chr> <dbl>
#> 1 .folds_1 3
#> 2 .folds_2 3
#> 3 .folds_3 3
#> 4 .folds_4 8
#> 5 .folds_5 8
#> 6 .folds_6 8
#> 7 .folds_7 13
#> 8 .folds_8 13
#> 9 .folds_9 13
#> 10 .folds_10 19
#> # ℹ 20 more rows
# Add the counts with a join
fold_column_results <- fold_counts_df %>%
  dplyr::left_join(fold_column_results, by = "Fold Column")
fold_column_results
#> # A tibble: 120 × 15
#> `Fold Column` `Num Folds` Formula `Overall Accuracy` `Balanced Accuracy`
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 .folds_1 3 SL+SW 0.773 0.83
#> 2 .folds_1 3 PL+PW 0.967 0.975
#> 3 .folds_1 3 SL+SW+PL+PW 0.967 0.975
#> 4 .folds_1 3 SL*SW+PL*PW 0.973 0.98
#> 5 .folds_2 3 SL+SW 0.767 0.825
#> 6 .folds_2 3 PL+PW 0.953 0.965
#> 7 .folds_2 3 SL+SW+PL+PW 0.96 0.97
#> 8 .folds_2 3 SL*SW+PL*PW 0.967 0.975
#> 9 .folds_3 3 SL+SW 0.76 0.82
#> 10 .folds_3 3 PL+PW 0.96 0.97
#> # ℹ 110 more rows
#> # ℹ 10 more variables: F1 <dbl>, Sensitivity <dbl>, Specificity <dbl>,
#> # `Pos Pred Value` <dbl>, `Neg Pred Value` <dbl>, Kappa <dbl>, MCC <dbl>,
#> # `Detection Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>
```

We further add a column that indicates which repetition of the `k` setting a fold column is. This is done with the `l_starts` method in `groupdata2::group()`, which automatically starts a new group every time the value of a column changes. So `c(".folds_1", ".folds_1", ".folds_2", ".folds_2")` would give the groups `c(1, 1, 2, 2)`. By first grouping the data frame by the number of folds, these group indices start over for each setting of `k`. Note though, that this assumes that the `Fold Column` is sorted correctly. We should also remember to remove the grouping in the end, if we don't need it in the following step.
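Before applying it to our results, here is the `c(".folds_1", ".folds_1", ".folds_2", ".folds_2")` example from above as a runnable toy snippet (not part of the analysis data):

```
library(groupdata2)
library(dplyr)

# Toy input: two fold column names, each repeated twice
toy <- dplyr::tibble(
  `Fold Column` = c(".folds_1", ".folds_1", ".folds_2", ".folds_2")
)

# `l_starts` with n = 'auto' starts a new group whenever
# the value in `starts_col` changes
groupdata2::group(
  toy,
  n = "auto",
  method = "l_starts",
  starts_col = "Fold Column",
  col_name = "Repetition"
)
```

As described above, this assigns the `Repetition` groups `c(1, 1, 2, 2)`.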

```
# Add a column for indicating the repetition of the `k` setting
fold_column_results <- fold_column_results %>%
  # Group to restart groupdata2 group numbers for each `Num Folds` setting
  dplyr::group_by(`Num Folds`) %>%
  # Create a group whenever the `Fold Column` column changes
  groupdata2::group(
    n = "auto",
    method = "l_starts",
    starts_col = "Fold Column",
    col_name = "Repetition"
  ) %>%
  dplyr::ungroup()

# Inspect the columns relevant to this
fold_column_results %>%
  dplyr::select(`Fold Column`, `Num Folds`, Formula, Repetition) %>%
  head(20) %>%
  knitr::kable()
```

| Fold Column | Num Folds | Formula | Repetition |
|---|---|---|---|
| .folds_1 | 3 | SL+SW | 1 |
| .folds_1 | 3 | PL+PW | 1 |
| .folds_1 | 3 | SL+SW+PL+PW | 1 |
| .folds_1 | 3 | SL\*SW+PL\*PW | 1 |
| .folds_2 | 3 | SL+SW | 2 |
| .folds_2 | 3 | PL+PW | 2 |
| .folds_2 | 3 | SL+SW+PL+PW | 2 |
| .folds_2 | 3 | SL\*SW+PL\*PW | 2 |
| .folds_3 | 3 | SL+SW | 3 |
| .folds_3 | 3 | PL+PW | 3 |
| .folds_3 | 3 | SL+SW+PL+PW | 3 |
| .folds_3 | 3 | SL\*SW+PL\*PW | 3 |
| .folds_4 | 8 | SL+SW | 1 |
| .folds_4 | 8 | PL+PW | 1 |
| .folds_4 | 8 | SL+SW+PL+PW | 1 |
| .folds_4 | 8 | SL\*SW+PL\*PW | 1 |
| .folds_5 | 8 | SL+SW | 2 |
| .folds_5 | 8 | PL+PW | 2 |
| .folds_5 | 8 | SL+SW+PL+PW | 2 |
| .folds_5 | 8 | SL\*SW+PL\*PW | 2 |

To plot the average lines for each formula, we calculate the average `Balanced Accuracy` for each formula, for each number-of-folds setting:

```
avg_balanced_acc <- fold_column_results %>%
  dplyr::group_by(`Num Folds`, Formula) %>%
  dplyr::summarise(
    `Balanced Accuracy` = mean(`Balanced Accuracy`),
    .groups = "drop"
  )
avg_balanced_acc
#> # A tibble: 40 × 3
#> `Num Folds` Formula `Balanced Accuracy`
#> <dbl> <chr> <dbl>
#> 1 3 PL+PW 0.97
#> 2 3 SL*SW+PL*PW 0.978
#> 3 3 SL+SW 0.825
#> 4 3 SL+SW+PL+PW 0.973
#> 5 8 PL+PW 0.967
#> 6 8 SL*SW+PL*PW 0.97
#> 7 8 SL+SW 0.842
#> 8 8 SL+SW+PL+PW 0.967
#> 9 13 PL+PW 0.967
#> 10 13 SL*SW+PL*PW 0.98
#> # ℹ 30 more rows
```

With the data ready, we plot the effect of `k` on the `Balanced Accuracy` metric. Feel free to plot one of the other metrics as well!

```
# Plot the balanced accuracy by the number of folds
# We add jitter to the points to separate overlapping points slightly
fold_column_results %>%
  ggplot(aes(x = `Num Folds`, y = `Balanced Accuracy`, color = Formula)) +
  geom_point(
    aes(shape = Repetition),
    size = 1,
    position = position_jitter(h = 0.0, w = 0.6)
  ) +
  geom_line(data = avg_balanced_acc) +
  theme_minimal()
```

For this dataset, the ranking of the models seems somewhat stable, although the `PL+PW` and `SL+SW+PL+PW` models are so close that their ranking might switch at times. The variation in the three points at each `k` setting (for each formula) shows why repeated cross-validation is a good idea when possible. Without it, the random split can have a much bigger effect on the results (and potentially the model ranking). By relying on the average of multiple `k` settings and repetitions, our results are more robust to fluctuations.

If you don’t want to always run all these models (e.g. in production), running this analysis at the beginning (and perhaps once in a while, in case of data drift) might help you check whether the choice of `k` makes a difference with your type of data.

Finally, let’s have a look at the **class level** fold column results from the one-vs-all evaluations. These describe how well a model did for each of the species, for each of the fold columns. We will make a plot similar to the above to see whether the `k` setting affects the class level performance.

The cross-validation output has a nested tibble called `Class Level Results` for each formula. Within such a tibble, we find another nested tibble (`Results`) that contains the results for each fold column, for each species.

To get the class level fold column results for the best model (i.e. `Sepal.Length * Sepal.Width + Petal.Length * Petal.Width`), we thus first get the `Class Level Results` for this (fourth) model and then extract the `Results` from that. This gives us a list with the fold column results for each species, which we concatenate to a single data frame.

```
# Extract class level fold column results for the best model
# It is a list of tibbles (one for each species)
# so we concatenate them to a single tibble with bind_rows()
class_level_fold_results <- cv$`Class Level Results`[[4]]$Results %>%
  dplyr::bind_rows()

# Add the Num Folds counts
class_level_fold_results <- fold_counts_df %>%
  dplyr::left_join(class_level_fold_results, by = "Fold Column")
class_level_fold_results
#> # A tibble: 90 × 13
#> `Fold Column` `Num Folds` Class `Balanced Accuracy` F1 Sensitivity
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 .folds_1 3 setosa 1 1 1
#> 2 .folds_1 3 versicolor 0.965 0.959 0.94
#> 3 .folds_1 3 virginica 0.975 0.961 0.98
#> 4 .folds_2 3 setosa 1 1 1
#> 5 .folds_2 3 versicolor 0.96 0.949 0.94
#> 6 .folds_2 3 virginica 0.965 0.950 0.96
#> 7 .folds_3 3 setosa 1 1 1
#> 8 .folds_3 3 versicolor 0.96 0.958 0.92
#> 9 .folds_3 3 virginica 0.98 0.962 1
#> 10 .folds_4 8 setosa 0.99 0.990 0.98
#> # ℹ 80 more rows
#> # ℹ 7 more variables: Specificity <dbl>, `Pos Pred Value` <dbl>,
#> # `Neg Pred Value` <dbl>, Kappa <dbl>, `Detection Rate` <dbl>,
#> # `Detection Prevalence` <dbl>, Prevalence <dbl>
```

Again, we add the repetition column, using the `l_starts` method in `groupdata2::group()`:

```
# Add a column for indicating the repetition of the `k` setting
class_level_fold_results <- class_level_fold_results %>%
  # Group to restart groupdata2 group numbers for each `Num Folds` setting
  dplyr::group_by(`Num Folds`) %>%
  # Create a group whenever the `Fold Column` column changes
  groupdata2::group(
    n = "auto",
    method = "l_starts",
    starts_col = "Fold Column",
    col_name = "Repetition"
  ) %>%
  dplyr::ungroup()

# Inspect the columns relevant to this
class_level_fold_results %>%
  dplyr::select(`Fold Column`, `Num Folds`, Class, Repetition)
#> # A tibble: 90 × 4
#> `Fold Column` `Num Folds` Class Repetition
#> <chr> <dbl> <chr> <fct>
#> 1 .folds_1 3 setosa 1
#> 2 .folds_1 3 versicolor 1
#> 3 .folds_1 3 virginica 1
#> 4 .folds_2 3 setosa 2
#> 5 .folds_2 3 versicolor 2
#> 6 .folds_2 3 virginica 2
#> 7 .folds_3 3 setosa 3
#> 8 .folds_3 3 versicolor 3
#> 9 .folds_3 3 virginica 3
#> 10 .folds_4 8 setosa 1
#> # ℹ 80 more rows
```

To plot the average lines for each class, we calculate the average `Balanced Accuracy` for each class, for each number-of-folds setting:

```
class_level_avg_balanced_acc <- class_level_fold_results %>%
  dplyr::group_by(`Num Folds`, Class) %>%
  dplyr::summarise(
    `Balanced Accuracy` = mean(`Balanced Accuracy`),
    .groups = "drop"
  )
class_level_avg_balanced_acc
#> # A tibble: 30 × 3
#> `Num Folds` Class `Balanced Accuracy`
#> <dbl> <chr> <dbl>
#> 1 3 setosa 1
#> 2 3 versicolor 0.962
#> 3 3 virginica 0.973
#> 4 8 setosa 0.993
#> 5 8 versicolor 0.955
#> 6 8 virginica 0.962
#> 7 13 setosa 0.997
#> 8 13 versicolor 0.97
#> 9 13 virginica 0.973
#> 10 19 setosa 0.993
#> # ℹ 20 more rows
```

Now, we can plot the `Balanced Accuracy` by the number of folds:

```
# Plot the balanced accuracy by the number of folds
# We add jitter to the points to separate overlapping points slightly
class_level_fold_results %>%
  ggplot(aes(x = `Num Folds`, y = `Balanced Accuracy`, color = Class)) +
  geom_point(
    aes(shape = Repetition),
    size = 1,
    position = position_jitter(h = 0.0, w = 0.6)
  ) +
  geom_line(data = class_level_avg_balanced_acc) +
  theme_minimal()
```

The versicolor and virginica species seem affected by the number of folds. They both have slightly lower average balanced accuracies at the lower `k` settings.

In this vignette, we have covered the choice of the “number of folds” setting when using cross-validation. We have discussed why larger `k` settings should give lower prediction errors on average and shown how to make results robust to this setting by averaging over a range of `k` values. The `groupdata2::fold()` and `cvms` cross-validation functions enable this type of analysis.

This concludes the vignette. If elements are unclear or you need help applying this to your context, you can leave feedback by mail or in a GitHub issue :-)