The task of model selection targets the question: If there are
several competing models, how do I choose the most appropriate one? This
vignette^{1}
outlines the model selection tools implemented in
`{RprobitB}`

.

For illustration, we revisit the probit model of travelers deciding between two fictional train route alternatives from the vignette on model fitting:

```
form <- choice ~ price + time + change + comfort | 0
data <- prepare_data(form = form, choice_data = train_choice, id = "deciderID", idc = "occasionID")
model_train <- fit_model(
data = data,
scale = "price := -1",
R = 1000, B = 500, Q = 10
)
```

As a competing model, we consider explaining the choices only by the
alternative’s price, i.e. the probit model with the formula
`choice ~ price | 0`

:

The `update()`

function helps to estimate a new version of
`model_train`

with new specifications. Here, only
`form`

has been changed.

`model_selection()`

function`{RprobitB}`

provides the convenience function
`model_selection()`

, which takes an arbitrary number of
`RprobitB_fit`

objects and returns a matrix of model
selection criteria:

```
model_selection(model_train, model_train_sparse)
#> model_train model_train_sparse
#> npar 4 1
#> LL -1727.71 -1865.86
#> AIC 3463.42 3733.72
#> BIC 3487.35 3739.70
```

Specifying `criteria`

is optional. Per default,
`criteria = c("npar", "LL", "AIC", "BIC")`

.^{2} The available model
selection criteria are explained in the following.

`npar`

`"npar"`

yields the number of model parameters, which is
computed by the `npar()`

method:

Here, `model_train`

has 4 parameters (a coefficient for
price, time, change, and comfort, respectively), while
`model_train_sparse`

has only a single price coefficient.

`LL`

If `"LL"`

is included in `criteria`

,
`model_selection()`

returns the model’s log-likelihood
values. They can also be directly accessed via the `logLik()`

method:^{3}

`AIC`

Including `"AIC"`

yields the Akaike’s Information
Criterion (Akaike
1974), which is computed as \[-2
\cdot \text{LL} + k \cdot \text{npar},\] where \(\text{LL}\) is the model’s
log-likelihood value, \(k\) is the
penalty per parameter with \(k = 2\)
per default for the classical AIC, and \(\text{npar}\) is the number of parameters
in the fitted model.

Alternatively, the `AIC()`

method also returns the AIC
values:

```
AIC(model_train, model_train_sparse, k = 2)
#> df AIC
#> model_train 4 3463.419
#> model_train_sparse 1 3733.720
```

The AIC quantifies the trade-off between over- and under-fitting,
where smaller values are preferred. Here, the increase in goodness of
fit justifies the additional 3 parameters of
`model_train`

.

`BIC`

Similar to the AIC, `"BIC"`

yields the Bayesian
Information Criterion (Schwarz 1978), which is defined as
\[-2 \cdot \text{LL} + \log{(\text{nobs})}
\cdot \text{npar},\] where \(\text{LL}\) is the model’s
log-likelihood value, \(\text{nobs}\) is the number of data points
(which can be accessed via the `nobs()`

method), and \(\text{npar}\) is the number
of parameters in the fitted model. The interpretation is analogue to
AIC.

`{RprobitB}`

also provided a method for the
`BIC`

value:

`WAIC`

(with `se(WAIC)`

and
`pWAIC`

)WAIC is short for Widely Applicable (or Watanabe-Akaike) Information
Criterion (Watanabe
and Opper 2010). As for AIC and BIC, the smaller the WAIC
value the better the model. Including `"WAIC"`

in
`criteria`

yields the WAIC value, its standard error
`se(WAIC)`

, and the effective number of parameters
`pWAIC`

, see below.

The WAIC is defined as

\[-2 \cdot \text{lppd} + 2\cdot p_\text{WAIC},\]

where \(\text{lppd}\) stands for log pointwise predictive density and \(p_\text{WAIC}\) is a penalty term proportional to the variance in the posterior distribution that is sometimes called effective number of parameters, see McElreath (2020) p. 220 for a reference.

The \(\text{lppd}\) is approximated as follows. Let \[p_{si} = \Pr(y_i\mid \theta_s)\] be the probability of observation \(y_i\) (here the single choices) given the \(s\)-th set \(\theta_s\) of parameter samples from the posterior. Then

\[\text{lppd} = \sum_i \log \left( S^{-1} \sum_s p_{si} \right).\] The penalty term is computed as the sum over the variances in log-probability for each observation: \[p_\text{WAIC} = \sum_i \mathbb{V}_{\theta} \log (p_{si}) . \] The \(\text{WAIC}\) has a standard error of \[\sqrt{n \cdot \mathbb{V}_i \left[-2 \left(\text{lppd} - \mathbb{V}_{\theta} \log (p_{si}) \right)\right]},\] where \(n\) is the number of choices.

Before computing the WAIC of an object, the probabilities \(p_{si}\) must be computed via the
`compute_p_si()`

function: ^{4}

Afterwards, the WAIC can be accessed as follows:

`MMLL`

`"MMLL"`

in `criteria`

stands for marginal
model log-likelihood. The model’s marginal likelihood \(\Pr(y\mid M)\) for a model \(M\) and data \(y\) is required for the computation of
Bayes factors, see below. In general, the term has no closed form and
must be approximated numerically.

`{RprobitB}`

uses the posterior Gibbs samples derived from
the `mcmc()`

function to approximate the model’s marginal
likelihood via the posterior harmonic mean estimator (Newton and Raftery
1994): Let \(S\) denote the
number of available posterior samples \(\theta_1,\dots,\theta_S\). Then, \[\Pr(y\mid M) = \left(\mathbb{E}_\text{posterior}
1/\Pr(y\mid \theta,M) \right)^{-1} \approx \left( \frac{1}{S} \sum_s
1/\Pr(y\mid \theta_s,M) \right) ^{-1} = \tilde{\Pr}(y\mid
M).\]

By the law of large numbers, \(\tilde{\Pr}(y\mid M) \to \Pr(y\mid M)\) almost surely as \(S \to \infty\).

As for the WAIC, computing the MMLL relies on the
probabilities \(p_{si} = \Pr(y_i\mid
\theta_s)\), which must first be computed via the
`compute_p_si()`

function. Afterwards, the `mml()`

function can be called with an `RprobitB_fit`

object as
input. The function returns the `RprobitB_fit`

object, where
the marginal likelihood value is saved as the entry `"mml"`

and the marginal log-likelihood value as the attribute
`"mmll"`

:^{5}

There are two options for improving the approximation: As for the WAIC, you can use more posterior samples. Alternatively, you can combine the posterior harmonic mean estimate with the prior arithmetic mean estimator (Hammersley and Handscomb 1964): For this approach, \(S\) samples \(\theta_1,\dots,\theta_S\) are drawn from the model’s prior distribution. Then,

\[\Pr(y\mid M) = \mathbb{E}_\text{prior} \Pr(y\mid \theta,M) \approx \frac{1}{S} \sum_s \Pr(y\mid \theta_s,M) = \tilde{\Pr}(y\mid M).\]

Again, it hols by the law of large numbers, that \(\tilde{\Pr}(y\mid M) \to \Pr(y\mid M)\) almost surely as \(S \to \infty\). The final approximation of the model’s marginal likelihood than is a weighted sum of the posterior harmonic mean estimate and the prior arithmetic mean estimate, where the weights are determined by the sample sizes.

To use the prior arithmetic mean estimator, call the
`mml()`

function with a specification of the number of prior
draws `S`

and set `recompute = TRUE`

:^{6}

Note that the prior arithmetic mean estimator works well if the prior
and posterior distribution have a similar shape and strong overlap, as
Gronau et al. (2017) points out. Otherwise, most of
the sampled prior values result in a likelihood value close to zero,
thereby contributing only marginally to the approximation. In this case,
a very large number `S`

of prior samples is required.

`BF`

The Bayes factor is an index of relative posterior model plausibility of one model over another (Marin and Robert 2014). Given data \(\texttt{y}\) and two models \(\texttt{mod0}\) and \(\texttt{mod1}\), it is defined as

\[ BF(\texttt{mod0},\texttt{mod1}) = \frac{\Pr(\texttt{mod0} \mid \texttt{y})}{\Pr(\texttt{mod1} \mid \texttt{y})} = \frac{\Pr(\texttt{y} \mid \texttt{mod0} )}{\Pr(\texttt{y} \mid \texttt{mod1})} / \frac{\Pr(\texttt{mod0})}{\Pr(\texttt{mod1})}. \]

The ratio \(\Pr(\texttt{mod0}) /
\Pr(\texttt{mod1})\) expresses the factor by which \(\texttt{mod0}\) a priori is assumed to be
the correct model. Per default, `{RprobitB}`

sets \(\Pr(\texttt{mod0}) = \Pr(\texttt{mod1}) =
0.5\). The front part \(\Pr(\texttt{y}
\mid \texttt{mod0} ) / \Pr(\texttt{y} \mid \texttt{mod1})\) is
the ratio of the marginal model likelihoods. A value
of \(BF(\texttt{mod0},\texttt{mod1}) >
1\) means that the model \(\texttt{mod0}\) is more strongly supported
by the data under consideration than \(\texttt{mod1}\).

Adding `"BF"`

to the `criteria`

argument of
`model_selection`

yields the Bayes factors. We find a
decisive evidence (Jeffreys 1998) in favor of
`model_train`

.

`pred_acc`

Finally, adding `"pred_acc"`

to the `criteria`

argument for the `model_selection()`

function returns the
share of correctly predicted choices. From the output of
`model_selection()`

above (or alternatively the one in the
following) we deduce that `model_train`

correctly predicts
about 6% of the choices more than `model_train_sparse`

:^{7}

Akaike, H. 1974. “A New Look at the
Statistical Model Identification.” *IEEE Transactions
on Automatic Control* 19. https://doi.org/10.1109/TAC.1974.1100705.

Gronau, Q. F., A. Sarafoglou, D. Matzke, A. Ly, U. Boehm, M. Marsman, D.
S. Leslie, J. J. Forster, E. Wagenmakers, and H. Steingroever. 2017.
“A Tutorial on Bridge Sampling.” *Journal of
Mathematical Psychology* 81. https://doi.org/10.1016/j.jmp.2017.09.005.

Hammersley, J. M., and D. C. Handscomb. 1964. “General Principles
of the Monte Carlo Method.” In *Monte Carlo
Methods*, edited by M. S. Bartlett and D. R. Cox, 50–75. Dordrecht:
Springer Netherlands. https://doi.org/10.1007/978-94-009-5819-7_5.

Jeffreys, Harold. 1998. *The Theory of Probability*. OUP Oxford.

Marin, J., and C. Robert. 2014. *Bayesian
Essentials with R*. Springer Verlag. https://doi.org/10.1007/978-1-4614-8687-9.

McElreath, R. 2020. *Statistical Rethinking: A Bayesian
Course with Examples in R and Stan*.
Chapman; Hall/CRC. https://doi.org/10.1201/9780429029608.

Newton, M. A., and A. E. Raftery. 1994. “Approximate
Bayesian Inference with the Weighted Likelihood
Bootstrap.” *Journal of the Royal Statistical Society: Series
B (Methodological)* 56 (1).

Schwarz, G. 1978. “Estimating the Dimension of a Model.”
*The Annals of Statistics* 6.

Watanabe, S., and M. Opper. 2010. “Asymptotic Equivalence of
Bayes Cross Validation and Widely Applicable Information
Criterion in Singular Learning Theory.” *Journal of Machine
Learning Research* 11 (12).

This vignette is built using R 4.1.2 with the

`{RprobitB}`

1.1.4 package.↩︎To show the model formulas in the output of

`model_selection()`

, add the argument`add_form = TRUE`

.↩︎The log-likelihood values are per default computed at the point estimates derived from the Gibbs sample means.

`logLik()`

has the argument`par_set`

, where alternative statistics for the point estimate can be specified.↩︎This computation has been outsourced because it is very time consuming. For example, the computation for

`model_train`

took about 5 minutes. To decrease the computation time, the function offers parallelization via specifying the number`ncores`

of parallel CPU cores.↩︎Note that the marginal likelihood value is very small. The given representation is required so that the value is not rounded to 0 by the computer.↩︎

The

`mml()`

function offers parallelization via specifying the number`ncores`

of parallel CPU cores.↩︎See the vignette on choice prediction for more nuanced performance comparison in terms of sensitivity and specificity.↩︎