NeuralEstimators

Matthew Sainsbury-Dale, Andrew Zammit-Mangion, and Raphaël Huser

This vignette describes the R interface to NeuralEstimators, a Julia package that facilitates a suite of neural methods for simulation-based inference. These methods are likelihood-free and amortised, in the sense that, once the neural networks are trained on simulated data, they enable rapid inference across arbitrarily many observed data sets in a fraction of the time required by conventional approaches. The package accommodates any model for which simulation is feasible by allowing users to define their model implicitly through simulated data.

Comprehensive documentation for the Julia version of the package is available here.

1 Methodology

Here, we provide an overview of the amortised neural inferential methods supported by the package, which include neural Bayes estimators (Section 1.1), neural posterior estimators (Section 1.2), and neural ratio estimators (Section 1.3). For further details on each of these methods and amortised neural inference more broadly, see the review paper by Zammit-Mangion et al. (2025) and the references therein.

Notation: We denote model parameters of interest by \(\boldsymbol{\theta} \equiv (\theta_1, \dots, \theta_d)' \in \Theta\), where \(\Theta \subseteq \mathbb{R}^d\) is the parameter space. We denote data by \(\boldsymbol{Z} \equiv (Z_1, \dots, Z_n)' \in \mathscr{Z}\), where \(\mathscr{Z} \subseteq \mathbb{R}^n\) is the sample space. We denote neural-network parameters by \(\boldsymbol{\gamma}\). For simplicity, we assume that all measures admit densities with respect to the Lebesgue measure. We use \(\pi(\cdot)\) to denote the prior density function of the parameters. The input argument to a generic density function \(p(\cdot)\) serves to specify both the random variable associated with the density and its evaluation point.

1.1 Neural Bayes estimators

The goal of parametric point estimation is to estimate \(\boldsymbol{\theta}\) from data \(\boldsymbol{Z}\) using an estimator, \(\hat{\boldsymbol{\theta}} : \mathscr{Z}\to\Theta\). Estimators can be constructed intuitively within a decision-theoretic framework based on average-risk optimality. Specifically, consider a loss function \(L: \Theta \times \Theta \to [0, \infty)\). Then the Bayes risk of the estimator \(\hat{\boldsymbol{\theta}}(\cdot)\) is
\[ \int_\Theta \int_{\mathscr{Z}} L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(\boldsymbol{Z}))p(\boldsymbol{Z} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta}) \textrm{d} \boldsymbol{Z} \textrm{d}\boldsymbol{\theta}. \] Any minimiser of the Bayes risk is said to be a Bayes estimator with respect to \(L(\cdot, \cdot)\) and \(\pi(\cdot)\).

Bayes estimators are functionals of the posterior distribution (e.g., the Bayes estimator under quadratic loss is the posterior mean), and are therefore often unavailable in closed form. A way forward is to assume a flexible parametric function for \(\hat{\boldsymbol{\theta}}(\cdot)\), and to optimise the parameters within that function in order to approximate the Bayes estimator. Neural networks are ideal candidates, since they are universal function approximators, and because they are fast to evaluate. Let \(\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} : \mathscr{Z}\to\Theta\) denote a neural network parameterised by \(\boldsymbol{\gamma}\). Then a Bayes estimator may be approximated by \(\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma^*}}(\cdot)\), where \[ \boldsymbol{\gamma}^* \equiv \underset{\boldsymbol{\gamma}}{\mathrm{arg\,min\;}} \frac{1}{K} \sum_{k = 1}^K L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(k)})), \] with \(\boldsymbol{\theta}^{(k)} \sim \pi(\boldsymbol{\theta})\) and, independently for each \(k\), \(\boldsymbol{Z}^{(k)} \sim p(\boldsymbol{Z} \mid \boldsymbol{\theta}^{(k)})\). The process of obtaining \(\boldsymbol{\gamma}^*\) is referred to as “training the network”, and this can be performed efficiently using back-propagation and stochastic gradient descent. The trained neural network \(\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma^*}}(\cdot)\) approximately minimises the Bayes risk, and therefore it is called a neural Bayes estimator (Sainsbury-Dale et al., 2024).

Once trained, a neural Bayes estimator can be applied repeatedly to observed data (whose structure conforms with the chosen neural-network architecture) at a fraction of the computational cost of conventional inferential methods. It is therefore ideal to use a neural Bayes estimator in settings where inference needs to be made repeatedly; in this case, the initial training cost is said to be amortised over time.

1.1.1 Uncertainty quantification with neural Bayes estimators

Uncertainty quantification with neural Bayes estimators often proceeds through the bootstrap distribution (e.g., Lenzi et al., 2023; Richards et al., 2024; Sainsbury-Dale et al., 2024). Bootstrap-based approaches are particularly attractive when nonparametric bootstrap is possible (e.g., when the data are independent replicates), or when simulation from the fitted model is fast, in which case parametric bootstrap is also computationally efficient. However, these conditions are not always met and, although bootstrap-based approaches are often considered to be fairly accurate and favourable to methods based on asymptotic normality, there are situations where bootstrap procedures are not reliable (see, e.g., Canty et al., 2006, pg. 6).

Alternatively, by leveraging ideas from (Bayesian) quantile regression, one may construct a neural Bayes estimator that approximates a set of marginal posterior quantiles (Fisher et al., 2023; Sainsbury-Dale et al., 2025), which can then be used to construct credible intervals for each parameter. Inference then remains fully amortised since, once the estimators are trained, both point estimates and credible intervals can be obtained with virtually zero computational cost. Specifically, posterior quantiles can be targeted by training a neural Bayes estimator under the loss function \[ L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}; \tau) \equiv \sum_{j=1}^d (\hat{\theta}_j - \theta_j)\{\mathbb{I}(\hat{\theta}_j - \theta_j) - \tau\}, \quad 0 < \tau < 1, \] where \(\mathbb{I}(\cdot)\) denotes the indicator function, since the Bayes estimator under this loss function is the vector of marginal posterior \(\tau\)-quantiles (Sainsbury-Dale et al., 2025, Sec. 2.2.4).

1.2 Neural posterior estimators

We now describe amortised approximate posterior inference through the minimisation of an expected Kullback–Leibler (KL) divergence. Throughout, we let \(q(\boldsymbol{\theta}; \boldsymbol{\kappa})\) denote a parametric approximation to the posterior distribution \(p(\boldsymbol{\theta} \mid \boldsymbol{Z})\), where the approximate-distribution parameters \(\boldsymbol{\kappa}\) belong to a space \(\mathscr{K}\).

We first consider the non-amortised case, where the optimal parameters \(\boldsymbol{\kappa}^*\) for a single data set \(\boldsymbol{Z}\) are found by minimising the KL divergence between \(p(\boldsymbol{\theta} \mid \boldsymbol{Z})\) and \(q(\boldsymbol{\theta}; \boldsymbol{\kappa})\): \[ \boldsymbol{\kappa}^* \equiv \underset{\boldsymbol{\kappa}}{\mathrm{arg\,min\;}} \textrm{KL}\{p(\boldsymbol{\theta} \mid \boldsymbol{Z}) \, \| \, q(\boldsymbol{\theta} ;\boldsymbol{\kappa})\} = \underset{\boldsymbol{\kappa}}{\mathrm{arg\,min\;}} -\int_{\Theta} \log\{ q(\boldsymbol{\theta} ;\boldsymbol{\kappa})\} p(\boldsymbol{\theta} \mid \boldsymbol{Z}) \textrm{d}\boldsymbol{\theta} . \] The resulting approximate posterior \(q(\boldsymbol{\theta}; \boldsymbol{\kappa}^*)\) targets the true posterior in the sense that the KL divergence is zero if and only if \(q(\boldsymbol{\theta}; \boldsymbol{\kappa}^*) = p(\boldsymbol{\theta} \mid \boldsymbol{Z})\) for all \(\boldsymbol{\theta} \in \Theta\). However, solving this optimisation problem is often computationally demanding even for a single data set \(\boldsymbol{Z}\), and solving it for many different data sets can be computationally prohibitive. The optimisation problem can be amortised by treating the parameters \(\boldsymbol{\kappa}\) as a function \(\boldsymbol{\kappa} : \mathscr{Z} \to \mathscr{K}\), and then choosing the function \(\boldsymbol{\kappa}^*(\cdot)\) that minimises an expected KL divergence: \[ \boldsymbol{\kappa}^*(\cdot) \equiv \underset{\boldsymbol{\kappa}(\cdot)}{\mathrm{arg\,min\;}} \mathbb{E}_{\boldsymbol{Z}}[\textrm{KL}\{p(\boldsymbol{\theta} \mid \boldsymbol{Z}) \, \| \, q(\boldsymbol{\theta} ;\boldsymbol{\kappa}(\boldsymbol{Z}))\}]. \] In practice, we approximate \(\boldsymbol{\kappa}^*(\cdot)\) using a neural network, \(\boldsymbol{\kappa}_{\boldsymbol{\gamma}} : \mathscr{Z} \to \mathscr{K}\), which is parameterised by \(\boldsymbol{\gamma}\) and trained by minimising a Monte Carlo approximation of the expected KL divergence above: \[ \boldsymbol{\gamma}^* \equiv \underset{\boldsymbol{\gamma}}{\mathrm{arg\,min}} -\sum_{k=1}^K \log q(\boldsymbol{\theta}^{(k)}; \boldsymbol{\kappa}_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(k)})). \] Once trained, the neural network \(\boldsymbol{\kappa}_{\boldsymbol{\gamma}^*}(\cdot)\) may be used to estimate the optimal approximate-distribution parameters \(\boldsymbol{\kappa}^*\) given data \(\boldsymbol{Z}\) at almost no computational cost. The neural network \(\boldsymbol{\kappa}_{\boldsymbol{\gamma}^*}(\cdot)\), together with the corresponding approximate distribution \(q(\cdot; \boldsymbol{\kappa})\), is collectively referred to as a neural posterior estimator.

There are numerous options for the approximate distribution \(q(\cdot; \boldsymbol{\kappa})\). For instance, \(q(\cdot;\boldsymbol{\kappa})\) can be modelled as a Gaussian distribution (e.g., Chan et al., 2018), where the parameters \(\boldsymbol{\kappa} = (\boldsymbol{\mu}', \textrm{vech}(\boldsymbol{L})')'\) consist of a \(d\)-dimensional mean parameter \(\boldsymbol{\mu}\) and the \(d(d+1)/2\) non-zero elements of the lower Cholesky factor \(\boldsymbol{L}\) of a covariance matrix, and the half-vectorisation operator \(\textrm{vech}(\cdot)\) vectorises the lower triangle of its matrix argument. One may also consider Gaussian mixtures (e.g., Papamakarios & Murray, 2016) or trans-Gaussian distributions (e.g., Maceda et al., 2024). However, the most widely adopted approach is to model \(q(\cdot\,; \boldsymbol{\kappa})\) using a normalising flow (e.g., Ardizzone et al., 2019; Radev et al., 2022), excellent reviews for which are given by Kobyzev et al. (2020) and Papamakarios et al. (2021). A particularly popular class of normalising flow is the affine coupling flow (e.g., Dinh et al., 2016; Kingma & Dhariwal, 2018; Ardizzone et al., 2019). Since affine coupling flows are universal density approximators (Teshima et al., 2020), they serve as the default and recommended choice for approximate distributions in this package; for further details, see the Julia documentation for NormalisingFlow.

1.3 Neural ratio estimators

Finally, we describe amortised inference by approximation of the likelihood-to-evidence ratio, \[ r(\boldsymbol{Z}, \boldsymbol{\theta}) \equiv p(\boldsymbol{Z} \mid \boldsymbol{\theta})/p(\boldsymbol{Z}), \] where \(p(\boldsymbol{Z} \mid \boldsymbol{\theta})\) is the likelihood and \(p(\boldsymbol{Z})\) is the marginal likelihood (also known as the model evidence).

The likelihood-to-evidence ratio is ubiquitous in statistical inference. For example, likelihood ratios of the form \(p(\boldsymbol{Z}\mid \boldsymbol{\theta}_0)/p(\boldsymbol{Z}\mid \boldsymbol{\theta}_1)=r(\boldsymbol{Z}, \boldsymbol{\theta}_0)/r(\boldsymbol{Z}, \boldsymbol{\theta}_1)\) are central to hypothesis testing and model comparison, and naturally appear in the transition probabilities of most standard MCMC algorithms used for Bayesian inference. Further, since the likelihood-to-evidence ratio is a prior-free quantity, its approximation facilitates Bayesian inference in applications where one requires multiple fits of the model under different prior distributions.

Unlike the methods discussed earlier, the likelihood-to-evidence ratio might not immediately seem like a quantity well-suited for approximation by neural networks, which are typically trained by minimising empirical risk functions. However, this ratio emerges naturally as a simple transformation of the optimal solution to a standard binary classification problem, derived through the minimisation of an average risk. Specifically, consider a binary classifier \(c(\boldsymbol{Z}, \boldsymbol{\theta})\) that distinguishes dependent data-parameter pairs \({(\boldsymbol{Z}', \boldsymbol{\theta}')' \sim p(\boldsymbol{Z}, \boldsymbol{\theta})}\) with class labels \(Y=1\) from independent data-parameter pairs \({(\tilde{\boldsymbol{Z}}', \tilde{\boldsymbol{\theta}}')' \sim p(\boldsymbol{Z})p(\boldsymbol{\theta})}\) with class labels \(Y=0\), and where the classes are balanced. Here, \(p(\boldsymbol{\theta})\) denotes an arbitrary “proposal” distribution for \(\boldsymbol{\theta}\) that does not, in general, coincide with the prior distribution (see below). Then, the Bayes classifier under binary cross-entropy loss is defined as \[ \begin{aligned} c^*(\cdot, \cdot) &\equiv \underset{c(\cdot, \cdot)}{\mathrm{arg\,min}} \sum_{y\in\{0, 1\}} \textrm{Pr}(Y = y) \int_\Theta\int_\mathscr{Z}L_{\textrm{BCE}}\{y, c(\boldsymbol{Z}, \boldsymbol{\theta})\}p(\boldsymbol{Z}, \boldsymbol{\theta} \mid Y = y)\textrm{d} \boldsymbol{Z} \textrm{d} \boldsymbol{\theta}\\ &= \underset{c(\cdot, \cdot)}{\mathrm{arg\,min}} - \int_\Theta\int_\mathscr{Z}\Big[\log\{c(\boldsymbol{Z}, \boldsymbol{\theta})\}p(\boldsymbol{Z}, \boldsymbol{\theta}) + \log\{1 - c(\boldsymbol{Z}, \boldsymbol{\theta})\}p(\boldsymbol{Z})p(\boldsymbol{\theta}) \Big]\textrm{d} \boldsymbol{Z} \textrm{d} \boldsymbol{\theta}, \end{aligned} \] where \(L_{\textrm{BCE}}(y, c) \equiv -y\log(c) - (1 - y) \log(1 - c)\). It can be shown (e.g., Hermans et al., 2020, App. B) that the Bayes classifier is given by \[ c^*(\boldsymbol{Z}, \boldsymbol{\theta}) = \frac{p(\boldsymbol{Z}, \boldsymbol{\theta})}{p(\boldsymbol{Z}, \boldsymbol{\theta}) + p(\boldsymbol{\theta})p(\boldsymbol{Z})}, \quad \boldsymbol{Z} \in \mathscr{Z}, \boldsymbol{\theta} \in \Theta, \] and, hence, \[ r(\boldsymbol{Z}, \boldsymbol{\theta}) = \frac{c^*(\boldsymbol{Z}, \boldsymbol{\theta})}{1 - c^*(\boldsymbol{Z}, \boldsymbol{\theta})}, \quad \boldsymbol{Z} \in \mathscr{Z}, \boldsymbol{\theta} \in \Theta. \] This connection links the likelihood-to-evidence ratio to the average-risk-optimal solution of a standard binary classification problem, and consequently provides a foundation for approximating the ratio using neural networks. Specifically, let \(c_{\boldsymbol{\gamma}}: \mathscr{Z} \times \Theta \to (0, 1)\) denote a neural network parametrised by \(\boldsymbol{\gamma}\). Then the Bayes classifier may be approximated by \(c_{\boldsymbol{\gamma}^*}(\cdot, \cdot)\), where \[ \boldsymbol{\gamma}^* \equiv \underset{\boldsymbol{\gamma}}{\mathrm{arg\,min}} -\sum_{k=1}^K \Big[\log\{c_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(k)}, \boldsymbol{\theta}^{(k)})\} + \log\{1 - c_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(\sigma(k))}, \boldsymbol{\theta}^{(k)})\} \Big], \] with each \(\boldsymbol{\theta}^{(k)}\) sampled independently from a “proposal” distribution \(p(\boldsymbol{\theta})\), \(\boldsymbol{Z}^{(k)} \sim p(\boldsymbol{Z} \mid \boldsymbol{\theta}^{(k)})\), and \(\sigma(\cdot)\) a random permutation of \(\{1, \dots, K\}\). The proposal distribution \(p(\boldsymbol{\theta})\) does not necessarily correspond to the prior distribution \(\pi(\boldsymbol{\theta})\), which is specified in the downstream inference algorithm (see below). In theory, any \(p(\boldsymbol{\theta})\) with support over \(\Theta\) can be used. However, with finite training data, the choice of \(p(\boldsymbol{\theta})\) is important, as it determines where the parameters \(\{\boldsymbol{\theta}^{(k)}\}\) are most densely sampled and, hence, where the neural network \(c_{\boldsymbol{\gamma}^*}(\cdot, \cdot)\) best approximates the Bayes classifier. Further, since neural networks are only reliable within the support of their training samples, a \(p(\boldsymbol{\theta})\) lacking full support over \(\Theta\) essentially acts as a “soft prior”.

Once the neural network is trained, \(r_{\boldsymbol{\gamma}^*}(\boldsymbol{Z}, \boldsymbol{\theta}) \equiv c_{\boldsymbol{\gamma}^*}(\boldsymbol{Z}, \boldsymbol{\theta})\{1 - c_{\boldsymbol{\gamma}^*}(\boldsymbol{Z}, \boldsymbol{\theta})\}^{-1}\), \(\boldsymbol{Z} \in \mathscr{Z}, \boldsymbol{\theta} \in \Theta\), may be used to quickly approximate the likelihood-to-evidence ratio, and therefore it is called a neural ratio estimator.

Inference based on a neural ratio estimator may proceed in a frequentist setting via maximum likelihood and likelihood ratios (e.g., Walchessen et al., 2024), and in a Bayesian setting by facilitating the computation of transition probabilities in Hamiltonian Monte Carlo and MCMC algorithms (e.g., Hermans et al., 2020). Further, an approximate posterior distribution can be obtained via the identity \({p(\boldsymbol{\theta} \mid \boldsymbol{Z})} = \pi(\boldsymbol{\theta}) r(\boldsymbol{\theta}, \boldsymbol{Z})\), and sampled from using standard sampling techniques (e.g., Thomas et al., 2022).

2 Package overview

The R interface to NeuralEstimators aims to facilitate the use of neural inferential methods for statisticians familiar with R. This is achieved by leveraging the R package JuliaConnectoR (Lenz et al., 2022), which enables seamless communication between R and Julia. The core functions of NeuralEstimators, such as training and assessing the neural networks, are wrapped in R functions to make them accessible (and user-friendly) within the R environment. The only step that requires users to write Julia code is the construction of the neural network architecture. This design choice ensures flexibility for defining custom architectures.

The typical workflow when using the R interface to NeuralEstimators is as follows, with the steps requiring Julia code explicitly highlighted:

  1. Sample parameters \(\boldsymbol{\theta}\) (from the prior or proposal distribution) to form training/validation/test parameter sets. Alternatively, define a function to sample parameters dynamically during training. Parameters are typically stored as \(d \times K\) matrices, where \(d\) is the dimension of \(\boldsymbol{\theta}\) and \(K\) is the number of parameter vectors in the given parameter set, though any batchable object is supported.
  2. Simulate data from the model conditional on the above parameter sets, to form training/validation/test data sets. Simulated data sets are stored as batches in a format amenable to the chosen neural-network architecture (see Step 3).
  3. (Julia) Construct a neural network that maps \(K\) data sets to a \(d^* \times K\) matrix of summary statistics for \(\boldsymbol{\theta}\), where \(d^*\) is a user-specified number of learned summary statistics. The architecture class (e.g., MLP, CNN, GNN, DeepSet) should reflect the structure of the data (e.g., unstructured, grid, graph, replicated). Any Flux.jl or Lux.jl model can be used.
  4. (Julia) Construct a NeuralEstimator by wrapping the neural network in the subtype corresponding to the intended inferential method (PointEstimator, PosteriorEstimator, RatioEstimator).
  5. Train the NeuralEstimator using train() and the training set, monitoring performance and convergence using the validation set.
  6. Assess the NeuralEstimator using assess() and the test set.
  7. Apply the NeuralEstimator to observed data using estimate().

Once the NeuralEstimator has passed our assessments and is deemed to be well calibrated, it may be used to make inference with observed data.

3 Examples

We now consider examples where the data are univariate (Section 3.1), multivariate and unstructured (Section 3.2), or collected over a regular grid (Section 3.3). These examples focus on neural Bayes estimation using a PointEstimator: for a list of other estimator types available in the package (e.g., for neural posterior or ratio estimation), see here.

Before proceeding, we load the required packages: see the README for instructions on installing Julia and the Julia packages NeuralEstimators and Flux. If you have access to an NVIDIA graphics processing unit (GPU), you can utilise it by simply adding CUDA to the list of Julia packages in the final line below:

library("NeuralEstimators")
library("JuliaConnectoR")
library("ggplot2")
library("ggpubr") 
library("latex2exp")
juliaEval('using NeuralEstimators, Flux') 
#> Starting Julia ...

3.1 Univariate data

Here, we develop a neural Bayes estimator for \(\boldsymbol{\theta} \equiv (\mu, \sigma)'\) from data \(\boldsymbol{Z} \equiv (Z_1, \dots, Z_m)'\), where each \(Z_i \overset{\mathrm{iid}}\sim \mathrm{Gau}(\mu, \sigma^2)\).

First, we sample parameters from the prior to construct parameter sets used for training and validating the estimator. Here, we use the priors \(\mu \sim \rm{Gau}(0, 1)\) and \(\sigma \sim \rm{Gamma}(1, 1)\), and we assume that the parameters are independent a priori. The sampled parameters are stored as \(d \times K\) matrices, with \(d\) the number of parameters in the model and \(K\) the number of sampled parameter vectors.

# Sampling from the prior 
# K: number of samples to draw from the prior
sampler <- function(K) {
  mu    <- rnorm(K)
  sigma <- rgamma(K, 1)
  Theta <- matrix(c(mu, sigma), byrow = TRUE, ncol = K)
  return(Theta)
}
K <- 10000
theta_train <- sampler(K)
theta_val   <- sampler(K/10)

Next, we define the statistical model implicitly through data simulation. The simulated data are stored as a list, where each element of the list corresponds to a data set simulated conditionally on one parameter vector. The storage type of each data set depends on the multivariate structure of the data (e.g., a matrix for unstructured multivariate data, a multidimensional array for gridded data). In this example, each replicate \(Z_1, \dots, Z_m\) is univariate, so the data are stored as a matrix with \(n = 1\) row and \(m\) columns:

# Marginal simulation from the statistical model
# theta: a matrix of parameters drawn from the prior
# m: number of conditionally independent replicates for each parameter vector
simulate <- function(Theta, m) {
  apply(Theta, 2, function(theta) t(rnorm(m, theta[1], theta[2])), simplify = FALSE)
}

m <- 30
Z_train <- simulate(theta_train, m)
Z_val   <- simulate(theta_val, m)

Let’s visualise a few elements of the training set:

mu    <- theta_train[1, 1:4]
sigma <- theta_train[2, 1:4]
Z     <- Z_train[1:4]

df <- Map(function(z, m, s) {
  data.frame(Z = c(z), mu = m, sigma = s)
  }, Z, mu, sigma)
df <- do.call(rbind, df)

df$theta <- paste0("$\\mu$ = ", round(df$mu, 3), ", $\\sigma$ = ", round(df$sigma, 3))
df$theta <- as.factor(df$theta)
levels(df$theta) <- sapply(levels(df$theta), TeX)

ggplot(df) + 
  geom_histogram(aes(x = Z), bins = 30, fill = "blue", alpha = 0.5, color = "black") +
  facet_grid(~theta, labeller = label_parsed) +
  theme_bw()

We now design our neural network.

As we are constructing a neural Bayes estimator, the neural network is a mapping \(\mathscr{Z}\to\Theta\), and the dimensionality of the neural-network output is therefore \(d \equiv \textrm{dim}(\Theta) = 2\).

Since our data are replicated, we adopt the DeepSets framework, implemented in NeuralEstimators with the type DeepSet. DeepSets consist of two neural networks: an inner network and an outer network. The inner network extracts summary statistics from the data, and its architecture depends on the multivariate structure of the data. For unstructured data (i.e., data without spatial or temporal correlation within each replicate), we use a multilayer perceptron (MLP). The input dimension matches the dimensionality of each data replicate, while the output dimension corresponds to the number of summary statistics appropriate for the model (a common choice is \(d\)). The outer network maps the learned summary statistics to the output space (here, the parameter space, \(\Theta\)). The outer network is always an MLP.

Below is an example of a DeepSets architecture for neural Bayes estimation in this example. Note that many models have parameter constraints (e.g., variance and range parameters that must be strictly positive). These constraints can be incorporated in the final layer of the neural network by choosing appropriate activation functions for each parameter. Here, we enforce the constraint \(\sigma > 0\) by applying the softplus activation function in the final layer of the outer network, ensuring that all parameter estimates are valid. For some additional ways to constrain parameter estimates, see Output layers.

# Initialise the estimator
estimator <- juliaEval('
  n = 1    # dimension of each data replicate (univariate)
  d = 2    # dimension of the parameter vector θ
  w = 128  # width of each hidden layer 
  
  # Final layer has output dimension d and enforces parameter constraints
  final_layer = Parallel(
      vcat,
      Dense(w, 1, identity),     # mean parameter: no constraints
      Dense(w, 1, softplus)      # standard-deviation parameter: strictly positive
  )

  # Construct inner and outer networks and combine into DeepSet
  psi = Chain(Dense(n, w, relu), Dense(w, w, relu))    
  phi = Chain(Dense(w, w, relu), final_layer)          
  network = DeepSet(psi, phi)

  # Wrap the neural network as a PointEstimator
  estimator = PointEstimator(network)
')

Next, we train the estimator using train(), here using the default mean-absolute-error loss, so that the resulting neural Bayes estimator approximates the marginal posterior medians. We use the sets of parameters and simulated data constructed earlier; one may alternatively use the arguments sampler and simulator to pass functions for sampling from the prior and simulating from the model, respectively, which facilitates the technique known as “simulation on-the-fly”. These functions can be defined in R (as we have done in this example) or in Julia using the package JuliaConnectoR (see the help file for train()).

# Train the estimator
estimator <- train(
  estimator,
  theta_train = theta_train,
  theta_val   = theta_val,
  Z_train = Z_train,
  Z_val   = Z_val
  )
#> Constructing the training set...
#> Constructing the validation set...
#> Computing the initial validation risk... Initial validation risk = 0.74764067
#> Epoch:   1  Training risk: 0.282  Validation risk: 0.199  Learning rate: 5.00E-04  Epoch time: 20.288 seconds
#> Epoch:   2  Training risk: 0.184  Validation risk: 0.166  Learning rate: 5.00E-04  Epoch time: 2.816 seconds
#> Epoch:   3  Training risk: 0.165  Validation risk: 0.155  Learning rate: 5.00E-04  Epoch time: 2.814 seconds
#> Epoch:   4  Training risk: 0.156  Validation risk: 0.15  Learning rate: 5.00E-04  Epoch time: 2.828 seconds
#> Epoch:   5  Training risk: 0.151  Validation risk: 0.153  Learning rate: 4.99E-04  Epoch time: 2.779 seconds
#> Epoch:   6  Training risk: 0.148  Validation risk: 0.14  Learning rate: 4.98E-04  Epoch time: 2.782 seconds
#> Epoch:   7  Training risk: 0.146  Validation risk: 0.14  Learning rate: 4.97E-04  Epoch time: 2.937 seconds
#> Epoch:   8  Training risk: 0.143  Validation risk: 0.138  Learning rate: 4.96E-04  Epoch time: 2.838 seconds
#> Epoch:   9  Training risk: 0.141  Validation risk: 0.135  Learning rate: 4.94E-04  Epoch time: 2.99 seconds
#> Epoch:  10  Training risk: 0.14  Validation risk: 0.145  Learning rate: 4.92E-04  Epoch time: 2.795 seconds
#> Epoch:  11  Training risk: 0.139  Validation risk: 0.135  Learning rate: 4.90E-04  Epoch time: 2.765 seconds
#> Epoch:  12  Training risk: 0.138  Validation risk: 0.135  Learning rate: 4.88E-04  Epoch time: 2.797 seconds
#> Epoch:  13  Training risk: 0.138  Validation risk: 0.134  Learning rate: 4.85E-04  Epoch time: 2.756 seconds
#> Epoch:  14  Training risk: 0.137  Validation risk: 0.133  Learning rate: 4.82E-04  Epoch time: 2.599 seconds
#> Epoch:  15  Training risk: 0.136  Validation risk: 0.134  Learning rate: 4.79E-04  Epoch time: 2.771 seconds
#> Epoch:  16  Training risk: 0.135  Validation risk: 0.137  Learning rate: 4.76E-04  Epoch time: 2.731 seconds
#> Epoch:  17  Training risk: 0.136  Validation risk: 0.132  Learning rate: 4.73E-04  Epoch time: 2.789 seconds
#> Epoch:  18  Training risk: 0.134  Validation risk: 0.131  Learning rate: 4.69E-04  Epoch time: 2.72 seconds
#> Epoch:  19  Training risk: 0.135  Validation risk: 0.132  Learning rate: 4.65E-04  Epoch time: 2.528 seconds
#> Epoch:  20  Training risk: 0.134  Validation risk: 0.131  Learning rate: 4.61E-04  Epoch time: 2.708 seconds
#> Epoch:  21  Training risk: 0.134  Validation risk: 0.128  Learning rate: 4.57E-04  Epoch time: 2.776 seconds
#> Epoch:  22  Training risk: 0.133  Validation risk: 0.131  Learning rate: 4.52E-04  Epoch time: 2.735 seconds
#> Epoch:  23  Training risk: 0.133  Validation risk: 0.129  Learning rate: 4.48E-04  Epoch time: 2.789 seconds
#> Epoch:  24  Training risk: 0.132  Validation risk: 0.131  Learning rate: 4.43E-04  Epoch time: 2.707 seconds
#> Epoch:  25  Training risk: 0.131  Validation risk: 0.127  Learning rate: 4.38E-04  Epoch time: 2.589 seconds
#> Epoch:  26  Training risk: 0.132  Validation risk: 0.127  Learning rate: 4.32E-04  Epoch time: 2.707 seconds
#> Epoch:  27  Training risk: 0.131  Validation risk: 0.13  Learning rate: 4.27E-04  Epoch time: 2.748 seconds
#> Epoch:  28  Training risk: 0.132  Validation risk: 0.127  Learning rate: 4.21E-04  Epoch time: 2.752 seconds
#> Epoch:  29  Training risk: 0.131  Validation risk: 0.127  Learning rate: 4.15E-04  Epoch time: 3.204 seconds
#> Epoch:  30  Training risk: 0.13  Validation risk: 0.13  Learning rate: 4.09E-04  Epoch time: 3.007 seconds
#> Epoch:  31  Training risk: 0.13  Validation risk: 0.126  Learning rate: 4.03E-04  Epoch time: 2.824 seconds
#> Epoch:  32  Training risk: 0.131  Validation risk: 0.125  Learning rate: 3.97E-04  Epoch time: 2.646 seconds
#> Epoch:  33  Training risk: 0.13  Validation risk: 0.128  Learning rate: 3.91E-04  Epoch time: 2.844 seconds
#> Epoch:  34  Training risk: 0.129  Validation risk: 0.127  Learning rate: 3.84E-04  Epoch time: 2.778 seconds
#> Epoch:  35  Training risk: 0.13  Validation risk: 0.126  Learning rate: 3.77E-04  Epoch time: 2.75 seconds
#> Epoch:  36  Training risk: 0.129  Validation risk: 0.126  Learning rate: 3.70E-04  Epoch time: 2.876 seconds
#> Epoch:  37  Training risk: 0.129  Validation risk: 0.126  Learning rate: 3.63E-04  Epoch time: 2.973 seconds
#> Epoch:  38  Training risk: 0.129  Validation risk: 0.127  Learning rate: 3.56E-04  Epoch time: 2.833 seconds
#> Stopping early since the validation loss has not improved in 5 epochs
#> Finished training in 124.2119135 seconds

If the argument savepath in train() is specified, the neural-network state (e.g., the weights and biases) will be saved during training as bson files, and the best state (as measured by validation risk) will be saved as best_network.bson. To load these saved states into a neural network in later R sessions, one may use the function loadstate(). Note that one may also manually save a trained estimator using savestate().

Once a neural Bayes estimator has been trained, its performance should be assessed. Simulation-based (empirical) methods for evaluating the performance of a neural Bayes estimator are ideal, since simulation is already required for constructing the estimator, and because the estimator can be applied to thousands of simulated data sets at almost no computational cost.

To assess the accuracy of the resulting neural Bayes estimator, one may use the function assess(). The resulting object can then be passed to the functions risk(), bias(), and rmse() to compute a Monte Carlo approximation of the Bayes risk, the bias, and the RMSE, or passed to the function plotestimates() to visualise the estimates against the corresponding true values:

# Assess the estimator
theta_test <- sampler(1000)
Z_test     <- simulate(theta_test, m)
assessment <- assess(estimator, theta_test, Z_test, estimator_names = "NBE")
plotestimates(assessment, parameter_labels = c("θ1" = expression(mu), "θ2" = expression(sigma)))

Once the neural Bayes estimator has been trained and assessed, it can be applied to observed data using the function estimate(), and non-parametric bootstrap estimates can be computed using the function bootstrap(). Below, we use simulated data as a surrogate for observed data:

theta    <- as.matrix(c(0, 0.5))     # true parameters
Z        <- simulate(theta, m)       # pretend that this is observed data
estimate(estimator, Z)               # point estimate from the "observed data"
#>             [,1]
#> [1,] -0.04679317
#> [2,]  0.48535904
bootstrap(estimator, Z)[, 1:6]       # non-parametric bootstrap estimates
#>            [,1]      [,2]       [,3]       [,4]       [,5]        [,6]
#> [1,] -0.1178385 0.1237657 -0.1555784 -0.1538627 0.09082093 0.005188575
#> [2,]  0.5093642 0.4116983  0.4775196  0.4497076 0.49067265 0.481571317

3.2 Multivariate data

Suppose now that our data consists of \(m\) replicates of a \(d\)-dimensional multivariate distribution.

For unstructured \(d\)-dimensional data, the estimator is based on a multilayer perceptron (MLP), and each data set is stored as a two-dimensional array (a matrix), with the second dimension storing the independent replicates. That is, in this case, we store the data as a list of \(d \times m\) matrices, and the inner network of the DeepSets representation has \(d\) neurons in the input layer.

For example, consider the task of estimating \(\boldsymbol{\theta} \equiv (\mu_1, \mu_2, \sigma, \rho)'\) from data \(\boldsymbol{Z}_1, \dots, \boldsymbol{Z}_m\) that are independent and identically distributed according to the bivariate Gaussian distribution, \[ \rm{Gau}\left(\begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}, \sigma^2\begin{bmatrix} 1 & \rho \\ \rho & 1\end{bmatrix}\right). \] Then, to construct a neural Bayes estimator for this model, one may use the following code for defining a prior, the data simulator, and the neural-network architecture:

# Sampling from the prior 
# K: number of samples to draw from the prior
sampler <- function(K) {
  mu1    <- rnorm(K)
  mu2    <- rnorm(K)
  sigma  <- rgamma(K, 1)
  rho    <- runif(K, -1, 1)
  theta  <- matrix(c(mu1, mu2, sigma, rho), byrow = TRUE, ncol = K)
  return(theta)
}

# Marginal simulation from the statistical model
# theta: a matrix of parameters drawn from the prior
# m: number of conditionally independent replicates for each parameter vector
simulate <- function(Theta, m) {
  apply(Theta, 2, function(theta) {
    mu    <- c(theta[1], theta[2])
    sigma <- theta[3]
    rho   <- theta[4]
    Sigma <- sigma^2 * matrix(c(1, rho, rho, 1), 2, 2)
    Z <- MASS::mvrnorm(m, mu, Sigma)
    t(Z)
  }, simplify = FALSE)
}

# Initialise the estimator
estimator <- juliaEval('
  d = 2    # dimension of each replicate 
  w = 32   # number of neurons in each hidden layer
  
  # Layer to ensure valid estimates
  final_layer = Parallel(
      vcat,
      Dense(w, 2, identity), # mean parameters
      Dense(w, 1, softplus), # variance parameter
      Dense(w, 1, tanh)      # correlation parameter
    )

  psi = Chain(Dense(d, w, relu), Dense(w, w, relu), Dense(w, w, relu))
  phi = Chain(Dense(w, w, relu), Dense(w, w, relu), final_layer)
  deepset = DeepSet(psi, phi)
  estimator = PointEstimator(deepset)
')

Let’s visualise a few realisations from the current model:

theta <- sampler(3)
Z     <- simulate(theta, m = 250)
mu1   <- theta[1, 1:3]
mu2   <- theta[2, 1:3]
sigma <- theta[3, 1:3]
rho   <- theta[4, 1:3]

df <- Map(function(z, m1, m2, s, r) {
  data.frame(Z1 = z[1, ], Z2 = z[2, ], mu1 = m1, mu2 = m2, sigma = s, rho = r)
  }, Z, mu1, mu2, sigma, rho)
df <- do.call(rbind, df)

df$theta <- paste0(
  "$\\mu_1$ = ", round(df$mu1, 2), ", ",
  "$\\mu_2$ = ", round(df$mu2, 2), ", ",
  "$\\sigma$ = ", round(df$sigma, 2), ", ",
  "$\\rho$ = ", round(df$rho, 2)
  )
df$theta <- as.factor(df$theta)
levels(df$theta) <- sapply(levels(df$theta), TeX)

ggplot(df) + 
  geom_point(aes(x = Z1, y = Z2), alpha = 0.5) +
  facet_grid(~theta, labeller = label_parsed) +
  theme_bw()

The training, assessment, and application of the neural Bayes estimator then remains as in the example above:

# Construct training and validation data sets
K <- 5000
m <- 250
theta_train <- sampler(K)
theta_val   <- sampler(K/10)
Z_train <- simulate(theta_train, m)
Z_val   <- simulate(theta_val, m)

# Train the estimator
estimator <- train(
  estimator,
  theta_train = theta_train,
  theta_val   = theta_val,
  Z_train = Z_train,
  Z_val   = Z_val, 
  epochs = 20
  )
#> Constructing the training set...
#> Constructing the validation set...
#> Computing the initial validation risk... Initial validation risk = 0.73066986
#> Epoch:  1  Training risk: 0.544  Validation risk: 0.328  Learning rate: 5.00E-04  Epoch time: 5.964 seconds
#> Epoch:  2  Training risk: 0.253  Validation risk: 0.212  Learning rate: 5.00E-04  Epoch time: 2.672 seconds
#> Epoch:  3  Training risk: 0.201  Validation risk: 0.195  Learning rate: 4.97E-04  Epoch time: 2.688 seconds
#> Epoch:  4  Training risk: 0.189  Validation risk: 0.187  Learning rate: 4.88E-04  Epoch time: 2.941 seconds
#> Epoch:  5  Training risk: 0.181  Validation risk: 0.176  Learning rate: 4.73E-04  Epoch time: 2.683 seconds
#> Epoch:  6  Training risk: 0.172  Validation risk: 0.17  Learning rate: 4.52E-04  Epoch time: 3.037 seconds
#> Epoch:  7  Training risk: 0.167  Validation risk: 0.166  Learning rate: 4.27E-04  Epoch time: 2.818 seconds
#> Epoch:  8  Training risk: 0.161  Validation risk: 0.159  Learning rate: 3.97E-04  Epoch time: 2.813 seconds
#> Epoch:  9  Training risk: 0.157  Validation risk: 0.157  Learning rate: 3.63E-04  Epoch time: 2.579 seconds
#> Epoch: 10  Training risk: 0.153  Validation risk: 0.154  Learning rate: 3.27E-04  Epoch time: 2.626 seconds
#> Epoch: 11  Training risk: 0.151  Validation risk: 0.149  Learning rate: 2.89E-04  Epoch time: 2.787 seconds
#> Epoch: 12  Training risk: 0.149  Validation risk: 0.15  Learning rate: 2.50E-04  Epoch time: 2.892 seconds
#> Epoch: 13  Training risk: 0.147  Validation risk: 0.149  Learning rate: 2.11E-04  Epoch time: 2.68 seconds
#> Epoch: 14  Training risk: 0.145  Validation risk: 0.147  Learning rate: 1.73E-04  Epoch time: 2.724 seconds
#> Epoch: 15  Training risk: 0.144  Validation risk: 0.147  Learning rate: 1.37E-04  Epoch time: 2.61 seconds
#> Epoch: 16  Training risk: 0.143  Validation risk: 0.145  Learning rate: 1.03E-04  Epoch time: 2.728 seconds
#> Epoch: 17  Training risk: 0.143  Validation risk: 0.144  Learning rate: 7.32E-05  Epoch time: 2.611 seconds
#> Epoch: 18  Training risk: 0.142  Validation risk: 0.144  Learning rate: 4.77E-05  Epoch time: 2.82 seconds
#> Epoch: 19  Training risk: 0.142  Validation risk: 0.143  Learning rate: 2.72E-05  Epoch time: 2.741 seconds
#> Epoch: 20  Training risk: 0.142  Validation risk: 0.143  Learning rate: 1.22E-05  Epoch time: 2.8 seconds
#> Finished training in 58.464254333 seconds

# Assess the estimator
theta_test <- sampler(1000)
Z_test     <- simulate(theta_test, m)
assessment <- assess(estimator, theta_test, Z_test, 
                     estimator_names = "NBE", 
                     parameter_names = c("mu1", "mu2", "sigma", "rho")) 
plotestimates(assessment)

3.3 Gridded data

For data collected over a regular grid, neural estimators are typically based on a convolutional neural network (CNN). We give specific attention to this case in a separate vignette. There, we also outline how to perform neural Bayes estimation with incomplete/missing data based on methods described by Wang et al. (2024) and Sainsbury-Dale et al. (2025b).