psqn: Partially Separable Quasi-Newton

\[\renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\diag{\text{diag}} \def\argmin{\text{arg}\,\text{min}} \def\Expe{\text{E}}\]

This package provides an optimization method for partially separable functions. Partially separable functions are of the following form:

\[\vec x_{\mathcal I_i} = (\vec e_{j_{i1}}^\top, \dots ,\vec e_{j_{im_i}}^\top)\vec x, \qquad \mathcal I_i = (j_{i1}, \dots, \mathcal j_{im_i}) \subseteq \{1, \dots, l\},\] and $\vec e_k$ is the $k$’th column of the $l$ dimensional identity matrix. Each function $f_i$ is called an element function and only depends on $m_i \ll l$ parameters. This allows for an efficient quasi-Newton method when all the $m_i$’s are much smaller than the dimension of the parameter vector $\vec x$, $l$. The framework can be extended to allow for a linear combination of parameters but we do not cover such problems. This vignette closely follows Nocedal and Wright (2006) who cover the methods and alternatives in much greater detail.

We first focus on a more restricted form of the problem. See the section called Generic Example for the more general interface provided by this package. Assume that each index set $\mathcal I_i$ is of the form:

\[\begin{align*} \mathcal I_i &= \{1,\dots, p\} \cup \mathcal J_i \\ \mathcal J_i \cap \mathcal J_j &= \emptyset \qquad j\neq i \\ \mathcal J_i \cap \{1,\dots, p\} &= \emptyset \qquad \forall i = 1,\dots, n \end{align*}.\]

That is, each index set contains $p$ global parameters and $q_i = \lvert\mathcal J_i\rvert$ private parameters which are particular for each element function, $f_i$. For implementation reason, we let:

\[\begin{align*} \overleftarrow q_i &= \begin{cases} p & i = 0 \\ p + \sum_{k = 1}^i q_k & i > 0 \end{cases} \\ \mathcal J_i &= \{1 + \overleftarrow q_{i - 1}, \dots , q_i + \overleftarrow q_{i - 1}\} \end{align*}\]

such that the element functions’ private parameters lies in consecutive parts of $\vec x$.

Example

We are going to consider a Taylor approximation for a generalized linear mixed model. In particular, we focus on a mixed logit regression where:

\[\begin{align*} \vec U_i &\sim N^{(r)}(\vec 0, \mat\Sigma) \\ \vec\eta_i &= \mat X_i\vec\beta + \mat Z_i\vec U_i \\ Y_{ij} &\sim \text{Bin}(\text{logit}^{-1}(\eta_{ij}), 1), \qquad j = 1, \dots, t_i \end{align*}\]

where $N^{(r)}(\vec\mu,\mat\Sigma)$ means a $r$-dimensional a multivariate normal distribution with mean $\vec\mu$ and covariance matrix $\mat\Sigma$ and $\text{Bin}(p, k)$ means a binomial distribution probability $p$ and size $k$. $\vec U_i$ is an unknown random effect with an unknown covariance $\mat\Sigma$ and $\vec\beta\in\mathbb{R}^p$ are unknown fixed effect coefficients. $\mat X_i$ and $\mat Z_i$ are known design matrices each with $t_i$ rows for each of the $t_i$ observed outcomes, the $y_{ij}$s.

As part of a Taylor approximation, we find a mode of $\vec x = (\vec\beta^\top, \widehat{\vec u}_1^\top, \dots, \widehat{\vec u}_n^\top)$ of the log of the integrand given a covariance matrix estimate, $\widehat{\mat \Sigma}$. That is, we are minimizing:

\[\begin{align*} f(\vec x) &= -\sum_{i = 1}^n \left( \sum_{k = 1}^{t_i}(y_{ij}\eta_{ij} - \log(1 + \exp\eta_{ij})) - \frac 12 \widehat{\vec u}_i^\top\widehat{\mat \Sigma}^{-1} \widehat{\vec u}_i \right) \\ &= -\sum_{i = 1}^n \left( \vec y_i(\mat X_i\vec\beta + \mat Z_i\widehat{\vec u}_i) - \sum_{k = 1}^{t_i} \log(1 + \exp(\vec x_{ik}^\top\vec\beta + \vec z_{ik}^\top\widehat{\vec u}_i)) - \frac 12 \widehat{\vec u}_i^\top\widehat{\mat \Sigma}^{-1} \widehat{\vec u}_i \right) \\ &= \sum_{i = 1}^nf_i((\vec\beta^\top, \widehat{\vec u}_i^\top)^\top) \\ f_i((\vec\beta^\top, \vec u^\top)^\top) &= -\vec y_i(\mat X_i\vec\beta + \mat Z_i\vec u) + \sum_{k = 1}^{t_i} \log(1 + \exp(\vec x_{ik}^\top\vec\beta + \vec z_{ik}^\top\vec u)) + \frac 12 \vec u^\top\widehat{\mat \Sigma}^{-1} \vec u \end{align*}\]

In this problem, $\vec\beta$ are the global parameters and the $\widehat{\vec u}_i$’s are the private parameters. Thus, $l = p + nr$. We will later return to this example with an implementation which uses this package.

Variational Approximations

The objective function for variational approximations for mixed models for clustered data is commonly also partially separable. We will briefly summarize the idea here. Ormerod and Wand (2012) and Ormerod (2011) are examples where one might benefit from using the methods in this package.

We let $\tilde f_i$ be the log marginal likelihood term from cluster $i$. This is of the form:

\[ \tilde f_i(\vec\omega) = \log \int p_i(\vec y_i, \vec u;\vec\omega)\der \vec u \]

where $\vec\omega$ are unknown model parameters, $p_i(\vec u;\vec\omega)$ is the joint density of the observed data denoted by $\vec y_i$, and $\vec U_i$ which is a cluster specific random effect. $\exp \tilde f_i(\vec\omega)$ is often intractable. An approximation of $\tilde f_i$ is to select some variational distribution denoted by $v_i$ parameterized by some set $\Theta_i$. We then use the approximation:

\[ \begin{align*} \tilde f_i(\vec\omega) &= \int v_i(\vec u; \vec\theta_i) \log\left( \frac{p_i(\vec y_i \vec u;\vec\omega)/v_i(\vec u; \vec\theta_i)} {p_i(\vec u_i \mid \vec y_i;\vec\omega)/v_i(\vec u; \vec\theta_i)} \right)\der\vec u \\ &= \int v_i(\vec u; \vec\theta_i) \log\left( \frac{p_i(\vec y_i, \vec u;\vec\omega)} {v_i(\vec u; \vec\theta_i)} \right)\der\vec u + \int v_i(\vec u; \vec\theta_i) \log\left( \frac{v_i(\vec u; \vec\theta_i)} {p_i(\vec u \mid \vec y_i;\vec\omega)} \right)\der\vec u \\ &\geq \int v_i(\vec u; \vec\theta_i) \log\left( \frac{p_i(\vec y_i, \vec u;\vec\omega)} {v_i(\vec u; \vec\theta_i)} \right)\der\vec u = f_i(\vec\omega,\vec\theta_i) \end{align*} \]

where $\vec\theta_i\in\Theta_i$ and $p_i(\vec u_i \mid \vec y_i;\vec\omega)$ is the conditional density of the random effect given the observed data, $\vec y_i$, and model parameters, $\vec\omega$. $f_i(\vec\omega,\vec\theta_i)$ is a lower bound since the Kullback–Leibler divergence

\[ \int v_i(\vec u; \vec\theta_i)\log\left( \frac{v_i(\vec u; \vec\theta_i)} {p_i(\vec u \mid \vec y_i;\vec\omega)} \right)\der\vec u \]

is positive. The idea is to replace the minimization problem:

\[ \argmin_{\vec\omega} -\sum_{i = 1}^n \tilde f_i(\vec\omega) \]

with a variational approximation:

\[ \argmin_{\vec\omega,\vec\theta_1,\dots,\vec\theta_n} -\sum_{i = 1}^n f_i(\vec\omega,\vec\theta_i) \]

This problem fits into the framework in the package where $\vec\omega$ are the global parameters and the $\vec\theta_i$s are the private parameters.

Variational approximation have the property that if $v_i(\vec u; \vec\theta_i) = p_i(\vec u \mid \vec y_i;\vec\omega)$ then the Kullback–Leibler divergence is zero and the lower bound is equal to the log marginal likelihood. Thus, we need to use a family of variational distributions, $v_i$, which yields a close approximation of the conditional density of the random effects, $p_i(\vec u \mid \vec y_i;\vec\omega)$, for some $\vec\theta_i\in\Theta_i$. Moreover, the lower bound also needs to be easy to optimize. Variational approximations have an advantage that given estimates of $\widehat{\vec\omega},\widehat{\vec\theta}_1,\dots,\widehat{\vec\theta}_n$ then subsequent inference can be approximated using:

\[ \Expe\left(h(\vec U_i)\right) = \int h(\vec u) p_i(\vec u \mid \vec y_i;\vec\omega)\der\vec u \approx \int h(\vec u) v_i(\vec u; \widehat{\vec\theta}_i)\der\vec u. \]

The latter integral may be much easier to work with for some functions $h$ and variational distribution, $v_i$.

Quasi-Newton Method for Partially Separable Functions

We are going to assume some prior knowledge of Newton’s method and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm and we only provide a few details of these methods. However, will need a bit of notations from these methods to motivate the quasi-Newton method we have implemented.

Newton’s method to minimize a function is to start at some value $\vec x_0$. Then we set $k = 1$ and

compute a direction $\vec p_k$ given by \[\nabla^2 f(\vec x_{k - 1})\vec p_k = - \nabla f(\vec x_{k -1}),\]
set $\vec x_k = \vec x_{k - 1} + \vec p_k$ or $\vec x_k = \vec x_{k - 1} + \gamma\vec p_k$ for $\gamma \in (0, 1]$ set to satisfy the Wolfe conditions, and
repeat with $k\leftarrow k + 1$ if a convergence criterion is not satisfied.

Computing the Hessian, $\nabla^2 f(\vec x_{k - 1})$, at every iteration can be expensive. The BFGS algorithm offers an alternative where we use an approximation instead. Here we start with some Hessian approximation $\mat B_0$ and

compute a direction $\vec p_k$ given by \[\mat B_{k - 1}\vec p_k = - \nabla f(\vec x_{k -1}),\]
find a step size $\alpha$ such that $\vec x_{k - 1} + \alpha\vec p_k$ satisfy the Wolfe conditions,
set $\vec x_k = \vec x_{k - 1} + \alpha\vec p_k$, $\vec s_k = \alpha\vec p_k = \vec x_k - \vec x_{k - 1}$, $\vec d_k = \nabla f(\vec x_k) - \nabla f(\vec x_{k - 1})$,
perform a rank-two update \[\mat B_k = \mat B_{k - 1} + \frac{\vec y_k\vec y_k^\top}{\vec y_k^\top\vec s_k} - \frac{\mat B_{k - 1}\vec s_k\vec s_k^\top\mat B_{k - 1}^\top}{\vec s_k^\top\mat B_{k - 1}\vec s_k},\] and
repeat with $k\leftarrow k + 1$ if a convergence criterion is not satisfied.

This reduces the cost of computing the Hessian. Further, we can update $\mat B_k^{-1}$ to avoid solving $\mat B_{k - 1}\vec p_k = - \nabla f(\vec x_{k -1})$. The matrix $\mat B_k^{-1}$ will still be large and dense when $l$ is large.

Using Partial Separability

As an alternative, we can exploit the structure of the problem we are solving. Let

\[\mat H_i = (\vec e_{j_{i1}}^\top, \dots ,\vec e_{j_{im_i}}^\top).\]

The true Hessian in our case is sparse and given by

\[\nabla^2 f(\vec x) = \sum_{i = 1}^n \mat H_i^\top\nabla^2f_i(\vec x_{\mathcal I_i})\mat H_i\]

Notice that each $\nabla^2f_i(\vec x_{\mathcal I_i})$ is only a $(p + q_i)\times (p + q_i)$ matrix. We illustrate this below with $n = 10$ element functions. Each plot is $\mat H_i^\top\nabla^2f_i(\vec x_{\mathcal I_i})\mat H_i$ where black entries are a non-zero.

The whole Hessian is:

We can use the partial separability to implement a BFGS method where we make $n$ BFGS approximations, one for each element function, $f_i$. Let $\mat B_{ki}$ be the approximation of $\nabla^2f_i(\vec x_{\mathcal I_i})$ at iteration $k$. Then the method we have implemented starts with $\mat B_{k1},\dots,\mat B_{kn}$ and

computes a direction $\vec p_k$ given by \[\left(\sum_{i = 1}^n\mat H_i^\top\mat B_{k - 1,i}\mat H_i\right)\vec p_k = - \nabla f(\vec x_{k -1}),\]
finds a step size $\alpha$ such that $\vec x_{k - 1} + \alpha\vec p_k$ satisfy the Wolfe conditions,
sets $\vec x_k = \vec x_{k - 1} + \alpha\vec p_k$,
performs BFGS updates for each $\mat B_{k1},\dots,\mat B_{kn}$, and
repeats with $k\leftarrow k + 1$ if a convergence criterion is not satisfied.

This seems as if it is going to be much slower as we are solving a large linear system if $l$ is large. However, we can use the conjugate gradient method we describe in the next section. This will be fast if we can perform the following matrix-vector product fast:

\[\left(\sum_{i = 1}^n\mat H_i^\top\mat B_{k - 1,i}\mat H_i\right)\vec z.\]

To elaborate on this, each $\mat H_i^\top\mat B_{k - 1,i}\mat H_i\vec z$ consists of matrix-vector product with a $o_i \times o_i$ symmetric matrix and a vector where $o_i = (p + q_i)$. This can be done in $2o_i(o_i + 1)$ flops. Thus, the total cost is $2\sum_{i = 1}^n o_i(o_i + 1)$ flops. This is in contrast to the original $2l(l + 1)$ flops with the BFGS method.

As an example suppose that $q_i = 5$ for all $n$ element functions, $n = 5000$, and $p = 10$. Then $o_i = 15$ and the matrix-vector product above requires $2\cdot 5000 \cdot 15(15 + 1) = 2400000$ flops. In contrast $l = 5000 \cdot 5 + 10 = 25010$ and the matrix-vector product in the BFGS method requires $2\cdot 25010 (25010 + 1) = 1251050220$ flops. That is 521 times more flops. Similar ratios are shown in the BFGS and Partially Separable Quasi-Newton section.

More formerly, the former is $\mathcal O(\sum_{i = 1}^n(p + q_{i})^2) = \mathcal O(np^2 + np\bar q + \sum_{i = 1}^nq_i^2)$ where $\bar q = \sum_{i = 1}^n q_i / n$ whereas the matrix-vector product in the BFGS method is $\mathcal O((p + n\bar q)^2) = \mathcal O(p^2 + pn\bar q + (n\bar q)^2)$. Thus, the former is favorable as long as $np^2 + \sum_{i = 1}^nq_i^2$ is small compared with $(n\bar q)^2$. Furthermore, the rank-two BFGS updates are cheaper and may converge faster to a good approximation. However, we should keep in mind that the original BFGS method yields an approximation of $\mat B_k^{-1}$. Thus, we do not need to solve a linear system. However, we may not need to take many conjugate gradient iterations to get a good approximation with the implemented quasi-Newton method.

Conjugate Gradient Method

The conjugate gradient method we use solves

\[\mat A\vec b = \vec v\]

which in our quasi-Newton method is

\[\left(\sum_{i = 1}^n\mat H_i^\top\mat B_{k - 1,i}\mat H_i\right)\vec p_k = - \nabla f(\vec x_{k -1})\]

We start of with some initial value $\vec x_0$. Then we set $k = 0$, $\vec r_0 = \mat A\vec x_0 - \vec v$, $\vec p_0 = -\vec r_0$, and:

find the step length \[\alpha_k = \frac{\vec r_k^\top\vec r_k}{\vec p_k^\top\mat A\vec p_k},\]
find the new value \[\vec x_{k + 1} = \vec x_k + \alpha_k\vec p_k,\]
find the new residual \[\vec r_{k + 1} = \vec r_k + \alpha_k\mat A\vec p_k,\]
set $\beta_{k + 1} = (\vec r_k^\top\vec r_k)^{-1}\vec r_{k + 1}^\top\vec r_{k + 1}$,
set the new search direction to \[\vec p_{k + 1} = - \vec r_{k + 1} + \beta_{k + 1}\vec p_k,\] and
stop if $\vec r_{k + 1}^\top\vec r_{k + 1}$ is smaller. Otherwise set $k\leftarrow k + 1$ and repeat.

The main issue is the matrix-vector product $\mat A\vec p_k$ but as we argued in the previous section that this can be computed in $\mathcal O(\sum_{i = 1}^n(p + q_{i})^2)$ time. The conjugate gradient method will at most take $h$ iterations where $h$ is the number of rows and columns of $\mat A$. Moreover, if $\mat A$ only has $r < h$ distinct eigenvalues then we will at most make $r$ conjugate gradient iterations. Lastly, if $\mat A$ has clusters of eigenvalues then we may expect to perform only a number of iterations close to the number of distinct clusters.

In practice, we terminate the conjugate gradient method when $\lVert\vec r_k\rVert < \min (c, \sqrt{\lVert\nabla f(\vec x_{k -1})\rVert})\lVert\nabla f(\vec x_{k -1})\rVert$ where $c$ is a constant the user can set. Moreover, we the package supports diagonal preconditioning, incomplete Cholesky factorization preconditioning from Eigen, and for the psqn function there is also a preconditioning which is based on a block diagonal matrix which ignores the Hessian terms between the global parameters and the private parameters. The diagonal preconditioning is very cheap and may reduce the number of required conjugate gradient iterations. The incomplete Cholesky factorization preconditioning may greatly reduce the number of required conjugate gradient iterations at the cost of having to factorize the Hessian approximation. The block diagonal approach works very well for some problems and is usually much faster than Eigen.

We can compare the flops of the matrix product in BFGS with applying the conjugate gradient method. Assume that all $q_i$s are almost $\bar q$. Then the ratio of flops is approximately:

\[ \frac{p^2 + 2pn\bar q + (n\bar q)^2} {n_{\text{cg}}(np^2 + 2pn\bar q + n\bar q^2)} \]

where $n_{\text{cg}}$ is the number of conjugate gradient iterations. Thus, to get something which is linear in the number of element functions, $n$, then we must have that:

\[ \begin{align*} \frac{n_{\text{cg}}(np^2 + 2pn\bar q + n\bar q^2)} {p^2 + 2pn\bar q + (n\bar q)^2} &\leq \frac kn \\ \Leftrightarrow n_{\text{cg}} &\leq \frac kn \frac{p^2 + 2pn\bar q + (n\bar q)^2} {np^2 + 2pn\bar q + n\bar q^2} \\ &= k\frac{p^2 + 2pn\bar q + (n\bar q)^2} {n^2(p^2 + 2p\bar q + \bar q^2)} \\ &\approx k \frac{\bar q^2}{p^2 + 2p\bar q + \bar q^2} \end{align*} \]

where $k$ is some fixed constant with $k < n$. An example of the latter ratio is shown in the BFGS and Partially Separable Quasi-Newton section.

We can get rid of the $p^2$ in the denominator by once computing the first $p\times p$ part of the Hessian approximation prior to performing the conjugate gradient method. This is implemented. The max_cg argument is added because of the considerations above.

Line Search and Wolfe Condition

We use line search and search for a point which satisfy the strong Wolfe condition by default. The constants in the Wolfe condition can be set by the user. The line search is implemented as described by Nocedal and Wright (2006) with cubic interpolation in the zoom phase.

Symmetric Rank-one Updates

Symmetric rank-one (SR1) updates are implemented as an alternative to the BFGS updates. The user can set whether the SR1 updates should be used. The SR1 updates do not guarantee that the Hessian approximation is positive definite. Thus, the conjugate gradient method only proceeds if $\vec p_k^\top\mat A\vec p_k > 0$. That is, if the new direction is a descent direction.

Example Using the Implementation

We simulate a data set below from the mixed logit model we showed earlier.

# assign model parameters, number of random effects, and fixed effects
q <- 4 # number of private parameters per cluster
p <- 5 # number of global parameters
beta <- sqrt((1:p) / sum(1:p))
Sigma <- diag(q)

# simulate a data set
n_clusters <- 800L # number of clusters
set.seed(66608927)

sim_dat <- replicate(n_clusters, {
  n_members <- sample.int(20L, 1L) + 2L
  X <- matrix(runif(p * n_members, -sqrt(6 / 2), sqrt(6 / 2)), 
              p)
  u <- drop(rnorm(q) %*% chol(Sigma))
  Z <- matrix(runif(q * n_members, -sqrt(6 / 2 / q), sqrt(6 / 2 / q)), 
              q)
  eta <- drop(beta %*% X + u %*% Z)
  y <- as.numeric((1 + exp(-eta))^(-1) > runif(n_members))
  
  list(X = X, Z = Z, y = y, u = u, Sigma_inv = solve(Sigma))
}, simplify = FALSE)

# example of the first cluster
sim_dat[[1L]]
#> $X
#>         [,1]   [,2]    [,3]
#> [1,]  0.0416 -0.809 -0.1839
#> [2,]  0.6524 -1.373 -0.9254
#> [3,] -1.3339 -0.957 -0.8708
#> [4,]  0.7547 -0.156  0.0178
#> [5,]  0.7191 -0.681 -0.7232
#> 
#> $Z
#>        [,1]   [,2]   [,3]
#> [1,]  0.167 -0.483 -0.785
#> [2,] -0.266 -0.823  0.794
#> [3,]  0.609 -0.549  0.269
#> [4,] -0.414 -0.457  0.605
#> 
#> $y
#> [1] 0 0 0
#> 
#> $u
#> [1]  0.0705 -1.7285  0.1538 -0.3245
#> 
#> $Sigma_inv
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    0    0    0
#> [2,]    0    1    0    0
#> [3,]    0    0    1    0
#> [4,]    0    0    0    1

The combined vector with global and private parameters can be created like this (it is a misnoma to call this true_params as the modes of the random effects, the private parameters, should only match the random effects if the clusters are very large):

true_params <- c(beta, sapply(sim_dat, function(x) x$u))

# global parameters
true_params[1:p]
#> [1] 0.258 0.365 0.447 0.516 0.577

# some of the private parameters
true_params[1:(4 * q) + p]
#>  [1]  0.0705 -1.7285  0.1538 -0.3245  0.2516 -0.5419 -0.5537 -0.2805 -1.1777
#> [10] -1.7539  1.7338  0.5616 -0.8379  1.2412 -1.2046  1.4547

As a reference, we will create the following function to evaluate the log of the integrand:

eval_integrand <- function(par){
  out <- 0.
  inc <- p
  beta <- par[1:p]
  for(i in seq_along(sim_dat)){
    dat <- sim_dat[[i]]
    X <- dat$X
    Z <- dat$Z
    y <- dat$y
    Sigma_inv <- dat$Sigma_inv
    
    u <- par[1:q + inc]
    inc <- inc + q
    eta <- drop(beta %*% X + u %*% Z)
    
    out <- out - drop(y %*% eta) + sum(log(1 + exp(eta))) + 
      .5 * drop(u %*% Sigma_inv %*% u)
  }
  
  out
}

# check the log integrand at true global parameters and the random effects
eval_integrand(true_params)
#> [1] 6898

We will use this function to compare with our C++ implementation.

R Implementation

A R function which we need to pass to psqn to minimize the partially separable function is given below:

# evaluates the negative log integrand. 
# 
# Args:
#   i cluster/element function index. 
#   par the global and private parameter for this cluster. It has length 
#       zero if the number of parameters is requested.
#   comp_grad logical for whether to compute the gradient. 
r_func <- function(i, par, comp_grad){
  dat <- sim_dat[[i]]
  X <- dat$X
  Z <- dat$Z
  
  if(length(par) < 1)
    # requested the dimension of the parameter
    return(c(global_dim = NROW(dat$X), private_dim = NROW(dat$Z)))
  
  y <- dat$y
  Sigma_inv <- dat$Sigma_inv
  
  beta <- par[1:p]
  uhat <- par[1:q + p]
  eta <- drop(beta %*% X + uhat %*% Z)
  exp_eta <- exp(eta)
  
  out <- -sum(y * eta) + sum(log(1 + exp_eta)) + 
    sum(uhat * (Sigma_inv %*% uhat)) / 2
  if(comp_grad){
    d_eta <- -y + exp_eta / (1 + exp_eta)
    grad <- c(X %*% d_eta, 
              Z %*% d_eta + dat$Sigma_inv %*% uhat)
    attr(out, "grad") <- grad
  }
  
  out
}

Here is a check that the above yields the same as the function we defined before:

# check the function
r_func_val <- sum(sapply(1:n_clusters, function(i)
  r_func(i, true_params[c(1:p, 1:q + (i - 1L) * q + p)], FALSE)))
all.equal(eval_integrand(true_params), r_func_val)
#> [1] TRUE

# we could check the gradient like this
if(FALSE){
  r_func_gr <- numeric(length(true_params))
  for(i in seq_along(sim_dat)){
    out_i <- r_func(i, true_params[c(1:p, 1:q + (i - 1L) * q + p)], TRUE)
    r_func_gr[1:p] <- r_func_gr[1:p] + attr(out_i, "grad")[1:p]
    r_func_gr[1:q + (i - 1L) * q + p] <- attr(out_i, "grad")[1:q + p]
  }
  
  library(numDeriv)
  gr_num <- grad(function(par) eval_integrand(par), true_params)
  all.equal(r_func_gr, gr_num, tolerance = 1e-6)
}

The partially separable function can be minimized like this:

start_val <- true_params 
start_val[  1:p ] <- 
  start_val[  1:p ] + 
  c(0.49, -0.63, -0.4, -0.33, -0.38) # ~rnorm(length(beta), sd = .5)
start_val[-(1:p)] <- 0

library(psqn)
#> Loading required package: Matrix
r_psqn_func <- function(par, n_threads = 2L, c1 = 1e-4, 
                        c2 = .9, pre_method = 1L)
  psqn(par = par, fn = r_func, n_ele_func = n_clusters, 
       n_threads = n_threads, c1 = c1, c2 = c2, pre_method = pre_method)

R_res <- r_psqn_func(start_val)

We will later compare this with the result from the C++ implementation which we provide in the next section.

C++ Implementation

We provide a C++ implementation with the package as an example of how to use this package. The location of the implementation can be found by calling system.file("mlogit-ex.cpp", package = "psqn"). The most important part of the implementation is the problem specific m_logit_func class, the get_mlogit_optimizer function, and the optim_mlogit function which are needed to perform the optimization. The content of the file is:

// we will use OpenMP to perform the computation in parallel
// [[Rcpp::plugins(openmp, cpp11)]]

// we want to use the preconditioner with the Cholesky factorizations in the
// the diagonal. This requires that we link with BLAS and LAPACK. This is
// automatically done when we depend on RcppArmadillo
#define PSQN_W_LAPACK

// we use RcppArmadillo to simplify the code
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// we change the unsigned integer type that is used by the package by defining
// the PSQN_SIZE_T macro variable
#define PSQN_SIZE_T unsigned int

// we want to use the incomplete Cholesky factorization as the preconditioner
// and therefore with need RcppEigen
#define PSQN_USE_EIGEN
// [[Rcpp::depends(RcppEigen)]]

// [[Rcpp::depends(psqn)]]
#include "psqn-Rcpp-wrapper.h"
#include "psqn-reporter.h"
#include "psqn.h"
using namespace Rcpp;
using PSQN::psqn_uint; // the unsigned integer type used in the package

/// simple function to avoid copying a vector. You can ignore this
inline arma::vec vec_no_cp(double const * x, psqn_uint const n_ele){
  return arma::vec(const_cast<double *>(x), n_ele, false);
}

/***
 implements the element function for a given cluster. The class must provide
 the member functions which we provide here.

 We do not need to inherit from the element_function class but we can do it
 to ensure that we have implemented all the member functions.
 */
class m_logit_func final : public PSQN::element_function {
  /// design matrices
  arma::mat const X, Z;
  /// outcomes
  arma::vec const y;
  /// inverse covariance matrix
  arma::mat const Sigma_inv;

public:
  m_logit_func(List data):
  X        (as<arma::mat>(data["X"        ])),
  Z        (as<arma::mat>(data["Z"        ])),
  y        (as<arma::vec>(data["y"        ])),
  Sigma_inv(as<arma::mat>(data["Sigma_inv"])) { }

  /// dimension of the global parameters
  psqn_uint global_dim() const {
    return X.n_rows;
  }
  /// dimension of the private parameters
  psqn_uint private_dim() const {
    return Z.n_rows;
  }

  /***
   computes the element function.
   @param point point to compute function at.
   */
  double func(double const *point) const {
    arma::vec const beta = vec_no_cp(point           , X.n_rows),
                       u = vec_no_cp(point + X.n_rows, Z.n_rows);

    double out(0);
    for(psqn_uint i = 0; i < y.n_elem; ++i){
      double const eta =
        arma::dot(beta, X.col(i)) + arma::dot(u, Z.col(i));
      out -= y[i] * eta - log(1 + exp(eta));
    }

    out += arma::dot(u, Sigma_inv * u) * .5;

    return out;
  }

  /***
   computes the element function and its gradient.
   @param point point to compute function at.
   @param gr gradient vector with respect to global and private parameters.
   */
  double grad
    (double const * point, double * gr) const {
    arma::vec const beta = vec_no_cp(point           , X.n_rows),
                       u = vec_no_cp(point + X.n_rows, Z.n_rows);

    // create objects to write to for the gradient
    std::fill(gr, gr + beta.n_elem + u.n_elem, 0.);
    arma::vec dbeta(gr              , beta.n_elem, false),
              du   (gr + beta.n_elem, u.n_elem   , false);

    double out(0);
    for(psqn_uint i = 0; i < y.n_elem; ++i){
      arma::vec const xi = X.unsafe_col(i),
                      zi = Z.unsafe_col(i);
      double const eta = arma::dot(beta, xi) + arma::dot(u, zi),
               exp_eta = exp(eta),
               d_eta   = y[i] - exp_eta / (1 + exp_eta);
      out -= y[i] * eta - log(1 + exp_eta);
      dbeta -= d_eta * xi;
      du    -= d_eta * zi;
    }

    arma::vec u_scaled = Sigma_inv * u;
    out += arma::dot(u, u_scaled) * .5;
    du += u_scaled;

    return out;
  }

  /***
   returns true if the member functions are thread-safe.
   */
  bool thread_safe() const {
    return true;
  }
};

/***
 as an example, we provide a toy example of an equality constraint. We impose
 that some of the parameters (including the private ones) are on a ball. You
 can skip this if you do not need constraints.
 */
class mlogit_constraint : public PSQN::base_worker,
                          public PSQN::constraint_base<mlogit_constraint>
{
  double radius_sq;
  std::vector<psqn_uint> indices_vec;

public:
  mlogit_constraint(Rcpp::IntegerVector indices_in, double const radius):
  base_worker(indices_in.size()),
  radius_sq{radius * radius}
  {
    // fill in the indices
    indices_vec.reserve(indices_in.size());
    for(int i : indices_in){
      indices_vec.emplace_back
        (static_cast<psqn_uint>(i - 1)); // assume one-based
    }
  }

  /**
   there may be non-linear in-equality constraints in the future and linear
   constraints.
   */
  PSQN::constraint_type type() const {
    return PSQN::constraint_type::non_lin_eq;
  }

  psqn_uint n_constrained() const {
    return indices_vec.size();
  }
  psqn_uint const * indices() const {
    return indices_vec.data();
  }
  double func(double const *par) const {
    double out{0};
    for(psqn_uint i = 0; i < n_constrained(); ++i){
      out += par[i] * par[i];
    }
    return out - radius_sq;
  }
  double grad(double const *par, double *gr) const {
    double out{0};
    for(psqn_uint i = 0; i < n_constrained(); ++i){
      out += par[i] * par[i];
      gr[i] = 2 * par[i];
    }
    return out - radius_sq;
  }
};


using mlogit_topim = PSQN::optimizer
  <m_logit_func, PSQN::R_reporter, PSQN::R_interrupter,
   PSQN::default_caller<m_logit_func>, mlogit_constraint>;

/***
 creates a pointer to an object which is needed in the optim_mlogit
 function.
 @param data list with data for each element function.
 @param max_threads maximum number of threads to use.
 */
// [[Rcpp::export]]
SEXP get_mlogit_optimizer(List data, unsigned const max_threads){
  psqn_uint const n_elem_funcs = data.size();
  std::vector<m_logit_func> funcs;
  funcs.reserve(n_elem_funcs);
  for(auto dat : data)
    funcs.emplace_back(List(dat));

  // create an XPtr to the object we will need
  XPtr<mlogit_topim> ptr(new mlogit_topim(funcs, max_threads));

  // return the pointer to be used later
  return ptr;
}

/***
 performs the optimization.
 @param val vector with starting value for the global and private
 parameters.
 @param ptr returned object from get_mlogit_optimizer.
 @param rel_eps relative convergence threshold.
 @param max_it maximum number iterations.
 @param n_threads number of threads to use.
 @param c1,c2 thresholds for Wolfe condition.
 @param use_bfgs boolean for whether to use SR1 or BFGS updates.
 @param trace integer where larger values gives more information during the
 optimization.
 @param cg_tol threshold for conjugate gradient method.
 @param strong_wolfe true if the strong Wolfe condition should be used.
 @param max_cg maximum number of conjugate gradient iterations in each
 iteration. Use zero if there should not be a limit.
 @param pre_method preconditioning method in conjugate gradient method.
 zero yields no preconditioning, one yields diagonal preconditioning, and
 two yields the incomplete Cholesky factorization from Eigen.
 @param gr_tol convergence tolerance for the Euclidean norm of the gradient.
 A negative value yields no check.
 */
// [[Rcpp::export]]
List optim_mlogit
  (NumericVector val, SEXP ptr, double const rel_eps, unsigned const max_it,
   unsigned const n_threads, double const c1,
   double const c2, bool const use_bfgs = true, int const trace = 0L,
   double const cg_tol = .5, bool const strong_wolfe = true,
   psqn_uint const max_cg = 0L, int const pre_method = 1L,
   double const gr_tol = 1.){
  XPtr<mlogit_topim> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("optim_mlogit: invalid parameter size");

  NumericVector par = clone(val);
  optim->set_n_threads(n_threads);
  auto res = optim->optim(&par[0], rel_eps, max_it, c1, c2,
                          use_bfgs, trace, cg_tol, strong_wolfe, max_cg,
                          static_cast<PSQN::precondition>(pre_method),
                          gr_tol);
  NumericVector counts = NumericVector::create(
    res.n_eval, res.n_grad,  res.n_cg);
  counts.names() = CharacterVector::create("function", "gradient", "n_cg");

  int const info = static_cast<int>(res.info);
  return List::create(
    _["par"] = par, _["value"] = res.value, _["info"] = info,
    _["counts"] = counts,
    _["convergence"] =  res.info == PSQN::info_code::converged );
}

/**
 like optim_mlogit but possibly with constraints. The additional parameters are

 @param consts list of lists which each has elements indices and radius. The
 former is the one-based indices of the constraint parameters and the later is
 the radius of the ball.
 @param max_it_outer maximum number of augmented Lagrangian step.
 @param penalty_start starting value for the augmented Lagrangian method.
 @param gr_tol convergence tolerance for the Euclidean norm of the gradient.
 A negative value yields no check.
 */
// [[Rcpp::export]]
List optim_aug_Lagrang_mlogit
  (NumericVector val, SEXP ptr, List consts, double const rel_eps,
   unsigned const max_it, unsigned const n_threads, double const c1,
   double const c2, unsigned const max_it_outer, double const penalty_start = 1,
   bool const use_bfgs = true,
   int const trace = 0L, double const cg_tol = .5,
   bool const strong_wolfe = true, psqn_uint const max_cg = 0L,
   int const pre_method = 1L, double const gr_tol = -1.){
  XPtr<mlogit_topim> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("optim_mlogit: invalid parameter size");

  // add the constraints
  optim->constraints.reserve(consts.size());
  for(SEXP l : consts){
    List l_list(l);
    optim->constraints.emplace_back(as<IntegerVector>(l_list["indices"]),
                                    as<NumericVector>(l_list["radius"])[0]);
  }

  NumericVector par = clone(val);
  // the multipliers for the augmented Lagrangian method
  NumericVector multipliers(consts.size(), 0.);
  optim->set_n_threads(n_threads);

  auto res = optim->optim_aug_Lagrang
    (&par[0], &multipliers[0], penalty_start,
     rel_eps, max_it,
     max_it_outer, 1e-5,  /* violations_norm_thresh */
     c1, c2,
     1.5, /* tau */
     use_bfgs, trace, cg_tol, strong_wolfe, max_cg,
     static_cast<PSQN::precondition>(pre_method),
     gr_tol);

  // must remember to remove the constraints again
  optim->constraints.clear();

  NumericVector counts = NumericVector::create(
    res.n_eval, res.n_grad,  res.n_cg, res.n_aug_Lagrang);
  counts.names() = CharacterVector::create
    ("function", "gradient", "n_cg", "n_aug_Lagrang");

  // we have to compute the function value again if we want it without the
  // additional terms from the augmented Lagrangian method
  res.value = optim->eval(&par[0], nullptr, false);

  int const info = static_cast<int>(res.info);
  return List::create(
    _["par"] = par, _["multipliers"] = multipliers, _["value"] = res.value,
      _["info"] = info, _["counts"] = counts, _["penalty"] = res.penalty,
    _["convergence"] =  res.info == PSQN::info_code::converged );
}

/***
 performs the optimization but only for the private parameters.
 @param val vector with starting value for the global and private
 parameters.
 @param ptr returned object from get_mlogit_optimizer.
 @param rel_eps relative convergence threshold.
 @param max_it maximum number iterations.
 @param n_threads number of threads to use.
 @param c1,c2 thresholds for Wolfe condition.
 @param gr_tol convergence tolerance for the Euclidean norm of the gradient.
 A negative value yields no check.
 */
// [[Rcpp::export]]
NumericVector optim_mlogit_private
  (NumericVector val, SEXP ptr, double const rel_eps, unsigned const max_it,
   unsigned const n_threads, double const c1, double const c2,
   double const gr_tol = -1.){
  XPtr<mlogit_topim> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("optim_mlogit_private: invalid parameter size");

  NumericVector par = clone(val);
  optim->set_n_threads(n_threads);
  double const res = optim->optim_priv(&par[0], rel_eps, max_it, c1, c2,
                                       gr_tol);
  par.attr("value") = res;
  return par;
}

/***
 evaluates the partially separable function.
 @param val vector with global and private parameters to evaluate the
 function at.
 @param ptr returned object from get_mlogit_optimizer.
 @param n_threads number of threads to use.
 */
// [[Rcpp::export]]
double eval_mlogit(NumericVector val, SEXP ptr, unsigned const n_threads){
  XPtr<mlogit_topim> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("eval_mlogit: invalid parameter size");

  optim->set_n_threads(n_threads);
  return optim->eval(&val[0], nullptr, false);
}

/***
 evaluates the gradient of a partially separable function.
 @param val vector with global and private parameters to evaluate the
 function at.
 @param ptr returned object from get_mlogit_optimizer.
 @param n_threads number of threads to use.
 */
// [[Rcpp::export]]
NumericVector grad_mlogit(NumericVector val, SEXP ptr,
                          unsigned const n_threads){
  XPtr<mlogit_topim> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("grad_mlogit: invalid parameter size");

  NumericVector grad(val.size());
  optim->set_n_threads(n_threads);
  grad.attr("value") = optim->eval(&val[0], &grad[0], true);

  return grad;
}

/***
 returns the current Hessian approximation.
 @param ptr returned object from get_mlogit_optimizer.
 */
// [[Rcpp::export]]
NumericMatrix get_Hess_approx_mlogit(SEXP ptr){
  XPtr<mlogit_topim> optim(ptr);

  NumericMatrix out(optim->n_par, optim->n_par);
  optim->get_hess(&out[0]);

  return out;
}

/***
 returns the current Hessian approximation as a sparse matrix.
 @param ptr returned object from get_mlogit_optimizer.
 */
// [[Rcpp::export]]
Eigen::SparseMatrix<double> get_sparse_Hess_approx_mlogit(SEXP ptr){
  return XPtr<mlogit_topim>(ptr)->get_hess_sparse();
}

/***
 returns the true Hessian as a sparse matrix.

 @param ptr returned object from get_mlogit_optimizer
 @param val where to evaluate the function at
 @param eps determines the step size given by max(eps, |x| * eps)
 @param scale scaling factor in the Richardson extrapolation
 @param tol relative convergence criteria given by max(tol, |f| * tol)
 @param order maximum number of iteration of the Richardson extrapolation
 */
// [[Rcpp::export]]
Eigen::SparseMatrix<double> true_hess_sparse
  (SEXP ptr, NumericVector val, double const eps = 0.001, double const scale = 2,
   double const tol = 0.000000001, unsigned const order = 6){

  XPtr<mlogit_topim> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("true_hess_sparse: invalid parameter size");

  return optim->true_hess_sparse(&val[0], eps, scale, tol, order);
}

/***
 sets the masked (fixed) parameters.
 @param ptr returned object from get_mlogit_optimizer.
 @param indices zero based indices of the masked parameters.
 */
// [[Rcpp::export]]
void set_masked(SEXP ptr, Rcpp::IntegerVector indices){
  XPtr<mlogit_topim>(ptr)->set_masked(indices.begin(), indices.end());
}

/***
 clears all masked (fixed) parameters.
 @param ptr returned object from get_mlogit_optimizer.
 */
// [[Rcpp::export]]
void clear_masked(SEXP ptr){
  XPtr<mlogit_topim>(ptr)->clear_masked();
}

The PSQN::R_reporter class ensures that output will be printed when one passes a trace argument which is greater than zero. The PSQN::R_interrupter class ensures that the user can interrupt the computation. These two classes can be replaced with custom classes if one wants to and provides another implementation. See the source code of this package for the required members.

We can use the code by calling Rcpp::sourceCpp to compile the code:

library(Rcpp)
sourceCpp(system.file("mlogit-ex.cpp", package = "psqn"))
#> Registered S3 methods overwritten by 'RcppEigen':
#>   method               from         
#>   predict.fastLm       RcppArmadillo
#>   print.fastLm         RcppArmadillo
#>   summary.fastLm       RcppArmadillo
#>   print.summary.fastLm RcppArmadillo

Then we can create a pointer to an optimizer and check that it yields the correct value and gradient like this:

optimizer <- get_mlogit_optimizer(sim_dat, max_threads = 4L)

stopifnot(all.equal(
  eval_integrand(true_params), 
  eval_mlogit(val = true_params, ptr = optimizer, n_threads = 2L)))

library(numDeriv)
gr_num <- grad(
  function(par) eval_mlogit(val = par, ptr = optimizer, n_threads = 2L), 
  true_params)
gr_opt <- grad_mlogit(val = true_params, ptr = optimizer, n_threads = 2L)

stopifnot(all.equal(gr_num, gr_opt, tolerance = 1e-5, 
                    check.attributes = FALSE),
          # also check the function value!
          all.equal(attr(gr_opt, "value"), 
                    eval_mlogit(val = true_params, ptr = optimizer, 
                                n_threads = 2L)))

We can now use the BFGS implementation in the optim function to compare with like this:

optim_func <- function(par, n_threads = 2L)
  optim(
    par, function(par) 
      eval_mlogit(val = par, ptr = optimizer, n_threads = n_threads), 
    function(par) 
      grad_mlogit(val = par, ptr = optimizer, n_threads = n_threads), 
    method = "BFGS", control = list(reltol = 1e-8))

bfgs_res <- optim_func(start_val)

We then use the quasi-Newton method like this:

psqn_func <- function(par, n_threads = 2L, c1 = 1e-4, 
                      c2 = .9, trace = 0L, use_bfgs = TRUE, 
                      opt_private = FALSE, pre_method = 1L){
  rel_eps <- 1e-8
  if(opt_private){
    # it may be useful to fix the global parameters and optimize the 
    # private parameters to get starting values. This is very fast as each 
    # set of parameters can be optimized separately
    par <- optim_mlogit_private(
      val = par, ptr = optimizer, rel_eps = sqrt(rel_eps), max_it = 100, 
      n_threads = n_threads, c1 = c1, c2 = c2)
  }
  optim_mlogit(val = par, ptr = optimizer, rel_eps = rel_eps, 
               max_it = 1000L, n_threads = n_threads, c1 = c1, 
               c2 = c2, trace = trace, use_bfgs = use_bfgs, 
               pre_method = pre_method)
}

psqn_res <- psqn_func(start_val)

# using SR1 updates
psqn_res_sr1 <- psqn_func(start_val, use_bfgs = FALSE)
all.equal(psqn_res_sr1$value, psqn_res$value)
#> [1] TRUE

# w/ different starting values
psqn_res_diff_start <- psqn_func(start_val, opt_private = TRUE)
all.equal(psqn_res$value, psqn_res_diff_start$value)
#> [1] TRUE

The counts element contains the number of function evaluations, gradient evaluations, and the total number of conjugate gradient iterations:

psqn_res$counts
#> function gradient     n_cg 
#>       13       12       20

# it is the same as we got from R
all.equal(psqn_res$par, R_res$par)
#> [1] TRUE
all.equal(psqn_res$value, R_res$value)
#> [1] TRUE

# compare with optim
bfgs_res$counts
#> function gradient 
#>       62       19

We can compare the solution with the solution from optim:

all.equal(bfgs_res$par, psqn_res$par)
#> [1] "Mean relative difference: 7.81e-05"
all.equal(psqn_res$value, bfgs_res$value, tolerance = 1e-8)
#> [1] TRUE
psqn_res$value - bfgs_res$value
#> [1] -5.61e-06

The optim_mlogit takes fewer iterations possibly because we quicker get a good approximation of the Hessian. Furthermore, we only take psqn_res$counts["n_cg"], 20, conjugate gradient iterations. This in contrast to the worst case scenario where we make length(start_val), 3205, iterations for just one iteration of the quasi-Newton method! We can also compare with the limited memory BFGS minimizer from the lbfgsb3c package:

library(lbfgsb3c)
lbfgsb3c_func <- function(par, n_threads = 2L)
  lbfgsb3c(par = par, function(par) 
      eval_mlogit(val = par, ptr = optimizer, n_threads = n_threads), 
    function(par) 
      grad_mlogit(val = par, ptr = optimizer, n_threads = n_threads), 
    control = list(factr = 1e-8 * 10, maxit = 1000L))

lbfgsb3c_res <- lbfgsb3c_func(start_val)

all.equal(lbfgsb3c_res$par, psqn_res$par)
#> [1] "Mean relative difference: 1.71e-05"
all.equal(lbfgsb3c_res$value, bfgs_res$value)
#> [1] TRUE
psqn_res$value - lbfgsb3c_res$value
#> [1] 2.22e-06

We can also compare with the limited memory BFGS minimizer from the lbfgs package:

library(lbfgs)
lbfgs_func <- function(par, n_threads = 2L)
  lbfgs(vars = par, function(par) 
      eval_mlogit(val = par, ptr = optimizer, n_threads = n_threads), 
    function(par) 
      grad_mlogit(val = par, ptr = optimizer, n_threads = n_threads), 
    invisible = 1)

lbfgs_res <- lbfgs_func(start_val)

all.equal(lbfgs_res$par, psqn_res$par)
#> [1] "Mean relative difference: 1.74e-05"
all.equal(lbfgs_res$value, bfgs_res$value)
#> [1] TRUE
psqn_res$value - lbfgs_res$value
#> [1] 2.22e-06

Getting the Hessian Approximation

We can get the Hessian approximation by calling the get_Hess_approx_mlogit and get_sparse_Hess_approx_mlogit we declared after calling the optimizer:

aprox_hes <- get_Hess_approx_mlogit(ptr = optimizer)
dim(aprox_hes) # quite large; requires a lot of memory
#> [1] 3205 3205

# we can also get the sparse version
aprox_hes_sparse <- get_sparse_Hess_approx_mlogit(optimizer)
all.equal(as.matrix(aprox_hes_sparse), aprox_hes, 
          check.attributes = FALSE)
#> [1] TRUE

# this require much less memory 
object.size(aprox_hes)
#> 82176416 bytes
object.size(aprox_hes_sparse)
#> 552224 bytes

# we can roughly check against the true values as follows
if(FALSE){
  # only feasible for smaller problem
  hess_true <- jacobian(
    function(par) grad_mlogit(val = par, ptr = optimizer, n_threads = 2L), 
    psqn_res$par)
  
  # should not hold exactly! Might not be that good of an approximation.
  all.equal(aprox_hes, hess_true)
  
  # the non-zero entries should match 
  v1 <- abs(hess_true) > .Machine$double.eps * 10
  v2 <- abs(aprox_hes) > .Machine$double.eps * 10
  all.equal(v1, v2)
}

# create a plot like before. Black entries are non-zero
par(mar = c(.5, .5, .5, .5))
idx <- 1:min(1000, NROW(aprox_hes))
aprox_hes <- aprox_hes[idx, idx] # reduce dimension to plot quickly
image(abs(aprox_hes[, NCOL(aprox_hes):1]) > 0, xaxt = "n", yaxt = "n",
      col = gray.colors(2L, 1, 0))

The true Hessian is very sparse.

Getting the True Hessian

We can also get the true Hessian. This is computed with numerical differentiation using Richardson extrapolation. We provide an example of this below.

# compute the hessian with the package
system.time(hess <- true_hess_sparse(val = psqn_res$par, ptr = optimizer))
#>    user  system elapsed 
#>   0.027   0.000   0.027

# we can also compute it from R
system.time(hess_R <- psqn_hess(
  val = psqn_res$par, fn = r_func, n_ele_func = n_clusters))
#>    user  system elapsed 
#>   0.543   0.000   0.543

# compare with numerical differentiation from R. We only check the first 
# elements
n_comp <- 160L
system.time(
  hess_num_deriv <- jacobian(
      function(x) {
        par <- psqn_res$par
        par[1:n_comp] <- x
        grad_mlogit(val = par, ptr = optimizer, n_threads = 2L)[1:n_comp]
      }, 
      head(psqn_res$par, n_comp)))
#>    user  system elapsed 
#>   0.623   0.008   0.317

# we got the same
all.equal(hess, hess_R)
#> [1] TRUE
all.equal(Matrix::Matrix(hess_num_deriv, sparse = TRUE),
          hess[1:n_comp, 1:n_comp])
#> [1] TRUE

Using Different Preconditioners

We can use different preconditioners. We illustrate this below. Notice that you have to define the PSQN_USE_EIGEN macro variable prior to including any of the psqn headers files if you are using the C++ interface in order to use the incomplete Cholesky factorization from Eigen. You will also have to include RcppEigen or the psqn-Rcpp-wrapper.h header.

# without any preconditioner
psqn_res_no_pre <- psqn_func(start_val, pre_method = 0L)

# we get the same
all.equal(psqn_res$value, psqn_res_no_pre$value)
#> [1] TRUE

# we mainly use more conjugate gradient steps
psqn_res       $counts
#> function gradient     n_cg 
#>       13       12       20
psqn_res_no_pre$counts
#> function gradient     n_cg 
#>       16       15       59

# with the incomplete Cholesky factorization
psqn_res_cholesky <- psqn_func(start_val, pre_method = 2L)
all.equal(psqn_res$value, psqn_res_cholesky$value)
#> [1] TRUE

# with the Cholesky factorizations in the diagonal
psqn_res_cholesky_diag <- psqn_func(start_val, pre_method = 3L)
all.equal(psqn_res$value, psqn_res_cholesky_diag$value)
#> [1] TRUE

# we use fewer conjugate gradient steps
psqn_res_cholesky_diag$counts
#> function gradient     n_cg 
#>       12       11       14
psqn_res_cholesky     $counts
#> function gradient     n_cg 
#>       12       11       10
psqn_res              $counts
#> function gradient     n_cg 
#>       13       12       20

# we can equally use the R interface
psqn_res_cholesky_R <- r_psqn_func(start_val, pre_method = 2L)

all.equal(psqn_res_cholesky_R[c("par", "value", "counts")], 
          psqn_res_cholesky  [c("par", "value", "counts")])
#> [1] TRUE

We can also illustrate why fewer iterations are needed with the block diagonal preconditioning by looking at the Eigen values. We will need fewer iterations if the Eigen values are clustered or if many are identical.

# look at the original Eigen values
aprox_hes <- get_Hess_approx_mlogit(ptr = optimizer)
hess_sub <- aprox_hes[1:(5 + 200 * 4L), 1:(5 + 200 * 4L)]
eg <- eigen(hess_sub)$values
par(mar = c(5, 5, 1, 1))
hist(eg, breaks = 100L, main = "",
     xlab = "Eigen values without preconditioning")

# compute the Hessian after the block diagonal preconditioning
hess_sub_precond <- hess_sub
hess_sub_precond[  1:5 , -(1:5)] <- 0
hess_sub_precond[-(1:5),   1:5 ] <- 0
Cm <- t(chol(hess_sub_precond))

hess_after_cond <- solve(Cm, t(solve(Cm, hess_sub)))
eg_new <- eigen(hess_after_cond)$values
hist(eg_new, breaks = 100L, main = "",
     xlab = "Eigen values with block diagonal preconditioning")

# we can compare this with a diagonal preconditioning
tmp <- diag(diag(hess_sub)^(-2))
hess_after_cond_diag <- tmp %*% hess_sub %*% tmp
eg_diag <- eigen(hess_after_cond_diag)$values
hist(eg_diag, breaks = 100L, main = "",
     xlab = "Eigen values with diagonal preconditioning")

Benchmarking

Finally, here is a benchmark to compare the computation time:

bench::mark(
  `   optim BFGS (2 threads)`  = optim_func  (start_val, n_threads = 2L),
  `         lbfgs (1 thread)` = lbfgs_func   (start_val, n_threads = 1L),
  `         lbfgs(2 threads)` = lbfgs_func   (start_val, n_threads = 2L),
  `         lbfgs(4 threads)` = lbfgs_func   (start_val, n_threads = 4L),
  `      lbfgsb3c (1 thread)` = lbfgsb3c_func(start_val, n_threads = 1L),
  `      lbfgsb3c(2 threads)` = lbfgsb3c_func(start_val, n_threads = 2L),
  `      lbfgsb3c(4 threads)` = lbfgsb3c_func(start_val, n_threads = 4L),
  `       psqn (R; 1 thread)` = r_psqn_func  (start_val, n_threads = 1L),
  `       psqn(R; 2 threads)` = r_psqn_func  (start_val, n_threads = 2L),
  `     psqn (1 thread, SR1)` = psqn_func    (start_val, n_threads = 1L, 
                                              use_bfgs = FALSE),
  `     psqn(2 threads, SR1)` = psqn_func    (start_val, n_threads = 2L, 
                                              use_bfgs = FALSE),
  `psqn  (1 thread, opt pri.)` = psqn_func   (start_val, n_threads = 1L, 
                                              opt_private = TRUE),
  `psqn (2 threads, opt pri.)` = psqn_func   (start_val, n_threads = 2L, 
                                              opt_private = TRUE),
  `  psqn (1 thread; no pre)` = psqn_func    (start_val, n_threads = 1L, 
                                              pre_method = 0L),
  `psqn (1 thread; Cholesky)` = psqn_func    (start_val, n_threads = 1L, 
                                              pre_method = 2L),
  `psqn   (1 thread; custom)` = psqn_func    (start_val, n_threads = 1L, 
                                              pre_method = 3L),
  `psqn   (2 thread; custom)` = psqn_func    (start_val, n_threads = 2L, 
                                              pre_method = 3L),
  `          psqn (1 thread)` = psqn_func    (start_val, n_threads = 1L),
  `          psqn(2 threads)` = psqn_func    (start_val, n_threads = 2L),
  `          psqn(4 threads)` = psqn_func    (start_val, n_threads = 4L),
  check = FALSE, min_time = 5)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 20 × 6
#>    expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>    <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#>  1    optim BFGS (2 threads)     1.12s    1.13s     0.885   42.38MB    0.177
#>  2          lbfgs (1 thread)  147.97ms 152.06ms     6.46    49.52MB    2.54 
#>  3          lbfgs(2 threads)  102.37ms 104.06ms     9.53    51.27MB    3.97 
#>  4          lbfgs(4 threads)   57.69ms  59.99ms    16.6      49.5MB    6.59 
#>  5       lbfgsb3c (1 thread)  263.72ms 267.98ms     3.69    36.59MB    1.17 
#>  6       lbfgsb3c(2 threads)   91.74ms  92.99ms    10.7     22.27MB    1.78 
#>  7       lbfgsb3c(4 threads)   65.63ms  70.01ms    14.0     25.97MB    2.99 
#>  8        psqn (R; 1 thread)  188.91ms  198.2ms     4.88     7.91MB    7.81 
#>  9        psqn(R; 2 threads)  200.74ms 207.79ms     4.80     7.91MB    7.59 
#> 10      psqn (1 thread, SR1)   17.54ms  17.62ms    56.6     27.58KB    0    
#> 11      psqn(2 threads, SR1)   10.14ms  10.33ms    96.4     27.58KB    0    
#> 12 psqn  (1 thread, opt pri.)  13.76ms  13.89ms    70.7     60.27KB    0    
#> 13 psqn (2 threads, opt pri.)   7.46ms   7.64ms   131.      55.16KB    0    
#> 14   psqn (1 thread; no pre)   13.98ms  14.08ms    70.8     27.58KB    0    
#> 15 psqn (1 thread; Cholesky)   46.06ms  46.61ms    21.0     27.58KB    0    
#> 16 psqn   (1 thread; custom)   12.71ms   12.8ms    77.4     27.58KB    0    
#> 17 psqn   (2 thread; custom)    6.77ms   6.91ms   144.      27.58KB    0    
#> 18           psqn (1 thread)   10.57ms  10.64ms    92.7     27.58KB    0.200
#> 19           psqn(2 threads)    5.85ms   5.97ms   165.      27.58KB    0    
#> 20           psqn(4 threads)     3.5ms    3.6ms   277.      27.58KB    0

We see a large reduction. To be fair, we can use the C interface for the limited-memory BFGS methods to avoid re-allocating the gradient at every iteration. This will reduce their computation time. The R version of the quasi-Newton method is slower mainly as the R version to evaluate the log of the integrand and its derivative is slower than the version used by all the other methods. We can illustrate this by comparing with the computation time with the eval_integrand:

bench::mark(
  `  R` = eval_integrand(true_params), 
  `C++` = eval_mlogit(val = true_params, ptr = optimizer, n_threads = 1L), 
  min_iterations = 100)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1   R          6.93ms   7.43ms      130.  185.56KB     9.79
#> 2 C++        250.98µs 253.31µs     3854.    2.49KB     0

There is a big difference. Moreover, there is an overhead with repeatedly going back and forward between R and C++. A fair comparison would use an R implementation for all methods.

Masking (Fixing) Parameters

It is possible to mask (fix) the parameters both in the R and C++ interface. We illustrate this below.

# set the parameters to mask
par_mask <- numeric(length(psqn_res$par))
set.seed(1)
idx_mask <- sample.int(length(par_mask), 250L) # random indices we mask
# add some noise to the parameters
par_mask[idx_mask] <- par_mask[idx_mask] + rnorm(length(idx_mask), sd = .1)

# fit the model with the R interface
mask_psqn_res <- 
  psqn(par = par_mask, r_func, n_ele_func = n_clusters, 
       n_threads = 1L, c1 = 1e-4, c2 = .9, pre_method = 1L, 
       mask = idx_mask - 1L) # -1L as it is zero based

# do the same with optim
mask_optim <- optim(
    par_mask[-idx_mask], function(par){ 
      par_mask[-idx_mask] <- par
      eval_mlogit(par_mask, ptr = optimizer, n_threads = 1L)
    }, 
    function(par){
      par_mask[-idx_mask] <- par
      grad_mlogit(par_mask, ptr = optimizer, n_threads = 1L)[-idx_mask]
    }, 
    method = "BFGS", control = list(reltol = 1e-8))

# we got the same
all.equal(mask_psqn_res$par[-idx_mask], mask_optim$par)
#> [1] "Mean relative difference: 8.72e-05"

# the value is lower 
psqn_res$value - mask_psqn_res$value # recall minus one times the integrand
#> [1] -65

# the fixed parameters are the same
all.equal(mask_psqn_res$par[idx_mask], par_mask[idx_mask])
#> [1] TRUE

# the C++ interface gives the same
set_masked(ptr = optimizer, indices = idx_mask - 1L)
mask_psqn_res_cpp <- psqn_func(par_mask)
clear_masked(optimizer) # remember to clear again!

# we got the same
all.equal(mask_psqn_res$par, mask_psqn_res_cpp$par)
#> [1] TRUE

Non-linear Equality Constraints

Non-linear equality constraints are supported in both the R and the C++ interface. We added a toy example in C++ code where we constrain some of the parameters to be on a ball. We use the implementation below to make a random set restrictions on some parameters.

# find the parameters we will restrict
set.seed(1)
restrict <- replicate(50L, {
  # sample number of parameters, indices, and the radius
  n_restrict <- sample(3:7, 1L)
  indices <- sample.int(length(psqn_res$par), n_restrict)
  radius <- runif(1, .2, 1.5)
  list(indices = indices, radius = radius)
}, simplify = FALSE)

# apply the restriction
psqn_restrict <- optim_aug_Lagrang_mlogit(
  val = start_val, pre_method = 1L, ptr = optimizer, rel_eps = 1e-8, 
  max_it = 1000L, n_threads = 2L, c1 = 1e-4, c2 = .9, use_bfgs = TRUE, 
  trace = 0L, max_it_outer = 100L,
  consts = restrict)

psqn_restrict$info
#> [1] -3
psqn_restrict$value
#> [1] 5299

# the value is higher (worse)
psqn_restrict$value - psqn_res$value
#> [1] 15.8

# evaluates the constraints from R (should be zero)
consts <- function(i, par, what){
  if(length(par) == 0)
    # for the R version later
    return(restrict[[i]]$indices)
  
  out <- sum(par^2) - restrict[[i]]$radius^2
  if(what == 1)
    attr(out, "grad") <- 2 * par
  out
}

# check the constraints. All ~0
sapply(seq_along(restrict), function(i)
  consts(i, psqn_restrict$par[restrict[[i]]$indices], 0))
#>  [1] -7.26e-07  3.70e-06 -1.49e-09  1.02e-11 -1.48e-07 -7.82e-13  4.79e-10
#>  [8] -1.74e-08  5.45e-12  2.30e-09 -8.24e-08 -2.22e-16  1.30e-08 -5.83e-10
#> [15]  3.89e-09  3.19e-10  1.81e-05  2.44e-15 -6.14e-09  1.14e-06 -4.05e-11
#> [22] -2.46e-09 -3.45e-09 -2.80e-08 -2.60e-09 -3.84e-09 -1.80e-09 -1.30e-11
#> [29]  8.76e-11  2.46e-06 -1.57e-05  1.07e-09  6.71e-07  9.23e-09 -4.69e-10
#> [36]  4.70e-07  2.44e-06 -1.45e-10 -5.77e-09 -1.27e-09 -2.87e-11  1.14e-10
#> [43]  5.09e-11  3.57e-10 -1.23e-07  5.71e-09 -1.05e-06  6.38e-11 -4.72e-08
#> [50]  3.10e-10

# we can do the same from R
psqn_restrict_R <- psqn_aug_Lagrang(
  par = start_val, fn = r_func, n_ele_func = n_clusters, 
  n_threads = 1L, c1 = 1e-4, c2 = .9, pre_method = 1L, 
  consts = consts, n_constraints = length(restrict))

# we got the same
all.equal(psqn_restrict_R$par, psqn_restrict$par)
#> [1] "Mean relative difference: 3.76e-05"

# check the constraints. All ~0
sapply(seq_along(restrict), function(i)
  consts(i, psqn_restrict_R$par[restrict[[i]]$indices], 0))
#>  [1]  5.87e-08 -1.22e-05 -8.06e-09  2.11e-10  1.99e-07  2.81e-09 -2.06e-07
#>  [8] -4.00e-08  1.18e-10 -1.14e-08  2.11e-07  5.82e-12 -4.92e-08 -7.93e-09
#> [15]  2.81e-10  2.17e-08  6.48e-05  1.38e-09  2.51e-08  1.72e-05 -6.48e-10
#> [22] -1.08e-07 -2.27e-08 -6.86e-07 -6.18e-10  9.69e-09  3.48e-09  1.16e-09
#> [29]  7.41e-09 -5.52e-06  7.29e-06  1.06e-07  1.21e-05 -4.65e-08 -9.49e-10
#> [36] -1.56e-06  3.26e-05  2.06e-09  1.02e-08  5.62e-09 -9.83e-11  1.53e-09
#> [43] -2.06e-09 -2.74e-09  2.06e-07  1.11e-08 -3.21e-06  1.55e-09  1.53e-07
#> [50]  1.38e-09

Polynomial Example

We consider the following trivial (regression) example as there is an explicit solution to compare with:

\[ \begin{align*} \mathcal G &=\{1,\dots, p\} \\ \mathcal G \cap \mathcal P_i &= \emptyset \\ \mathcal P_j \cap \mathcal P_i &= \emptyset, \qquad i\neq j \\ \mathcal I_i &\in \{1,\dots, p\}^{\lvert\mathcal P_i\rvert} \\ f(\vec x) &= (\vec x_{\mathcal G} - \vec\mu_{\mathcal G})^\top (\vec x_{\mathcal G} - \vec\mu_{\mathcal G}) + \sum_{i = 1}^n (\vec x_{\mathcal P_i} - \vec\mu_{\mathcal P_i} - \mat\Psi_i\vec x_{\mathcal I_i})^\top (\vec x_{\mathcal P_i} - \vec\mu_{\mathcal P_i} - \mat\Psi_i\vec x_{\mathcal I_i}) \end{align*} \] This is not because the problem is interesting per se but it is meant as another illustration. R code to simulate from this model is given below:

# simulate the data
set.seed(1)
n_global <- 10L
n_clusters <- 50L

mu_global <- rnorm(n_global)
idx_start <- n_global

cluster_dat <- replicate(n_clusters, {
  n_members <- sample.int(n_global, 1L)
  g_idx <- sort(sample.int(n_global, n_members))
  mu_cluster <- rnorm(n_members)
  Psi <- matrix(rnorm(n_members * n_members), n_members, n_members)
  
  out <- list(idx = idx_start + 1:n_members, g_idx = g_idx,
              mu_cluster = mu_cluster, Psi = Psi)
  idx_start <<- idx_start + n_members
  out
}, simplify = FALSE)

# assign matrices needed for comparisons
library(Matrix)
M <- diag(idx_start)
for(cl in cluster_dat)
  M[cl$idx, cl$g_idx] <- -cl$Psi
M <- as(M, "dgCMatrix")

# Assign two R functions to evaluate the objective function. There are two 
# versions of the function to show that we get the same with one being 
# closer to the shown equation
fn_one <- function(par, ...){
  delta <- par[1:n_global] - mu_global
  out <- drop(delta %*% delta)
  for(cl in cluster_dat){
    delta <- drop(par[cl$idx] - cl$mu_cluster - cl$Psi %*% par[cl$g_idx])
    out <- out + drop(delta %*% delta)
  }
  out
}
fn_two <- function(par, ...){
  mu <- c(mu_global, unlist(sapply(cluster_dat, "[[", "mu_cluster")))
  delta <- drop(M %*% par - mu)
  drop(delta %*% delta)
}

tmp <- rnorm(idx_start)
all.equal(fn_one(tmp), fn_two(tmp)) # we get the same w/ the two
#> [1] TRUE
fn <- fn_two
rm(fn_one, fn_two, tmp)

# assign gradient function
gr <- function(par, ...){
  mu <- c(mu_global, unlist(sapply(cluster_dat, "[[", "mu_cluster")))
  2 * drop(crossprod(M, drop(M %*% par - mu)))
}

# we can easily find the explicit solution
mu <- c(mu_global, unlist(sapply(cluster_dat, "[[", "mu_cluster")))
exp_res <- drop(solve(M, mu))
fn(exp_res) # ~ zero as it should be
#> [1] 2.98e-29

C++ code to work with this function is provided at system.file("poly-ex.cpp", package = "psqn") with the package and given below:

// see `mlogit-ex.cpp` for an example with more comments

// we will use OpenMP to perform the computation in parallel
// [[Rcpp::plugins(openmp, cpp11)]]

// we use RcppArmadillo to simplify the code
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::depends(psqn)]]
#include "psqn.h"
#include "psqn-reporter.h"

using namespace Rcpp;
using PSQN::psqn_uint; // the unsigned integer type used in the package

/// simple function to avoid copying a vector. You can ignore this
inline arma::vec vec_no_cp(double const * x, psqn_uint const n_ele){
  return arma::vec(const_cast<double *>(x), n_ele, false);
}

class poly_func final : public PSQN::element_function {
  /// global parameter indices
  arma::uvec const g_idx;
  /// centroid vector
  arma::vec const mu_cluster;
  /// matrix used to transform subset of global parameters
  arma::mat const Psi;
  /// number of global parameters
  psqn_uint const n_global;
  /// global parameter centroid vector
  arma::vec const mu_global;
  /**
   true if this element function should compute the terms from the global
   paramaters */
  bool const comp_global;

public:
  poly_func(List data, arma::vec const &mu_g, bool const comp_global):
  g_idx     (as<arma::uvec>(data["g_idx"    ]) - 1L),
  mu_cluster(as<arma::vec>(data["mu_cluster"])     ),
  Psi       (as<arma::mat>(data["Psi"       ])     ),
  n_global(mu_g.n_elem),
  mu_global(comp_global ? mu_g : arma::vec() ),
  comp_global(comp_global)
  { }

  psqn_uint global_dim() const {
    return n_global;
  }
  psqn_uint private_dim() const {
    return mu_cluster.n_elem;
  }

  double func(double const *point) const {
    arma::vec const x_glob = vec_no_cp(point           , n_global),
                    x_priv = vec_no_cp(point + n_global, mu_cluster.n_elem),
                     delta = x_priv - Psi * x_glob(g_idx) - mu_cluster;

    // compute the function
    double out(0);
    out += arma::dot(delta, delta);

    if(comp_global)
      out += arma::dot(x_glob - mu_global, x_glob - mu_global);

    return out;
  }

  double grad
  (double const * point, double * gr) const {
    arma::vec const x_glob = vec_no_cp(point           , n_global),
                    x_priv = vec_no_cp(point + n_global, mu_cluster.n_elem),
                     delta = x_priv - Psi * x_glob(g_idx) - mu_cluster;

    // create objects to write to for the gradient
    std::fill(gr, gr + x_glob.n_elem, 0.);
    arma::vec d_glob(gr                , x_glob.n_elem, false),
              d_priv(gr + x_glob.n_elem, x_priv.n_elem, false);

    // compute the function and the gradient
    double out(0);
    out += arma::dot(delta, delta);
    d_glob(g_idx) -= 2 * Psi.t() * delta;
    d_priv         = 2 * delta;

    if(comp_global){
      out += arma::dot(x_glob - mu_global, x_glob - mu_global);
      d_glob += 2. * x_glob;
      d_glob -= 2 * mu_global;
    }

    return out;
  }

  bool thread_safe() const {
    return true;
  }
};

using poly_optim = PSQN::optimizer<poly_func, PSQN::R_reporter,
                                   PSQN::R_interrupter>;

// [[Rcpp::export]]
SEXP get_poly_optimizer(List data, arma::vec const &mu_global,
                        unsigned const max_threads){
  psqn_uint const n_elem_funcs = data.size();
  std::vector<poly_func> funcs;
  funcs.reserve(n_elem_funcs);
  bool comp_global(true);
  for(auto dat : data){
    funcs.emplace_back(List(dat), mu_global, comp_global);
    comp_global = false;
  }

  // create an XPtr to the object we will need
  XPtr<poly_optim>ptr(new poly_optim(funcs, max_threads));

  // return the pointer to be used later
  return ptr;
}

// [[Rcpp::export]]
List optim_poly
  (NumericVector val, SEXP ptr, double const rel_eps, unsigned const max_it,
   unsigned const n_threads, double const c1,
   double const c2, bool const use_bfgs = true, int const trace = 0L,
   double const cg_tol = .5, bool const strong_wolfe = true,
   psqn_uint const max_cg = 0L, int const pre_method = 1L){
  XPtr<poly_optim> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("optim_poly: invalid parameter size");

  NumericVector par = clone(val);
  optim->set_n_threads(n_threads);
  auto res = optim->optim(&par[0], rel_eps, max_it, c1, c2,
                          use_bfgs, trace, cg_tol, strong_wolfe, max_cg,
                          static_cast<PSQN::precondition>(pre_method));
  NumericVector counts = NumericVector::create(
    res.n_eval, res.n_grad,  res.n_cg);
  counts.names() = CharacterVector::create("function", "gradient", "n_cg");

  int const info = static_cast<int>(res.info);
  return List::create(
    _["par"] = par, _["value"] = res.value, _["info"] = info,
      _["counts"] = counts,
      _["convergence"] =  res.info == PSQN::info_code::converged);
}

// [[Rcpp::export]]
double eval_poly(NumericVector val, SEXP ptr, unsigned const n_threads){
  XPtr<poly_optim> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("eval_poly: invalid parameter size");

  optim->set_n_threads(n_threads);
  return optim->eval(&val[0], nullptr, false);
}

// [[Rcpp::export]]
NumericVector grad_poly(NumericVector val, SEXP ptr,
                        unsigned const n_threads){
  XPtr<poly_optim> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("grad_poly: invalid parameter size");

  NumericVector grad(val.size());
  optim->set_n_threads(n_threads);
  grad.attr("value") = optim->eval(&val[0], &grad[0], true);

  return grad;
}

We can Rcpp::sourceCpp the file and use the code like below to find the solution:

library(Rcpp)
sourceCpp(system.file("poly-ex.cpp", package = "psqn"))

# get a pointer to C++ object
optimizer <- get_poly_optimizer(cluster_dat, mu_global = mu_global, 
                                max_threads = 2L)

# we get the same function value and gradient
tmp <- rnorm(idx_start)
all.equal(fn       (tmp), 
          eval_poly(tmp, optimizer, 1L))
#> [1] TRUE
all.equal(gr       (tmp), 
          grad_poly(tmp, optimizer, 1L), 
          check.attributes = FALSE)
#> [1] TRUE

# run the optimization
psqn_func <- function(par, n_threads = 2L, c1 = 1e-4, 
                      c2 = .9, trace = 0L)
  optim_poly(val = par, ptr = optimizer, rel_eps = 1e-8, max_it = 1000L, 
             n_threads = n_threads, c1 = c1, 
             c2 = c2, trace = trace)

psqn_res <- psqn_func(numeric(idx_start))
all.equal(exp_res, psqn_res$par)
#> [1] TRUE

A version using the R function psqn is:

# assign function to pass to psqn
r_func <- function(i, par, comp_grad){
  dat <- cluster_dat[[i]]
  g_idx <- dat$g_idx
  mu_cluster <- dat$mu_cluster
  Psi <- dat$Psi
  
  if(length(par) < 1)
    # requested the dimension of the parameter
    return(c(global_dim = length(mu_global), 
             private_dim = length(mu_cluster)))
  
  is_glob <- 1:length(mu_global)
  x_glob <- par[is_glob]
  x_priv <- par[-is_glob]
  
  delta <- drop(x_priv - Psi %*% x_glob[g_idx] - mu_cluster)
  
  out <- drop(delta %*% delta)
  if(i == 1L){
    delta_glob <- x_glob - mu_global
    out <- out + drop(delta_glob %*% delta_glob)
  }
  
  if(comp_grad){
    grad <- numeric(length(mu_cluster) + length(mu_global))
    grad[g_idx] <- -2 * drop(crossprod(Psi, delta))
    grad[-is_glob] <- 2 * delta
    if(i == 1L)
      grad[is_glob] <- grad[is_glob] + 2 * delta_glob
    attr(out, "grad") <- grad
  }
  
  out
}

# use the function
r_psqn_func <- function(par, n_threads = 2L, c1 = 1e-4, cg_tol = .5,
                        c2 = .9, trace = 0L, pre_method = 1L)
  psqn(par = par, fn = r_func, n_ele_func = n_clusters, 
       n_threads = n_threads, c1 = c1, c2 = c2, cg_tol = cg_tol,
       trace = trace, max_it = 1000L, pre_method = pre_method)

R_res <- r_psqn_func(numeric(idx_start))
all.equal(exp_res, R_res$par)
#> [1] TRUE

# with the Cholesky factorizations in the diagonal
R_res_chol_diag <- r_psqn_func(numeric(idx_start), pre_method = 3L)
all.equal(exp_res, R_res_chol_diag$par)
#> [1] TRUE

# some reduction in the number of conjugate gradient steps
R_res$counts
#> function gradient     n_cg 
#>      128      127     1284
R_res_chol_diag$counts
#> function gradient     n_cg 
#>      140      139      712

Generic Example

We will provide a toy example of a problem which is partially separable but which does not have the same structure as the problems we have shown before. The problem we consider is:

\[ \begin{align*} f(\vec x) &= \sum_{i = 1}^n -y_i \sum_{j = 1}^{L_i} x_{k_{ij}} +\exp\left(\sum_{j = 1}^{L_i} x_{k_{ij}}\right) , \qquad \vec x\in\mathbb R^K \\ \mathcal K_i &= \{k_{i1} , \dots, k_{iL_i}\} \subseteq \{1, \dots, K\}. \end{align*} \] This is a special kind of a GLM with a Poison model with the log link. While there are other ways to estimate this model, we will mainly compare the BFGS implementation from optim with the psqn package. For some $j\neq i$, we will have that $\mathcal K_i \cap \mathcal K_j \neq \emptyset$ without much structure in their intersection unlike before.

There is a class called optimizer_generic provided by the psqn package which can work with more general partially separable problems like the one above. This though yields some additional computational overhead. A C++ implementation to work with the function stated above using the optimizer_generic class is in the generic_example.cpp file which is shown below:

// see `mlogit-ex.cpp` for an example with more comments

// we will use OpenMP to perform the computation in parallel
// [[Rcpp::plugins(openmp, cpp11)]]

// we change the unsigned integer type that is used by the package by defining
// the PSQN_SIZE_T macro variable
#define PSQN_SIZE_T unsigned int

// we want to use the incomplete Cholesky factorization as the preconditioner
// and therefore with need RcppEigen
#define PSQN_USE_EIGEN
// [[Rcpp::depends(RcppEigen)]]

// [[Rcpp::depends(psqn)]]
#include "psqn-Rcpp-wrapper.h"
#include "psqn-reporter.h"
#include "psqn.h"
using namespace Rcpp;
using PSQN::psqn_uint; // the unsigned integer type used in the package

class generic_example final : public PSQN::element_function_generic {
  /// number of argument to this element function;
  psqn_uint const n_args_val;
  /// indices of the element function parameters
  std::unique_ptr<psqn_uint[]> indices_array;
  /// y point
  double const y;

public:
  generic_example(List data):
  n_args_val(as<IntegerVector>(data["indices"]).size()),
  indices_array(([&]() -> std::unique_ptr<psqn_uint[]> {
    IntegerVector indices = as<IntegerVector>(data["indices"]);
    std::unique_ptr<psqn_uint[]> out(new psqn_uint[n_args_val]);
    for(psqn_uint i = 0; i < n_args_val; ++i)
      out[i] = indices[i];
    return out;
  })()),
  y(as<double>(data["y"]))
  { }

  // we need to make a copy constructor because of the unique_ptr
  generic_example(generic_example const &other):
  n_args_val(other.n_args_val),
  indices_array(([&]() -> std::unique_ptr<psqn_uint[]> {
    std::unique_ptr<psqn_uint[]> out(new psqn_uint[n_args_val]);
    for(psqn_uint i = 0; i < n_args_val; ++i)
      out[i] = other.indices_array[i];
    return out;
  })()),
  y(other.y) { }

  /**
   returns the number of parameters that this element function is depending on.
   */
  psqn_uint n_args() const {
    return n_args_val;
  }

  /**
   zero-based indices to the parameters that this element function is depending
   on.
   */
  psqn_uint const * indices() const {
    return indices_array.get();
  }

  double func(double const * point) const {
    double sum(0.);
    for(psqn_uint i = 0; i < n_args_val; ++i)
      sum += point[i];
    return -y * sum + std::exp(sum);
  }

  double grad
  (double const * point, double * gr) const {
    double sum(0.);
    for(psqn_uint i = 0; i < n_args_val; ++i)
      sum += point[i];
    double const exp_sum = std::exp(sum),
                    fact = -y + exp_sum;
    for(psqn_uint i = 0; i < n_args_val; ++i)
      gr[i] = fact;

    return -y * sum + std::exp(sum);
  }

  bool thread_safe() const {
    return true;
  }
};

using generic_opt =
  PSQN::optimizer_generic<generic_example, PSQN::R_reporter,
                          PSQN::R_interrupter>;

// [[Rcpp::export]]
SEXP get_generic_ex_obj(List data, unsigned const max_threads){
  psqn_uint const n_elem_funcs = data.size();
  std::vector<generic_example> funcs;
  funcs.reserve(n_elem_funcs);
  for(auto dat : data)
    funcs.emplace_back(List(dat));

  // create an XPtr to the object we will need
  XPtr<generic_opt>ptr(new generic_opt(funcs, max_threads));

  // return the pointer to be used later
  return ptr;
}

// [[Rcpp::export]]
List optim_generic_ex
  (NumericVector val, SEXP ptr, double const rel_eps, unsigned const max_it,
   unsigned const n_threads, double const c1,
   double const c2, bool const use_bfgs = true, int const trace = 0L,
   double const cg_tol = .5, bool const strong_wolfe = true,
   psqn_uint const max_cg = 0L, int const pre_method = 1L,
   double const gr_tol = -1){
  XPtr<generic_opt> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("optim_generic_ex: invalid parameter size");

  NumericVector par = clone(val);
  optim->set_n_threads(n_threads);
  auto res = optim->optim(&par[0], rel_eps, max_it, c1, c2,
                          use_bfgs, trace, cg_tol, strong_wolfe, max_cg,
                          static_cast<PSQN::precondition>(pre_method),
                          gr_tol);
  NumericVector counts = NumericVector::create(
    res.n_eval, res.n_grad,  res.n_cg);
  counts.names() = CharacterVector::create("function", "gradient", "n_cg");

  int const info = static_cast<int>(res.info);
  return List::create(
    _["par"] = par, _["value"] = res.value, _["info"] = info,
      _["counts"] = counts,
      _["convergence"] =  res.info == PSQN::info_code::converged);
}

// [[Rcpp::export]]
double eval_generic_ex(NumericVector val, SEXP ptr, unsigned const n_threads){
  XPtr<generic_opt> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("eval_generic_ex: invalid parameter size");

  optim->set_n_threads(n_threads);
  return optim->eval(&val[0], nullptr, false);
}

// [[Rcpp::export]]
NumericVector grad_generic_ex(NumericVector val, SEXP ptr,
                              unsigned const n_threads){
  XPtr<generic_opt> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("grad_generic_ex: invalid parameter size");

  NumericVector grad(val.size());
  optim->set_n_threads(n_threads);
  grad.attr("value") = optim->eval(&val[0], &grad[0], true);

  return grad;
}

// [[Rcpp::export]]
NumericMatrix get_Hess_approx_generic(SEXP ptr){
  XPtr<generic_opt> optim(ptr);

  NumericMatrix out(optim->n_par, optim->n_par);
  optim->get_hess(&out[0]);

  return out;
}

// [[Rcpp::export]]
Eigen::SparseMatrix<double> get_sparse_Hess_approx_generic(SEXP ptr){
  return XPtr<generic_opt>(ptr)->get_hess_sparse();
}

// [[Rcpp::export]]
Eigen::SparseMatrix<double> true_hess_sparse
  (SEXP ptr, NumericVector val, double const eps = 0.001, double const scale = 2,
   double const tol = 0.000000001, unsigned const order = 6){

  XPtr<generic_opt> optim(ptr);

  // check that we pass a parameter value of the right length
  if(optim->n_par != static_cast<psqn_uint>(val.size()))
    throw std::invalid_argument("true_hess_sparse: invalid parameter size");

  return optim->true_hess_sparse(&val[0], eps, scale, tol, order);
}

// [[Rcpp::export]]
void set_masked(SEXP ptr, Rcpp::IntegerVector indices){
  XPtr<generic_opt>(ptr)->set_masked(indices.begin(), indices.end());
}

// [[Rcpp::export]]
void clear_masked(SEXP ptr){
  XPtr<generic_opt>(ptr)->clear_masked();
}

The required member functions are very similar to those needed for the optimizer class. We now simulate some data to work with this type of model as an example:

# parameters for the simulation
set.seed(1)
K <- 2000L
n <- 5L * K

# simulate the data
truth_limit <- runif(K, -1, 1)
dat <- replicate(
  n, {
    # sample the indices
    n_samp <- sample.int(5L, 1L) + 1L
    indices <- sort(sample.int(K, n_samp))

    # sample the outcome, y, and return    
    list(y = rpois(1, exp(sum(truth_limit[indices]))), 
         indices = indices)
  }, simplify = FALSE)

# we need each variable to be present at least once
stopifnot(length(unique(unlist(
  lapply(dat, `[`, "indices")
))) == K) # otherwise we need to change the code

We can minimize this problem using the following R code:

# define a version in R to compute the function and its gradient
R_func <- function(x){
  out <- vapply(dat, function(z){
    eta <- sum(x[z$indices])
    -z$y * eta + exp(eta)
  }, 0.)
  sum(out)
}

R_func_gr <- function(x){
  out <- numeric(length(x))
  for(z in dat){
    idx_i <- z$indices 
    eta <- sum(x[idx_i])
    out[idx_i] <- out[idx_i] -z$y + exp(eta) 
  }
  out
}

# find the optimum
set.seed(1)
start <- truth_limit + rnorm(K, sd = .1)
# all.equal(numDeriv::grad(R_func, start), R_func_gr(start))
opt <- optim(start, R_func, R_func_gr, method = "BFGS", 
             control = list(maxit = 1000L))
opt$value # optimal solution
#> [1] -8621

The model can also be estimated using glm:

# create the design matrix
X <- t(vapply(dat, function(x){
  out <- numeric(K)
  out[x$indices] <- 1
  out
}, numeric(K)))
y <- unlist(sapply(dat, `[[`, "y"))

# this is very slow...
system.time(glm_fit <- glm.fit(x = X, y = y, family = poisson(), 
                               start = start))
#>    user  system elapsed 
#> 152.222   0.589 152.924

# we get (roughly) the same
all.equal(glm_fit$coefficients, opt$par)
#> [1] "Mean relative difference: 9.95e-05"
R_func(glm_fit$coefficients)
#> [1] -8621
glm_fit <- glm_fit[c("coefficients", "deviance")]

# the objects are quite big so we remove them 
rm(X, y)

The above could possibly be done much faster if sparse matrices where used in glm.fit. The problem can also be solved with the C++ implementation using the following code:

# source the C++ version used in this package
library(Rcpp)
sourceCpp(system.file("generic_example.cpp", package = "psqn"))

# create the list to pass to C++
cpp_arg <- lapply(dat, function(x){
  x$indices <- x$indices - 1L # C++ needs zero-based indices
  x
})
ptr <- get_generic_ex_obj(cpp_arg, max_threads = 4L)

# check that we get the same
noise <- rnorm(K)
all.equal(eval_generic_ex(noise, ptr = ptr, n_threads = 1L),
          R_func         (noise))
#> [1] TRUE
all.equal(gv <- grad_generic_ex(noise, ptr = ptr, n_threads = 1L),
          R_func_gr            (noise), 
          check.attributes = FALSE)
#> [1] TRUE
all.equal(attr(gv, "value"), R_func(noise))
#> [1] TRUE

# also gives the same with two threads
all.equal(eval_generic_ex(noise, ptr = ptr, n_threads = 2L),
          R_func         (noise))
#> [1] TRUE
all.equal(gv <- grad_generic_ex(noise, ptr = ptr, n_threads = 2L),
          R_func_gr            (noise), 
          check.attributes = FALSE)
#> [1] TRUE
all.equal(attr(gv, "value"), R_func(noise))
#> [1] TRUE

# optimize and compare the result with one thread
psqn_func <- function(par, n_threads = 1L, c1 = 1e-4, c2 = .1, trace = 0L, 
                      pre_method = 1L, cg_tol = .5)
  optim_generic_ex(val = par, ptr = ptr, rel_eps = 1e-9, max_it = 1000L, 
                   n_threads = n_threads, c1 = c1, c2 = c2, trace = trace, 
                   cg_tol = cg_tol, pre_method = pre_method)

opt_psqn <- psqn_func(start)
all.equal(opt_psqn$par, opt$par, tolerance = 1e-3)
#> [1] TRUE
all.equal(opt_psqn$value, opt$value)
#> [1] TRUE
opt_psqn$value - opt$value # (negative values implies a better solution)
#> [1] -0.000101

all.equal(opt_psqn$par, glm_fit$coefficients, tolerance = 1e-3)
#> [1] TRUE
all.equal(opt_psqn$value, R_func(glm_fit$coefficients))
#> [1] TRUE

# compare counts
opt_psqn$counts
#> function gradient     n_cg 
#>       96       51       68
opt     $counts
#> function gradient 
#>      425      103

# we can do the same with two threads
opt_psqn <- psqn_func(start, n_threads = 2L)
all.equal(opt_psqn$par, opt$par, tolerance = 1e-3)
#> [1] TRUE
all.equal(opt_psqn$value, opt$value)
#> [1] TRUE
opt_psqn$value - opt$value # (negative values implies a better solution)
#> [1] -0.000101

# compare counts
opt_psqn$counts
#> function gradient     n_cg 
#>       96       51       68
opt     $counts
#> function gradient 
#>      425      103

The C++ version is much faster:

bench::mark(
  `R version` = optim(start, R_func, R_func_gr, method = "BFGS"), 
  `R w/ C++` = optim(
    start, 
    function(x) eval_generic_ex(x, ptr = ptr, n_threads = 1L), 
    function(x) grad_generic_ex(x, ptr = ptr, n_threads = 1L), 
    method = "BFGS"), 
  psqn = psqn_func(start, n_threads = 1L), 
  `psqn 2 threads` = psqn_func(start, n_threads = 2L), 
  min_iterations = 2L, check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 R version         8.37s    8.42s     0.119    55.1MB     5.94
#> 2 R w/ C++          2.01s    2.09s     0.478      26MB     0   
#> 3 psqn            68.98ms  69.47ms    14.4      18.2KB     0   
#> 4 psqn 2 threads  37.43ms  37.89ms    26.2      18.2KB     0

Generic Example: R Interface

A R interface to the optimizer_generic class is provided through the psqn_generic function. We show an example below of how the R interface can be used to solve the same problem we had before:

# assign the function to pass to psqn_generic
r_func <- function(i, par, comp_grad){
  z <- dat[[i]]
  if(length(par) == 0L)
    # we need to return the indices of the parameters that this element function
    # depends on. These need to be one-based like in R
    return(z$indices)
  
  # we need to compute the element function and possibly its gradient
  eta <- sum(par)
  exp_eta <- exp(eta)
  out <- -z$y * eta + exp_eta
  if(comp_grad)
    attr(out, "grad") <- rep(-z$y + exp_eta, length(z$indices))
  out
}

# estimate the model with the R interface
psqn_func_R <- function(par, c1 = 1e-4, c2 = .1, trace = 0L, pre_method = 1L)
  psqn_generic(par = par, fn = r_func, n_ele_func = length(dat), 
               c1 = c1, c2 = c2, trace = trace, rel_eps = 1e-9, max_it = 1000L, 
               pre_method = pre_method)

opt_psqn_res <- psqn_func_R(start)

# we get the same 
opt_psqn <- psqn_func(start, n_threads = 1L)
all.equal(opt_psqn_res$par  , opt_psqn$par  )
#> [1] TRUE
all.equal(opt_psqn_res$value, opt_psqn$value)
#> [1] TRUE

The R version is much slower though in this case. The reason is that the element functions are extremely cheap computationally to evaluate and therefore the extra overhead from the R interface is an issue.

# show the computation time
bench::mark(psqn_func_R(start), min_iterations = 2L)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 × 6
#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 psqn_func_R(start)    3.08s    3.09s     0.323    29.1KB     14.1

Getting the Hessian Approximation

We can get the Hessian approximation by calling the get_Hess_approx_generic and get_sparse_Hess_approx_generic we declared after calling the optimizer:

aprox_hes <- get_Hess_approx_generic(ptr)
dim(aprox_hes) # quite large; requires a lot of memory
#> [1] 2000 2000

# we can also get the sparse version
aprox_hes_sparse <- get_sparse_Hess_approx_generic(ptr)
all.equal(as.matrix(aprox_hes_sparse), aprox_hes, 
          check.attributes = FALSE)
#> [1] TRUE

# this require much less memory 
object.size(aprox_hes)
#> 32000216 bytes
object.size(aprox_hes_sparse)
#> 1717344 bytes

# we can roughly check against the true values as follows
if(FALSE){
  # only feasible for smaller problem
  library(numDeriv)
  hess_true <- jacobian(
    function(par) grad_generic_ex(val = par, ptr = ptr, n_threads = 2L), 
    opt_psqn$par)
  
  # should not hold exactly! Might not be that good of an approximation.
  all.equal(aprox_hes, hess_true)
  
  # the non-zero entries should match 
  v1 <- abs(hess_true) > .Machine$double.eps * 10
  v2 <- abs(aprox_hes) > .Machine$double.eps * 10
  all.equal(v1, v2)
}

# create a plot like before. Black entries are non-zero
par(mar = c(.5, .5, .5, .5))
idx <- 1:min(1000, NROW(aprox_hes))
aprox_hes <- aprox_hes[idx, idx] # reduce dimension to plot quickly
image(abs(aprox_hes[, NCOL(aprox_hes):1]) > 0, xaxt = "n", yaxt = "n",
      col = gray.colors(2L, 1, 0))

Getting the True Hessian

We can also get the true Hessian as before using numerical differentiation.

# compute the hessian with the package
system.time(hess <- true_hess_sparse(val = opt_psqn$par, ptr = ptr))
#>    user  system elapsed 
#>   0.011   0.000   0.011

# we can also compute it from R
system.time(hess_R <- psqn_generic_hess(
  val = opt_psqn$par, fn = r_func, n_ele_func = length(dat)))
#>    user  system elapsed 
#>   0.917   0.000   0.917

# compare with numerical differentiation from R. We only check the first 
# elements
n_comp <- 600L
system.time(
  hess_num_deriv <- jacobian(
      function(x) {
        par <- opt_psqn$par
        par[1:n_comp] <- x
        grad_generic_ex(val = par, ptr = ptr, n_threads = 2L)[1:n_comp]
      }, 
      head(opt_psqn$par, n_comp)))
#>    user  system elapsed 
#>   1.946   0.012   0.986

# we got the same
all.equal(hess, hess_R)
#> [1] TRUE
all.equal(Matrix::Matrix(hess_num_deriv, sparse = TRUE),
          hess[1:n_comp, 1:n_comp])
#> [1] TRUE

Using Different Preconditioners

We can use different preconditioner. We illustrate this below. Notice that you have to define the PSQN_USE_EIGEN macro variable prior to including any of the psqn headers files if you are using the C++ interface in order to use the incomplete Cholesky factorization from Eigen. You will also have to include RcppEigen or the psqn-Rcpp-wrapper.h header.

# without any preconditioner
opt_psqn_no_pre <- psqn_func(start, pre_method = 0L)

# we get the same
all.equal(opt_psqn$value, opt_psqn_no_pre$value)
#> [1] TRUE

# we mainly use more conjugate gradient steps
opt_psqn       $counts
#> function gradient     n_cg 
#>       96       51       68
opt_psqn_no_pre$counts
#> function gradient     n_cg 
#>       94       48      140

# with the incomplete Cholesky factorization
opt_psqn_cholesky <- psqn_func(start, pre_method = 2L)
all.equal(opt_psqn$value, opt_psqn_cholesky$value)
#> [1] TRUE

# we use fewer conjugate gradient steps
opt_psqn_cholesky$counts
#> function gradient     n_cg 
#>       92       49       29
opt_psqn         $counts
#> function gradient     n_cg 
#>       96       51       68

# we can equally use the R interface
opt_psqn_cholesky_R <- psqn_func_R(start, pre_method = 2L)
all.equal(opt_psqn_cholesky_R[c("par", "value", "counts")], 
          opt_psqn_cholesky  [c("par", "value", "counts")])
#> [1] TRUE

Masking Parameters

We can mask (fix) the parameters using both the R and C++ interface as shown below.

# set the parameters to mask
par_mask <- numeric(length(opt_psqn_res$par))
set.seed(1)
idx_mask <- sample.int(length(par_mask), 100L) # random indices we mask
# add some noise to the parameters
par_mask[idx_mask] <- par_mask[idx_mask] + rnorm(length(idx_mask), sd = .1)

# use the R interface
mask_psqn_res <- psqn_generic(
  par = par_mask, fn = r_func, n_ele_func = length(dat), c1 = 1e-4, c2 = .1, 
  trace = 0L, rel_eps = 1e-9, max_it = 1000L, pre_method = 1L,
  mask = idx_mask - 1L) # -1L to be zero based

# we can do the same with optim
opt <- optim(
  par_mask[-idx_mask], 
  function(x) { par_mask[-idx_mask] <- x; R_func   (par_mask) }, 
  function(x) { par_mask[-idx_mask] <- x; R_func_gr(par_mask)[-idx_mask] }, 
  method = "BFGS", control = list(maxit = 1000L))

# we got the same
all.equal(mask_psqn_res$par[-idx_mask], opt$par)
#> [1] "Mean relative difference: 0.000367"
mask_psqn_res$value - opt$value # negative values is a better solution
#> [1] -0.00132

# the fixed parameters are the same
all.equal(mask_psqn_res$par[idx_mask], par_mask[idx_mask])
#> [1] TRUE

# we can also do this from C++
set_masked(ptr, idx_mask - 1L)
mask_psqn_res_cpp <- psqn_func(par_mask)
clear_masked(ptr) # remember to clear!

# we got the same
all.equal(mask_psqn_res_cpp$par, mask_psqn_res$par)
#> [1] TRUE

Using Kahan Summation Algorithm

By default, Kahan summation algorithm is used with the optimizer_generic class. This can be avoided by defining PSQN_NO_USE_KAHAN prior to including any headers from the psqn package. One may want to do so if numerical stability does not matter for a given problem. Notice though that the extra computation time may only be substantial if the func and grad member functions are evaluated very fast.

To illustrate this, we will show that the previous method yields almost the same regardless of the number of threads. Then we will show that the difference is larger when we define PSQN_NO_USE_KAHAN but the computation time is reduced.

# we get almost the same regardless of the number of threads. We show this by 
# looking at the mean relative error of the gradient
g1 <- grad_generic_ex(noise, ptr = ptr, n_threads = 1L)
g2 <- grad_generic_ex(noise, ptr = ptr, n_threads = 2L)
g3 <- grad_generic_ex(noise, ptr = ptr, n_threads = 3L)

mean(abs((g1 - g2) / g1)) 
#> [1] 1.3e-16
mean(abs((g1 - g3) / g1))
#> [1] 1.23e-16

# the computation time is as follows
bench::mark(
  `1 thread`  = grad_generic_ex(noise, ptr = ptr, n_threads = 1L), 
  `2 threads` = grad_generic_ex(noise, ptr = ptr, n_threads = 2L), 
  `3 threads` = grad_generic_ex(noise, ptr = ptr, n_threads = 3L))
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 1 thread      364µs    372µs     2587.    18.2KB        0
#> 2 2 threads     168µs    174µs     5588.    18.2KB        0
#> 3 3 threads     120µs    127µs     7743.    18.2KB        0

# next, we compile the file but having defined PSQN_NO_USE_KAHAN in the 
# beginning
tmp_file <- paste0(tempfile(), ".cpp")
tmp_file_con <- file(tmp_file)
writeLines(
  # add the macro definition to the beginning of the file
  c("#define PSQN_NO_USE_KAHAN", 
    readLines(system.file("generic_example.cpp", package = "psqn"))), 
  tmp_file_con)
close(tmp_file_con)
sourceCpp(tmp_file) # source the file again

# re-compute the gradient and the mean relative error of the gradient
ptr <- get_generic_ex_obj(cpp_arg, max_threads = 4L)
new_g1 <- grad_generic_ex(noise, ptr = ptr, n_threads = 1L)
new_g2 <- grad_generic_ex(noise, ptr = ptr, n_threads = 2L)
new_g3 <- grad_generic_ex(noise, ptr = ptr, n_threads = 3L)

mean(abs((g1     - new_g1) / g1    ))
#> [1] 3.9e-16
mean(abs((new_g1 - new_g2) / new_g1)) 
#> [1] 2.49e-16
mean(abs((new_g1 - new_g3) / new_g1))
#> [1] 2.76e-16

# check the new computation time
bench::mark(
  `1 thread`  = grad_generic_ex(noise, ptr = ptr, n_threads = 1L), 
  `2 threads` = grad_generic_ex(noise, ptr = ptr, n_threads = 2L), 
  `3 threads` = grad_generic_ex(noise, ptr = ptr, n_threads = 3L))
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 1 thread      351µs    356µs     2784.    18.2KB     0   
#> 2 2 threads     159µs    167µs     5821.    18.2KB     0   
#> 3 3 threads     113µs    120µs     8114.    18.2KB     2.02

The error is only slightly larger in the latter case in this example and the computation time is reduced because the func and grad member functions are cheap to evaluate computationally.

Details

Using the Code in a Package

The main part of this packages is a header-only library. Thus, the code can be used within a R package by adding psqn to LinkingTo in the DESCRIPTION file. This is an advantage as one can avoid repeated compilation of the code.

Moreover, since the main part of the code is a header-only library, this package can easily be used within languages which can easily call C++ code.

BFGS Method

There is also a BFGS implementation in the package. This is both available in R through the psqn_bfgs function and in C++ in the psqn-bfgs.h header file. An example is provided below using the example from optim:

# declare function and gradient from the example from help(optim)
fn <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
  100 * (x2 - x1 * x1)^2 + (1 - x1)^2
}
gr <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
  c(-400 * x1 * (x2 - x1 * x1) - 2 * (1 - x1),
     200 *      (x2 - x1 * x1))
}

# we need a different function for the method in this package
gr_psqn <- function(x) {
  x1 <- x[1]
  x2 <- x[2]
  out <- c(-400 * x1 * (x2 - x1 * x1) - 2 * (1 - x1),
            200 *      (x2 - x1 * x1))
  attr(out, "value") <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
  out
}

# we get the same
optim    (c(-1.2, 1), fn, gr, method = "BFGS")$par
#> [1] 1 1
psqn_bfgs(c(-1.2, 1), fn, gr_psqn)            $par
#> [1] 1 1

# they run in about the same time
bench::mark(
  optim     = optim    (c(-1.2, 1), fn, gr, method = "BFGS"), 
  psqn_bfgs = psqn_bfgs(c(-1.2, 1), fn, gr_psqn), 
  check = FALSE, min_time = .5)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 optim       135.7µs    148µs     6592.      384B     10.3
#> 2 psqn_bfgs    98.2µs    106µs     9098.    2.49KB     12.6

BFGS and Partially Separable Quasi-Newton

Below we show the ratio of flops required in the matrix-vector product in a BFGS method relative to the flops required in the matrix-vector product for the conjugate gradient method for the quasi-Newton method:

vals <- expand.grid(n = 2^(8:13), p = 2^(2:4), q = 2^(2:8))
vals <- within(vals, {
  flops_qsn <- 2L * n * (p + q) * (p + q + 1L)
  flops_bfgs <- 2L * (q * n + p)^2
  ratio <- flops_bfgs / flops_qsn
})
nq <- length(unique(vals$q))
tvals <- c(vals$n[seq_len(NROW(vals) / nq)], 
           vals$p[seq_len(NROW(vals) / nq)], floor(vals[, "ratio"]))

vals <- matrix(
  tvals, ncol = nq + 2L, dimnames = list(
    NULL, c("n", "p/q", unique(vals$q))))
knitr::kable(vals)

n	p/q	4	8	16	32	64	128	256
256	4	57	105	156	196	223	238	247
512	4	114	210	312	393	447	477	494
1024	4	228	420	624	787	894	955	988
2048	4	455	840	1248	1574	1787	1911	1977
4096	4	910	1680	2496	3149	3575	3822	3955
8192	4	1820	3361	4993	6297	7151	7645	7911
256	8	26	60	109	160	199	225	239
512	8	52	120	218	320	399	450	479
1024	8	105	241	437	639	798	900	959
2048	8	210	482	874	1279	1596	1801	1918
4096	8	420	964	1748	2557	3192	3601	3837
8192	8	840	1928	3495	5115	6384	7203	7674
256	16	10	27	62	111	162	201	226
512	16	19	55	124	223	323	401	451
1024	16	39	109	248	446	647	803	903
2048	16	78	218	496	892	1294	1607	1807
4096	16	156	437	993	1783	2589	3214	3615
8192	16	312	874	1986	3567	5178	6428	7230

The

\[\frac{\bar q^2}{p^2 + 2p\bar q + \bar q^2}\]

ratio from the section called Conjugate Gradient Method is shown below:

vals <- expand.grid(p = 2^(2:8), q = 2^(2:10))
vals <- within(vals, {
  ratio <- q^2 / (p^2  + 2 * p * q + q^2)
})
tvals <- c(unique(vals$p), vals[, "ratio"])

vals <- matrix(
  tvals, ncol = length(unique(vals$q)) + 1L, dimnames = list(
    NULL, c("p/q", unique(vals$q))))
knitr::kable(vals, digits = 4)

p/q	4	8	16	32	64	128	256	512	1024
4	0.2500	0.4444	0.6400	0.7901	0.886	0.940	0.970	0.985	0.992
8	0.1111	0.2500	0.4444	0.6400	0.790	0.886	0.940	0.970	0.985
16	0.0400	0.1111	0.2500	0.4444	0.640	0.790	0.886	0.940	0.970
32	0.0123	0.0400	0.1111	0.2500	0.444	0.640	0.790	0.886	0.940
64	0.0035	0.0123	0.0400	0.1111	0.250	0.444	0.640	0.790	0.886
128	0.0009	0.0035	0.0123	0.0400	0.111	0.250	0.444	0.640	0.790
256	0.0002	0.0009	0.0035	0.0123	0.040	0.111	0.250	0.444	0.640

We can get rid of the $p^2$ term which gives us:

vals <- expand.grid(p = 2^(2:8), q = 2^(2:10))
vals <- within(vals, {
  ratio <- q^2 / (2 * p * q + q^2)
})
tvals <- c(unique(vals$p), vals[, "ratio"])

vals <- matrix(
  tvals, ncol = length(unique(vals$q)) + 1L, dimnames = list(
    NULL, c("p/q", unique(vals$q))))
knitr::kable(vals, digits = 4)

p/q	4	8	16	32	64	128	256	512	1024
4	0.3333	0.5000	0.6667	0.8000	0.889	0.941	0.970	0.985	0.992
8	0.2000	0.3333	0.5000	0.6667	0.800	0.889	0.941	0.970	0.985
16	0.1111	0.2000	0.3333	0.5000	0.667	0.800	0.889	0.941	0.970
32	0.0588	0.1111	0.2000	0.3333	0.500	0.667	0.800	0.889	0.941
64	0.0303	0.0588	0.1111	0.2000	0.333	0.500	0.667	0.800	0.889
128	0.0154	0.0303	0.0588	0.1111	0.200	0.333	0.500	0.667	0.800
256	0.0078	0.0154	0.0303	0.0588	0.111	0.200	0.333	0.500	0.667

This is implemented.

psqn: Partially Separable Quasi-Newton

Example

Variational Approximations

Quasi-Newton Method for Partially Separable Functions

Using Partial Separability

Conjugate Gradient Method

Line Search and Wolfe Condition

Symmetric Rank-one Updates

Example Using the Implementation

R Implementation

C++ Implementation

Getting the Hessian Approximation

Getting the True Hessian

Using Different Preconditioners

Benchmarking

Masking (Fixing) Parameters

Non-linear Equality Constraints

Polynomial Example

Generic Example

Generic Example: R Interface

Getting the Hessian Approximation

Getting the True Hessian

Using Different Preconditioners

Masking Parameters

Using Kahan Summation Algorithm

Details

Using the Code in a Package

BFGS Method

BFGS and Partially Separable Quasi-Newton

References