Whittaker-Henderson (WH) smoothing is a gradation method aimed at correcting the effect of sampling fluctuations on an observation vector. It is applied to evenly-spaced discrete observations. Initially proposed by Whittaker (1922) for constructing mortality tables and further developed by the works of Henderson (1924), it remains one of the most popular methods among actuaries for constructing experience tables in life insurance. Extending to two-dimensional tables, it can be used for studying various risks, including but not limited to: mortality, disability, long-term care, lapse, mortgage default, and unemployment.
Let \(\mathbf{y}\) be a vector of observations and \(\mathbf{w}\) a vector of positive weights, both of size \(n\). The estimator associated with Whittaker-Henderson smoothing is given by:
\[ \hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}}\{F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) + R_{\lambda,q}(\boldsymbol{\theta})\} \]
where:
\(F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = \underset{i = 1}{\overset{n}{\sum}} w_i(y_i - \theta_i)^2\) represents a fidelity criterion to the observations,
\(R_{\lambda,q}(\boldsymbol{\theta}) = \lambda \underset{i = 1}{\overset{n - q}{\sum}} (\Delta^q\boldsymbol{\theta})_i^2\) represents a smoothness criterion.
In the latter expression, \(\Delta^q\) denotes the forward difference operator of order \(q\), such that for any \(i\in[1,n - q]\):
\[ (\Delta^q\boldsymbol{\theta})_i = \underset{k = 0}{\overset{q}{\sum}} \begin{pmatrix}q \\ k\end{pmatrix}(- 1)^{q - k} \theta_{i + k}. \]
Let us define \(W = \text{Diag}(\mathbf{w})\), the diagonal matrix of weights, and \(D_{n,q}\) as the order \(q\) difference matrix of dimensions \((n-q) \times n\), such that \((D_{n,q}\boldsymbol{\theta})_i = (\Delta^q\boldsymbol{\theta})_i\) for all \(i \in [1, n-q]\). The most commonly used difference matrices of order 1 and 2 have the following forms:
\[ D_{n,1} = \begin{bmatrix} -1 & 1 & 0 & \ldots & 0 \\ 0 & -1 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \ldots & 0 & -1 & 1 \\ \end{bmatrix} \quad\text{and}\quad D_{n,2} = \begin{bmatrix} 1 & -2 & 1 & 0 & \ldots & 0 \\ 0 & 1 & -2 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \ldots & 0 & 1 & -2 & 1 \\ \end{bmatrix}. \]
The fidelity and smoothness criteria can be rewritten in matrix form as:
\[ F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \quad \text{and} \quad R_{\lambda,q}(\boldsymbol{\theta}) = \lambda\boldsymbol{\theta}^TD_{n,q}^TD_{n,q}\boldsymbol{\theta}. \]
The associated estimator for smoothing becomes:
\[ \hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \left\lbrace(\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right\rbrace \]
where \(P_{\lambda} = \lambda D_{n,q}^TD_{n,q}\).
In the bidimensional case, let us consider a matrix \(Y\) of observations and a matrix \(\Omega\) of non-negative weights, both of dimensions \(n_x \times n_z\). The estimator associated with the Whittaker-Henderson smoothing can be written as:
\[ \widehat{Y} = \underset{\Theta}{\text{argmin}}\{F(Y,\Omega, \Theta) + R_{\lambda,q}(\Theta)\} \]
where:
\(F(Y,\Omega, \Theta) = \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z} \Omega_{i,j}(Y_{i,j} - \Theta_{i,j})^2\) represents a fidelity criterion to the observations,
\(R_{\lambda,q}(\Theta) = \lambda_x \sum_{j = 1}^{n_z}\sum_{i = 1}^{n_x - q_x} (\Delta^{q_x}\Theta_{\bullet,j})_i^2 + \lambda_z \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z - q_z} (\Delta^{q_z}\Theta_{i,\bullet})_j^2\) is a smoothness criterion.
This latter criterion can be written as the sum of two one-dimensional regularization criteria, with orders \(q_x\) and \(q_z\), applied respectively to all rows and all columns of \(\Theta\). It is also possible to adopt matrix notations by defining \(\mathbf{y} = \textbf{vec}(Y)\), \(\mathbf{w} = \textbf{vec}(\Omega)\), and \(\boldsymbol{\theta} = \textbf{vec}(\Theta)\) as the vectors obtained by stacking the columns of the matrices \(Y\), \(\Omega\), and \(\Theta\), respectively. Additionally, let us denote \(W = \text{Diag}(\mathbf{w})\) and \(n = n_x \times n_z\). The fidelity and smoothness criteria can then be rewritten as linear combinations of the vectors \(\mathbf{y}\), \(\mathbf{w}\), and \(\boldsymbol{\theta}\):
\[ \begin{aligned} F(\mathbf{y},\mathbf{w}, \boldsymbol{\theta}) &= (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \\ R_{\lambda,q}(\boldsymbol{\theta}) &= \boldsymbol{\theta}^{T}(\lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}) \boldsymbol{\theta}. \end{aligned} \]
and the estimator associated with the smoothing takes the same form as in the one-dimensional case:
\[ \hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \left\lbrace(\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right\rbrace \]
except in this case \(P_{\lambda} = \lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}\).
Whittaker-Henderson estimator has an explicit solution:
\[\hat{\mathbf{y}} = (W + P_{\lambda})^{-1}W\mathbf{y}.\]
Indeed, as a minimum, \(\hat{\mathbf{y}}\) satisfies:
\[0 = \left.\frac{\partial}{\partial \boldsymbol{\theta}}\right|_{\hat{\mathbf{y}}}\left\lbrace(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^{T}P_{\lambda}\boldsymbol{\theta}\right\rbrace = - 2 X^TW(y - \hat{\mathbf{y}}) +2X^TP_{\lambda} \hat{\mathbf{y}}.\]
It follows that \((W + P_{\lambda})\hat{\boldsymbol{\theta}} = W\mathbf{y}\), and if \(W + P_{\lambda}\) is invertible, then \(\hat{\mathbf{y}}\) is indeed a solution for the original equation. When \(\lambda \neq 0\), \(W + P_{\lambda}\) is invertible as long as \(\mathbf{w}\) has \(q\) non-zero elements in the one-dimensional case, and \(\Omega\) has at least \(q_x \times q_z\) non-zero elements distributed over \(q_x\) different rows and \(q_z\) different columns in the two-dimensional case. These are sufficient conditions, which are always satisfied in practice and will not be demonstrated here.
The WH
package features two main functions
WH_1d
and WH_2d
corresponding to the
one-dimensional and two-dimensional cases respectively. Two arguments
are mandatory for those functions:
The vector (or matrix in the two-dimension case) d
corresponding to the number of observed events of interest by age (or by
age and duration in the two-dimension case). d
should have
named elements (or rows and columns) for the model results to be
extrapolated.
The vector (or matrix in the two-dimension case) ec
corresponding to the portfolio central exposure by age (or by age and
duration in the two-dimension case) whose dimensions should match those
of d
. The contribution of each individual to the portfolio
central exposure corresponds to the time the individual was actually
observed with corresponding age (and duration in the two-dimension cas).
It always ranges from 0 to 1 and is affected by individuals leaving the
portfolio, no matter the cause, as well as censoring and truncating
phenomena.
Additional arguments may be supplied, whose description is given in the documentation of the functions.
The package also embed two fictive agregated datasets to illustrate how to use it:
portfolio_mortality
contains the agregated number of
deaths and associated central exposure by age for an annuity
portfolio.
portfolio_LTC
contains the agregated number of
deaths and associated central exposure by age and duration (in years)
since the onset of LTC for the annuitant database of a long-term care
portfolio.
# One-dimensional case
d <- portfolio_mort$d
ec <- portfolio_mort$ec
WH_1d_fit <- WH_1d(d, ec)
Using outer iteration / Brent method
# Two-dimensional case
keep_age <- which(rowSums(portfolio_LTC$ec) > 5e2)
keep_duration <- which(colSums(portfolio_LTC$ec) > 1e3)
d <- portfolio_LTC$d[keep_age, keep_duration]
ec <- portfolio_LTC$ec[keep_age, keep_duration]
WH_2d_fit <- WH_2d(d, ec)
Using performance iteration / Nelder-Mead method
Functions WH_1d
and WH_2d
output objects of
class "WH_1d"
and "WH_2d"
to which additional
functions (including generic S3 methods) may be applied:
print
function provides a glimpse of the fitted
resultsWH_1d_fit
An object fitted using the WH_1D function
Initial data contains 74 data points:
Observation positions: 19 to 92
Smoothing parameter selected: 13454
Associated degrees of freedom: 7.5
WH_2d_fit
An object fitted using the WH_2D function
Initial data contains 90 data points:
First dimension: 75 to 89
Second dimension: 0 to 5
Smoothing parameters selected: 301 2
Associated degrees of freedom: 13
plot
function generates rough plots of the model
fit, the associated standard deviation, the model residuals or the
associated degrees of freedom. See the plot.WH_1d
and
plot.WH_2d
functions help for more details.predict
function generates an extrapolation of the
model. It requires a newdata
argument, a named list with
one or two elements corresponding to the positions of the new
observations. In the two-dimension case constraints are used so that the
predicted values matches the fitted values for the initial observations
(see Carballo, Durban, and Lee 2021 to understand
why this is required).output_to_df
function converts an
"WH_1d"
or "WH_2d"
object into a
data.frame
. Information about the fit is discarded in the
process. This function may be useful to produce better visualizations
from the data, for example using the ggplot2 package.From the explicit solution of WH smoothing, \(\mathbb{E}(\hat{\mathbf{y}}) = (W + P_{\lambda})^{-1}W\mathbb{E}(\mathbf{y}) \ne \mathbb{E}(\mathbf{y})\) when \(\lambda \ne 0\). This implies that penalization introduces a smoothing bias, which prevents the construction of a confidence interval centered around \(\mathbb{E}(\mathbf{y})\). Therefore, in this section, we turn to a Bayesian approach where smoothing can be interpreted more naturally.
Let us suppose that \(\mathbf{y} | \boldsymbol{\theta}\sim \mathcal{N}(\boldsymbol{\theta}, W^{-})\) and \(\boldsymbol{\theta} \sim \mathcal{N}(0, P_{\lambda}^{-})\) where \(P_{\lambda}^{-}\) denotes the pseudo-inverse of the matrix \(P_{\lambda}\). The Bayes’ formula allows us to express the posterior likelihood \(f(\boldsymbol{\theta} | \mathbf{y})\) associated with these choices in the following form:
\[\begin{aligned} f(\boldsymbol{\theta} | \mathbf{y}) &= \frac{f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta})}{f(y)} \\ &\propto f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta}) \\ &\propto \exp\left(- \frac{1}{2}(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})\right)\exp\left(-\frac{1}{2}\boldsymbol{\theta}^TP_{\lambda} \boldsymbol{\theta}\right) \\ &\propto \exp\left(- \frac{1}{2}\left[(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right]\right). \end{aligned}\]
The mode of the posterior distribution, \(\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\text{argmax}} [f(\boldsymbol{\theta} | \mathbf{y})]\), also known as the maximum a posteriori (MAP) estimate, coincides with the explicit solution \(\hat{\mathbf{y}}\).
A second-order Taylor expansion of the log-posterior likelihood around \(\hat{\mathbf{y}} = \hat{\boldsymbol{\theta}}\) gives us:
\[\ln f(\boldsymbol{\theta} | \mathbf{y}) = \ln f(\hat{\boldsymbol{\theta}} | \mathbf{y}) + \left.\frac{\partial \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}^{T}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) + \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T} \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\]
\[\text{where} \quad \left.\frac{\partial \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = 0 \quad \text{and} \quad \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = - (W + P_{\lambda}).\]
Note that this last derivative no longer depends on \(\boldsymbol{\theta}\), and higher-order derivatives beyond 2 are all zero. The Taylor expansion allows for an exact computation of \(\ln f(\boldsymbol{\theta} | \mathbf{y})\). By substituting the result back into the Taylor expansion, we obtain:
\[\begin{aligned} f(\boldsymbol{\theta} | \mathbf{y}) &\propto \exp\left[\ln f(\hat{\boldsymbol{\theta}} | \mathbf{y}) - \frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right] \\ &\propto \exp\left[- \frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right] \end{aligned}\]
which can immediately be recognized as the density of the \(\mathcal{N}(\hat{\boldsymbol{\theta}},(W + P_{\lambda})^{- 1})\) distribution.
The assumption \(\boldsymbol{\theta} \sim \mathcal{N}(0, P_{\lambda}^{-})\) corresponds to a simple Bayesian formalization of the smoothness criterion. It reflects a (improper) prior belief of the modeler about the underlying distribution of the observation vector \(\mathbf{y}\).
The use of Whittaker-Henderson smoothing in the Bayesian framework and the construction of credible intervals are conditioned on the validity of the assumption \(\mathbf{y} | \boldsymbol{\theta}\sim \mathcal{N}(\boldsymbol{\theta}, W^{-})\). The components of the observation vector should be independent and have known variances. The weight vector \(\mathbf{w}\) used should contain the inverses of these variances. Under these assumptions, credible intervals at \(100(1 - \alpha)\)% for smoothing can be constructed and take the form:
\[\mathbb{E}(\mathbf{y}) | \mathbf{y} \in \left[\hat{\mathbf{y}} \pm \Phi\left(1 -\frac{\alpha}{2}\right)\sqrt{\textbf{diag}\left\lbrace(W + P_{\lambda})^{-1}\right\rbrace}\right]\]
with probability \(1 -\frac{\alpha}{2}\) where \(\hat{\mathbf{y}} = (W + P_{\lambda})^{- 1}W\mathbf{y}\) and \(\Phi\) denotes the cumulative distribution function of the standard normal distribution. According to Marra and Wood (2012), these credible intervals provide satisfactory coverage of the corresponding frequentist confidence intervals and can therefore be used as practical substitutes.
The previous section highlighted the need to apply the Whittaker-Henderson smoothing to independent observation vectors \(\mathbf{y}\) and weight vectors \(\mathbf{w}\) corresponding to the inverses of the variances of the components of \(\mathbf{y}\) in order to obtain a measure of uncertainty in the results. In this section, we propose, within the framework of duration models used for constructing experience tables, vectors \(\mathbf{y}\) and \(\mathbf{w}\) that satisfy these conditions.
Consider the observation of \(m\) individuals in a longitudinal study subject to left truncation and right censoring phenomena. Suppose we want to estimate a single distribution that depends on only one continuous explanatory variable, denoted by \(x\). For illustration purposes, we can consider it as a mortality distribution with the explanatory variable of interest \(x\) representing age. Such a distribution is fully characterized by one of the following quantities:
Suppose that the considered distribution depends on a vector of parameters \(\boldsymbol{\theta}\) that we want to estimate using maximum likelihood. The likelihood associated with the observation of the individuals can be written as follows:
\[ \mathcal{L}(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\prod}} \left[\frac{f(x_i + t_i,\boldsymbol{\theta})}{S(x_i,\boldsymbol{\theta})}\right]^{\delta_i}\left[\frac{S(x_i + t_i,\boldsymbol{\theta})}{S(x_i,\boldsymbol{\theta})}\right]^{1 - \delta_i} \]
Where \(x_i\) represents the age at the start of observation, \(t_i\) represents the observation duration, i.e., the time elapsed between the start and end dates of observation, and \(\delta_i\) is the indicator of event observation, which takes the value 1 if the event of interest is observed and 0 if the observation is censored. We will not go into the details of calculating these three quantities, which should take into account the individual-specific information such as the subscription date, redemption date if applicable, as well as the global characteristics of the product, such as the presence of a deductible or medical selection phenomenon, and the choice of a restricted observation period due to data quality issues or delays in the reporting of observations. These factors typically lead to a narrower observation period than the actual presence period of individuals in the portfolio.
The various quantities introduced above are related by the following relationships:
\[ S(x) = \exp\left(\underset{u = 0}{\overset{x}{\int}}\mu(u)\text{d}u\right)\quad \text{and} \quad f(x) = \mu(x)S(x). \]
The maximization of the likelihood in that equation is equivalent to the maximization of the associated log-likelihood, which can be rewritten using only the instantaneous hazard function (also known as the mortality rate in the case of the death risk):
\[ \ell(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\sum}} \left[\delta_i \ln\mu(x_i + t_i,\boldsymbol{\theta}) - \underset{u = 0}{\overset{t_i}{\int}}\mu(x_i + u,\boldsymbol{\theta})\text{d}u\right] \]
To discretize the problem, let us assume that the mortality rate is piecewise constant over one-year intervals between two integer ages. Formally, we have \(\mu(x + \epsilon) = \mu(x)\) for all \(x \in \mathbb{N}\) and \(\epsilon \in [0,1[\). Furthermore, if \(\mathbf{1}\) denotes the indicator function, then for any \(0 \leq a < x_{\max}\), we have \(\sum_{x = x_{\min}}^{x_{\max}} \mathbf{1}(x \leq a < x + 1) = 1\), where \(x_{\min} = \min(\mathbf{x})\) and \(x_{\max} = \max(\mathbf{x})\). The log-likelihood can be rewritten as:
\[ \begin{aligned} \ell(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\sum}} &\left[\underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \delta_i\mathbf{1}(x \le x_i + t_i < x + 1)\ln\mu(x_i + t_i,\boldsymbol{\theta})\right. \\ &- \left.\underset{u = 0}{\overset{t_i}{\int}}\underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \mathbf{1}(x \le x_i + u < x + 1)\mu(x_i + u,\boldsymbol{\theta})\text{d}u\right]. \end{aligned} \]
The assumption of piecewise constant mortality rate implies that:
\[ \begin{aligned} \mathbf{1}(x \le x_i + t_i < x + 1) \ln\mu(x_i + t_i,\boldsymbol{\theta}) &= \mathbf{1}(x \le x_i + t_i < x + 1) \ln\mu(x,\boldsymbol{\theta})\quad\text{and}\\ \mathbf{1}(x \le x_i + u < x + 1)\mu(x_i + u,\boldsymbol{\theta}) &= \mathbf{1}(x \le x_i + u < x + 1) \ln\mu(x,\boldsymbol{\theta}). \end{aligned} \]
It is then possible to interchange the two summations to obtain the following expressions:
\[ \begin{aligned} \ell(\boldsymbol{\theta}) &= \underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \left[\ln\mu(x,\boldsymbol{\theta}) d(x) - \mu(x,\boldsymbol{\theta}) e_c(x)\right] \quad \text{where} \\ d(x) & = \underset{i = 1}{\overset{m}{\sum}} \delta_i \mathbf{1}(x \le x_i + t_i < x + 1) \quad \text{and} \\ e_c(x) & = \underset{i = 1}{\overset{m}{\sum}}\underset{u = 0}{\overset{t_i}{\int}}\mathbf{1}(x \le x_i + u < x + 1)\text{d}u = \underset{i = 1}{\overset{m}{\sum}} \left[\min(t_i, x - x_i + 1) - \max(0, x - x_i)\right]^+ \end{aligned} \]
by denoting \(a^+ = \max(a, 0)\), where \(d(x)\) and \(e_c(x)\) correspond to the number of observed deaths between ages \(x\) and \(x + 1\) and the sum of observation durations of individuals between these ages, respectively (the latter quantity is also known as central exposure to risk).
The extension of the proposed approach to the two-dimensional framework requires only minor adjustments to the previous reasoning. Let \(z_{\min} = \min(\mathbf{z})\) and \(z_{\max} = \max(\mathbf{z})\). The piecewise constant assumption for the mortality rate needs to be extended to the second dimension. Formally, we now assume that \(\mu(x + \epsilon, z + \xi) = \mu(x, z)\) for all pairs \(x, z \in \mathbb{N}\) and \(\epsilon, \xi \in [0,1[\). The sums involving the variable \(x\) are then replaced by double sums considering all combinations of \(x\) and \(z\). The log-likelihood is given by:
\[ \begin{aligned} \ell(\boldsymbol{\theta}) &= \underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \underset{z = z_{\min}}{\overset{z_{\max}}{\sum}}\left[\ln\mu(x,z,\boldsymbol{\theta}) d(x,z) - \mu(x,z,\boldsymbol{\theta}) \mathbf{e_c}(x,z)\right] \quad \text{where} \\ d(x,z) & = \underset{i = 1}{\overset{m}{\sum}} \delta_i \mathbf{1}(x \le x_i + t_i < x + 1) \mathbf{1}(z \le z_i + t_i < z + 1) \quad \text{and}\\ \mathbf{e_c}(x,z) & = \underset{i = 1}{\overset{m}{\sum}}\underset{u = 0}{\overset{t_i}{\int}}\mathbf{1}(x \le x_i + u < x + 1)\mathbf{1}(z \le z_i + u < z + 1)\text{d}u \\ & = \underset{i = 1}{\overset{m}{\sum}} \left[\min(t_i, x + 1 - x_i, z + 1 - z_i) - \max(0, x - x_i, z - z_i)\right]^+ \end{aligned} \]
The choice \(\boldsymbol{\mu}(\boldsymbol{\theta}) = \exp(\boldsymbol{\theta})\), which corresponds to considering one parameter per observation, allows us to relate to the Whittaker-Henderson smoothing. Using the exponential function ensures positive values for the estimated mortality rate. The expressions of the likelihood in the unidimensional or two-dimensional case can then be written in a common vectorized form:
\[ \ell(\boldsymbol{\theta}) = \boldsymbol{\theta}^{T}\mathbf{d} - \exp(\boldsymbol{\theta})^{T}\mathbf{e_c} \]
where \(\mathbf{d}\) and \(\mathbf{e}_c\) represent the vectors of observed deaths and exposures to risk, respectively. The derivatives of the likelihood function for this model are given by:
\[\frac{\partial \ell}{\partial \boldsymbol{\theta}} = \left[\mathbf{d} -\exp(\boldsymbol{\theta}) \odot \mathbf{e_c}\right] \quad \text{and} \quad \frac{\partial^2 \ell}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^T} = - \text{Diag}(\exp(\boldsymbol{\theta}) \odot \mathbf{e_c}).\]
Note that these likelihood equations are exactly what we would obtain by assuming that the observed numbers of deaths, conditional on the observed exposures to risk \(\mathbf{e}_c\), follow Poisson distributions with parameters \(\boldsymbol{\mu}(\boldsymbol{\theta}) \odot \mathbf{e}_c\). The model presented here has many similarities with a Poisson GLM (John Ashworth Nelder and Wedderburn 1972). However, the initial assumptions are not the same for both models.
These likelihood equations have an explicit solution given by \(\hat{\boldsymbol{\theta}} = \ln(\mathbf{d} / \mathbf{e}_c)\). This model, which treats each age independently, is known as the crude rates estimator. The properties of the maximum likelihood estimator imply that asymptotically \(\hat{\boldsymbol{\theta}} \sim \mathcal{N}(\boldsymbol{\theta}, W_{\hat{\boldsymbol{\theta}}}^{-1})\), where \(W_{\hat{\boldsymbol{\theta}}}\) is a diagonal matrix with elements \(\exp(\hat{\boldsymbol{\theta}}) \odot \mathbf{e}_c = (\mathbf{d} / \mathbf{e}_c) \odot \mathbf{e}_c = \mathbf{d}\). It should be noted that the asymptotic nature and the validity of this approximation are related to the number of individuals \(m\) in the portfolio and not the size \(n\) of the vectors \(\mathbf{d}\) and \(\mathbf{e}_c\).
Thus, we have shown that in the framework of duration models, using the crude rates estimator, asymptotically \(\ln(\mathbf{d} / \mathbf{e}_c) \sim \mathcal{N}(\ln\boldsymbol{\mu}, W^{-1})\), where \(W = \text{Diag}(\mathbf{d})\). This justifies applying the Whittaker-Henderson smoothing to the observation vector \(y = \ln(\mathbf{d} / \mathbf{e}_c)\) and weight vector \(\mathbf{w} = \mathbf{d}\). We obtain the following associated credibility intervals:
\[\ln\boldsymbol{\mu} | \mathbf{d}, \mathbf{e_c} \in \left[\hat{\boldsymbol{\theta}} \pm \Phi\left(1 -\frac{\alpha}{2}\right)\sqrt{\textbf{diag}\left\lbrace(\text{Diag}(\mathbf{d}) + P_{\lambda})^{-1}\right\rbrace}\right]\]
where \(\hat{\boldsymbol{\theta}} = (\text{Diag}(\mathbf{d}) + P_{\lambda})^{-1}\text{Diag}(\mathbf{d})[\ln(\mathbf{d}) - \ln \mathbf{e}_c]\). Results on \(\boldsymbol{\mu}\) can be obtained directly by exponentiating the above expression.
In the definition of WH smoothing, \((\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})\) represents a fidelity criterion to the observations, and \(\boldsymbol{\theta}^{T}P_{\lambda}\boldsymbol{\theta}\) represents a smoothness criterion. The relative importance of these criteria is controlled by the parameter (or pair of parameters in the two-dimensional case) \(\lambda\). The result of smoothing is highly sensitive to the chosen value of the smoothing parameter. Ideally, the selection of the smoothing parameter is based on the optimization of a statistical criterion, typically belonging to one of two major families.
On one hand, there are criteria based on the minimization of the model’s prediction error, among which the Akaike information criterion (AIC, Akaike 1973) and the generalized cross-validation (GCV, Wahba 1980) are included. The Bayesian information criterion (BIC, Schwarz 1978) has a similar form to AIC, although its theoretical justification is very different, and thus it can be considered part of this group.
On the other hand, there are criteria based on the maximization of a likelihood function known as the marginal likelihood. This type of criterion was introduced by Patterson H. D. (1971) in the Gaussian case, initially under the name of restricted likelihood (REML), and used by Anderssen and Bloomfield (1974) for the selection of smoothing parameters. Wahba (1985) and Kauermann (2005) show that criteria minimizing prediction error have the best asymptotic performance, but their convergence to the optimal smoothing parameters is slower. For finite sample sizes, criteria based on the maximization of a likelihood function are considered a more robust choice by many authors, such as Reiss and Todd Ogden (2009) or Wood (2011). For these reasons, we will favor the marginal likelihood as the selection criterion, especially since this choice naturally fits into the Bayesian framework introduced in the past sections.
Let us start from the notations and assumptions from last sections, namely \(\mathbf{y} | \boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, W^{-})\) and \(\boldsymbol{\theta} | \lambda \sim \mathcal{N}(0, P_{\lambda} ^{-})\). In a purely Bayesian approach, it would be necessary to define a prior distribution on \(\lambda\) and then estimate the posterior distribution of each parameter vector using methods such as Markov Chain Monte Carlo. The empirical Bayesian approach we adopt seeks to find the value of \(\lambda\) that maximizes the marginal likelihood:
\[\mathcal{L}^m_\text{norm}(\lambda) = f(\mathbf{y} | \lambda) = \int f(\mathbf{y}, \boldsymbol{\theta} | \lambda)\text{d}\boldsymbol{\theta} = \int f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta} |\lambda)\text{d}\boldsymbol{\theta}.\]
This corresponds to the maximum likelihood method applied to the smoothing parameter. Let us explicitly rewrite the expressions of \(f(\mathbf{y} | \boldsymbol{\theta})\) and \(f(\boldsymbol{\theta} |\lambda)\) introduced previously:
\[\begin{aligned} f(\mathbf{y} | \boldsymbol{\theta}) &= \sqrt{\frac{|W|_{+}}{(2\pi)^{n_*}}}\exp\left(- \frac{1}{2}(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})\right) \\ f(\boldsymbol{\theta} |\lambda) &= \sqrt{\frac{|P_{\lambda}|_{+}}{(2\pi)^{p - q}}} \exp\left(- \frac{1}{2}\boldsymbol{\theta}^{T}P_{\lambda} \boldsymbol{\theta}\right) \end{aligned}\]
where \(|A|_{+}\) denotes the product of the non-zero eigenvalues of \(A\), \(n_*\) is the number of non-zero diagonal elements of \(W\), and \(q\) is the number of zero eigenvalues of \(P_{\lambda}\) (\(q = q_x \times q_z\) in the two-dimensional case). Based on the Taylor expansion performed to obtain credibility intervals, let us recall that:
\[ \ln f(\mathbf{y}, \boldsymbol{\theta} | \lambda) = \ln f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) + \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda})^{T} (W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda}) \]
which leads to:
\[\begin{aligned} \mathcal{L}^m_\text{norm}(\lambda) &= \int \exp[\ln f(\mathbf{y}, \boldsymbol{\theta} | \lambda)]\text{d}\boldsymbol{\theta} \\ &= f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \int \exp\left[- \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda}) \right]\text{d}\boldsymbol{\theta} \\ &= f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \sqrt{\frac{(2\pi)^p}{|W + P_{\lambda}|}} \\ &= f_\mathbf{y}(\mathbf{y} | \hat{\boldsymbol{\theta}}_{\lambda}) f_{\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \sqrt{\frac{(2\pi)^p}{|W + P_{\lambda}|}} \\ &= \sqrt{\frac{|W|_{+}| P_{\lambda} |_{+}}{(2\pi)^{n_* - q}|W + P_{\lambda}|}} \exp\left(- \frac{1}{2}\left[(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda})^{T}W(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda}) + \hat{\boldsymbol{\theta}}_{\lambda}^T P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda}\right]\right). \end{aligned}\]
The associated log-likelihood can be expressed as follows:
\[\begin{aligned} \ell^m_\text{reg}(\lambda) = - \frac{1}{2}&\left[(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda})^{T}W(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda}) + \hat{\boldsymbol{\theta}}_{\lambda}^{T}P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda}\right. \\ &\left.- \ln|W|_{+} - \ln|P_{\lambda}|_{+} + \ln|W + P_{\lambda}| + (n_* - q)\ln(2\pi)\right]. \end{aligned}\]
Once the selection of \(\lambda\)
has been determined using the estimator \(\hat{\lambda} =
\underset{\lambda}{\text{argmax}}\: \ell^m_\text{reg}(\lambda)\),
the lack of an explicit solution to this equation forces us to resort to
numerical methods for its resolution. The Newton algorithm could once
again be employed here and is a robust choice. This approach was notably
adopted by Wood (2011). However,
explicitly calculating the derivatives of the likelihood \(\ell^m_\text{reg}\) is rather difficult
from an operational perspective. Instead, we will use general heuristics
such as those provided by Brent (1973) and
John A. Nelder and Mead (1965), which are
applicable to any sufficiently smooth function. These heuristics do not
require derivative calculations and are implemented in the
optimize
and optim
functions of the
statistical programming language \(\mathsf{R}\).
See the upcoming paper available here