In this vignette, we estimate mortality rates, and summary indicators such as life expectancy. We work with data from Iceland. We try several methods for representing age patterns, including one that uses typically patterns from the Human Mortality Database. All our model-based estimates come with measures of uncertainty.
In addition to bage itself, we need package poputils, which contains functions for calculating life expectancy, and rvec, which contains data structures and functions for working with random draws.
library(bage)
library(poputils)
library(rvec)
We use standard tidyverse tools for data manipulation and graphing.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(ggplot2)
The Iceland deaths data is included in the bage package. It contains counts of deaths, and estimates of the mid-year population, disaggregated by age and sex, for the years 1998–2022.
<- bage::deaths
dth
dth#> # A tibble: 5,300 × 5
#> age sex time deaths popn
#> <fct> <fct> <int> <int> <dbl>
#> 1 0 Female 1998 7 2051
#> 2 1 Female 1998 1 2082.
#> 3 2 Female 1998 1 2088
#> 4 3 Female 1998 3 2112.
#> 5 4 Female 1998 1 2204
#> 6 5 Female 1998 0 2232.
#> 7 6 Female 1998 1 2175
#> 8 7 Female 1998 0 2240
#> 9 8 Female 1998 0 2261
#> 10 9 Female 1998 0 2182.
#> # ℹ 5,290 more rows
The oldest age group is 105 years and older.
tail(dth, n = 3)
#> # A tibble: 3 × 5
#> age sex time deaths popn
#> <fct> <fct> <int> <int> <dbl>
#> 1 103 Male 2022 0 1.5
#> 2 104 Male 2022 1 1
#> 3 105+ Male 2022 0 0
The data is sparse. Twenty-two percent of death counts are 0, and half are 3 or less.
|>
dth count(deaths) |>
mutate(percent = round(100 * n / sum(n)),
cumulative_percent = cumsum(percent)) |>
head()
#> # A tibble: 6 × 4
#> deaths n percent cumulative_percent
#> <int> <int> <dbl> <dbl>
#> 1 0 1180 22 22
#> 2 1 701 13 35
#> 3 2 424 8 43
#> 4 3 333 6 49
#> 5 4 239 5 54
#> 6 5 194 4 58
We plot ‘direct’ estimates of death rates in the first and least years of the data. Direct estimates are calculated by dividing the number of deaths by the corresponding population at risk, independently for each combination of age, sex, and time. The results are shown below, on a log scale. The dots at the bottom of the graph represent log-rates of negative infinity, which occur in cells where no deaths were observed. Random variation obscures any patterns before about age 50. After age 50, rates increase more or less linearly.
|>
dth filter(time %in% c(1998, 2010, 2022)) |>
mutate(rate = deaths / popn) |>
ggplot(aes(x = age_mid(age), y = rate)) + ## 'age_mid()' returns the mid point
facet_grid(vars(sex), vars(time)) + ## of the age group, which is useful
geom_point(alpha = 0.5) + ## for plotting
scale_y_log10() +
ggtitle("Direct estimates of mortality rates")
#> Warning in scale_y_log10(): log-10 transformation introduced infinite values.
#> Warning: Removed 6 rows containing missing values or values outside the scale range
#> (`geom_point()`).
We fit an initial simple model. In our prior model, we allow for an interaction between age and sex, but do not allow for any interactions involving time. We assume, in other words, that mortality rates rise or fall at the same rate across all age-sex groups. This assumption is unlikely to be met exactly in practice, so we revisit it later. We accept all the default priors for main effects and interactions. As can be seen at the bottom of the printout, mod_pois()
has guessed which variables represent age, sex, and time. (It bases its guesses on the variable names.) Age and time main effects, by default, get “random walk” priors, and other terms get “normal” priors.
But before running the model there is an annoying, but common, problem with the data with.
It turns out there are 5 cases with non-zero deaths but zero population.
|>
dth filter(deaths > 0 & popn == 0)
#> # A tibble: 5 × 5
#> age sex time deaths popn
#> <fct> <fct> <int> <int> <dbl>
#> 1 104 Male 2001 1 0
#> 2 104 Male 2007 1 0
#> 3 104 Male 2012 1 0
#> 4 104 Female 2019 2 0
#> 5 105+ Male 2020 1 0
We could edit the data directly, removing or modify these cases. However, dealing with these cases is a modelling decision, which we would like to include in the model itself. We make use of the fact that exposure can be specified via a formula to modify cases where we have deaths but no population. Our model is then
Our model is then
<- mod_pois(deaths ~ age * sex + time,
mod_base data = dth,
exposure = ~ ifelse(deaths > 0 & popn == 0, 0.5, popn))
mod_base#> -- Unfitted Poisson model --
#>
#> deaths ~ age * sex + time
#>
#> (Intercept) ~ NFix()
#> age ~ RW()
#> sex ~ NFix()
#> time ~ RW()
#> age:sex ~ RW()
#>
#> dispersion: mean=1
#> exposure: ~ifelse(deaths > 0 & popn == 0, 0.5, popn)
#> var_age: age
#> var_sexgender: sex
#> var_time: time
#> n_draw: 1000
The model implemented by mod_base
assumes that
\[\begin{equation}
y_i \sim \text{Poisson}(\gamma_i w_i)
\end{equation}\]
where
By modelling deaths as draws from a Poisson distribution, the model recognizes the contribution of individual-level randomness to observed death counts. Recognizing individual-level randomness is important when analyzing data where cell counts are small.
Death rates \(\gamma_i\) are treated as draws from a Gamma distribution,
\[\begin{equation} \gamma_i \sim \text{Gamma}(\xi^{-1}, (\xi \mu_i)^{-1}). \end{equation}\]
The expected value for \(\gamma_i\) is \(\mu_i\), and the variance is \(\xi \mu_i^2\). By modelling \(\gamma_i\) as a draw from a distribution, we are recognizing our model is only able to explain some of the variation in \(\gamma_i\). Larger values for \(\xi\) imply more unexplained variation.
We model \(\mu_i\), on the log scale, as a sum of factors formed from the the dimensions of the data, \[\begin{equation} \log \mu_i = \sum_{m=0}^4 \beta_{j_i^m}^{(m)} \end{equation}\] where
Each of the \(\beta^{(m)}\) in the model for \(\mu_i\) receives a prior. The default prior for the intercept is \[\begin{equation} \beta^{(0)} \sim \text{N}(0, 10^2) \end{equation}\] This prior allows for a much broader range of values than we would expect to see in practice. We would only rarely expect to see values of \(\gamma_i\) that were lower than 0.001 (which is -7 on a log scale) or higher than 1 (which is 0).
The default prior for age is a first-order random walk, \[\begin{equation} \beta_j^{(1)} \sim \text{N}(\beta_{j-1}^{(1)}, \tau_1^2), \quad j = 2, \dots, J_1 \end{equation}\] A random walk prior embodies the idea that we expect changes from one age group to the next to be relatively small, and that neighboring age groups are more strongly correlated than distant age groups.
The value for the standard deviation parameter \(\tau_1\) is estimated as part of the model, and has its own prior, \[\begin{equation} \tau_1^2 \sim \text{N}^+(0, 1) \end{equation}\] \(\text{N}^+(0, 1)\) denotes a half-normal distribution, that is, a normal distribution restricted to non-negative values.
For this prior to be properly identified, we need to place some constraint on the overall level. We use a soft constraint on the mean, \[\begin{equation} \frac{1}{J_1}\sum_{j=1}^{J_1} \beta_j^{(1)} \sim \text{N}(0, 1) \end{equation}\]
The sex main effect has the prior \[\begin{align} \beta_j^{(2)} & \sim \text{N}(0, \tau_2^2), \quad j = 1, 2 \\ \tau_2^2 & \sim \text{N}^+(0, 1) \end{align}\]
The time term \(\beta^{(3)}\) has the same random-walk prior as the age effect. The age-sex interaction \(\beta^{(4)}\) has the same prior as the sex effect.
Finally, the dispersion term \(\xi\) has a penalized-complexity prior [@simpson2022priors], \[\begin{equation} p(\xi) = \frac{1}{2 \sqrt{\xi}}e^{-\sqrt{\xi}}. \end{equation}\]
We fit the model by calling function fit()
:
<- fit(mod_base)
mod_base
mod_base#> -- Fitted Poisson model --
#>
#> deaths ~ age * sex + time
#>
#> (Intercept) ~ NFix()
#> age ~ RW()
#> sex ~ NFix()
#> time ~ RW()
#> age:sex ~ RW()
#>
#> dispersion: mean=1
#> exposure: ~ifelse(deaths > 0 & popn == 0, 0.5, popn)
#> var_age: age
#> var_sexgender: sex
#> var_time: time
#> n_draw: 1000
To extract estimated rates from the fitted model object, we use function augment()
.
<- augment(mod_base)
aug_base
aug_base#> # A tibble: 5,300 × 8
#> age sex time deaths popn .observed .fitted
#> <fct> <fct> <int> <int> <dbl> <dbl> <rdbl<1000>>
#> 1 0 Female 1998 7 2051 0.00341 0.002 (0.0017, 0.0026)
#> 2 1 Female 1998 1 2082. 0.000480 0.00049 (0.00038, 0.00066)
#> 3 2 Female 1998 1 2088 0.000479 0.00026 (0.00019, 0.00036)
#> 4 3 Female 1998 3 2112. 0.00142 0.00017 (0.00011, 0.00024)
#> 5 4 Female 1998 1 2204 0.000454 0.00012 (8.2e-05, 0.00018)
#> 6 5 Female 1998 0 2232. 0 0.00011 (7.3e-05, 0.00016)
#> 7 6 Female 1998 1 2175 0.000460 9.3e-05 (6.3e-05, 0.00014)
#> 8 7 Female 1998 0 2240 0 7.6e-05 (5e-05, 0.00012)
#> 9 8 Female 1998 0 2261 0 7.8e-05 (5e-05, 0.00012)
#> 10 9 Female 1998 0 2182. 0 8.1e-05 (5.4e-05, 0.00012)
#> # ℹ 5,290 more rows
#> # ℹ 1 more variable: .expected <rdbl<1000>>
Function augment()
starts with the original data, and adds
.observed
containing direct estimates (\(y_i/w_i\)),.fitted
containing estimates of the rates (\(\gamma_i\)), and.expected
containing expected values for the rates (\(\mu_i\)).The .fitted
and .expected
columns both consist of rvecs. An rvec is a vector-like object that holds multiple draws, in this case draws from posterior distributions.
Next we extract rates estimates for selected years, and summarize the posterior distributions.
<- aug_base |>
rates_base filter(time %in% c(1998, 2010, 2022)) |>
select(age, sex, time, .observed, .fitted) |>
mutate(draws_ci(.fitted))
rates_base#> # A tibble: 636 × 8
#> age sex time .observed .fitted .fitted.lower
#> <fct> <fct> <int> <dbl> <rdbl<1000>> <dbl>
#> 1 0 Female 1998 0.00341 0.002 (0.0017, 0.0026) 0.00166
#> 2 1 Female 1998 0.000480 0.00049 (0.00038, 0.00066) 0.000376
#> 3 2 Female 1998 0.000479 0.00026 (0.00019, 0.00036) 0.000186
#> 4 3 Female 1998 0.00142 0.00017 (0.00011, 0.00024) 0.000113
#> 5 4 Female 1998 0.000454 0.00012 (8.2e-05, 0.00018) 0.0000820
#> 6 5 Female 1998 0 0.00011 (7.3e-05, 0.00016) 0.0000727
#> 7 6 Female 1998 0.000460 9.3e-05 (6.3e-05, 0.00014) 0.0000631
#> 8 7 Female 1998 0 7.6e-05 (5e-05, 0.00012) 0.0000504
#> 9 8 Female 1998 0 7.8e-05 (5e-05, 0.00012) 0.0000499
#> 10 9 Female 1998 0 8.1e-05 (5.4e-05, 0.00012) 0.0000540
#> # ℹ 626 more rows
#> # ℹ 2 more variables: .fitted.mid <dbl>, .fitted.upper <dbl>
We plot point estimates and 95% credible intervals for the modelled estimates, together with the original direct estimates,
ggplot(rates_base, aes(x = age_mid(age),
ymin = .fitted.lower,
y = .fitted.mid,
ymax = .fitted.upper)) +
facet_grid(vars(sex), vars(time)) +
geom_ribbon(fill = "lightblue") +
geom_line(col= "darkblue") +
geom_point(aes(y = .observed),
size = 0.2) +
scale_y_log10() +
ggtitle("Modelled and direct estimates of mortality rates - base model")
#> Warning in scale_y_log10(): log-10 transformation introduced infinite values.
#> Warning: Removed 6 rows containing missing values or values outside the scale range
#> (`geom_point()`).
We can gain insights into the model by extracting and graphing estimates of the main effects and interactions, the \(\pmb{\beta}^{(m)}\).
Estimates of the main effects and interactions, and of other higher-level parameters, can be obtained with function components()
.
<- components(mod_base)
comp_base
comp_base#> # A tibble: 350 × 4
#> term component level .fitted
#> <chr> <chr> <chr> <rdbl<1000>>
#> 1 (Intercept) effect (Intercept) -1.5 (-3.2, 0.14)
#> 2 age effect 0 -0.85 (-1.2, -0.5)
#> 3 age effect 1 -2.2 (-2.6, -1.8)
#> 4 age effect 2 -2.9 (-3.3, -2.5)
#> 5 age effect 3 -3.3 (-3.7, -2.9)
#> 6 age effect 4 -3.6 (-4, -3.1)
#> 7 age effect 5 -3.7 (-4.1, -3.2)
#> 8 age effect 6 -3.8 (-4.3, -3.4)
#> 9 age effect 7 -4 (-4.5, -3.6)
#> 10 age effect 8 -4 (-4.5, -3.5)
#> # ℹ 340 more rows
components()
returns a tibble contain estimates of all the higher-level parameters. To extract the age effect, and prepare it for graphing, we use
<- comp_base |>
age_effect filter(component == "effect",
== "age") |>
term mutate(draws_ci(.fitted))
A graph of the age effect reveals a typical profile for mortality rates,
ggplot(age_effect,
aes(x = age_mid(level),
y = .fitted.mid,
ymin = .fitted.lower,
ymax = .fitted.upper)) +
geom_ribbon(fill = "lightblue") +
geom_line() +
ggtitle("Age effect")
The Human Mortality Database provides estimates of mortality rates for many countries. Let \(\pmb{M}\) denote a matrix holding these estimates, on a log scale. By applying a Singlar Value Decomposition to \(\pmb{M}\), and then rescaling, we can construct a matrix \(\pmb{F}\) and a column vector \(\pmb{g}\), that parsimoniously summarize the age-profiles, or age-sex profiles, that are observed in \(\pmb{M}\). If \(z\) is a vector of independent draws from a \(\text{N}(0,1)\) distribution, then the resulting vector \[\begin{equation} \pmb{m} = \pmb{F} \pmb{z} + \pmb{g} \end{equation}\] looks like a randomly-selected column from \(\pmb{M}\).
We use \(\pmb{F}\) and \(\pmb{g}\) to build priors for age effects and age-sex interactions. The prior for age effects is \[\begin{align} \pmb{\beta}^{(1)} & = \pmb{F} \pmb{\alpha} + \pmb{g} \\ \alpha_k & \sim \text{N}(0, 1), \quad k = 1, \cdots, K \end{align}\] where \(K\) is chosen by the user. The prior for age-sex interactions looks the same except that, by default, the calculations are done separately for each sex.
<- mod_pois(deaths ~ age:sex + time,
mod_hmd data = dth,
exposure = ~ ifelse(deaths > 0 & popn == 0, 0.5, popn)) |>
set_prior(age:sex ~ SVD(HMD))
mod_hmd#> -- Unfitted Poisson model --
#>
#> deaths ~ age:sex + time
#>
#> (Intercept) ~ NFix()
#> time ~ RW()
#> age:sex ~ SVD(HMD)
#>
#> dispersion: mean=1
#> exposure: ~ifelse(deaths > 0 & popn == 0, 0.5, popn)
#> var_age: age
#> var_sexgender: sex
#> var_time: time
#> n_draw: 1000
<- fit(mod_hmd)
mod_hmd
mod_hmd#> -- Fitted Poisson model --
#>
#> deaths ~ age:sex + time
#>
#> (Intercept) ~ NFix()
#> time ~ RW()
#> age:sex ~ SVD(HMD)
#>
#> dispersion: mean=1
#> exposure: ~ifelse(deaths > 0 & popn == 0, 0.5, popn)
#> var_age: age
#> var_sexgender: sex
#> var_time: time
#> n_draw: 1000
<- augment(mod_hmd)
aug_hmd
<- aug_hmd |>
rates_hmd filter(time %in% c(1998, 2010, 2022)) |>
select(age, sex, time, .observed, .fitted) |>
mutate(draws_ci(.fitted))
ggplot(rates_hmd, aes(x = age_mid(age),
ymin = .fitted.lower,
y = .fitted.mid,
ymax = .fitted.upper)) +
facet_grid(vars(sex), vars(time)) +
geom_ribbon(fill = "lightblue") +
geom_line(col= "darkblue") +
geom_point(aes(y = .observed),
size = 0.2) +
scale_y_log10() +
ggtitle("Modelled and direct estimates of mortality rates - HMD model")
#> Warning in scale_y_log10(): log-10 transformation introduced infinite values.
#> Warning: Removed 6 rows containing missing values or values outside the scale range
#> (`geom_point()`).
<- components(mod_hmd)
comp_hmd
<- comp_hmd |>
age_sex_interact filter(component == "effect",
== "age:sex") |>
term separate_wider_delim(level, delim = ".", names = c("age", "sex")) |>
mutate(draws_ci(.fitted))
ggplot(age_sex_interact,
aes(x = age_mid(age),
y = .fitted.mid,
ymin = .fitted.lower,
ymax = .fitted.upper)) +
geom_ribbon(aes(fill = sex),
alpha = 0.3) +
geom_line(aes(col = sex)) +
ggtitle("Age-sex interaction")
<- replicate_data(mod_base, condition_on = "expected")
rep_data_base
<- rep_data_base |>
data filter(time == 2022) |>
select(-popn) |>
pivot_wider(names_from = sex, values_from = deaths) |>
mutate(diff = Female - Male)
ggplot(data, aes(x = age_mid(age), y = diff)) +
facet_wrap(vars(.replicate)) +
geom_point(size = 0.2)
<- replicate_data(mod_hmd, condition_on = "expected")
rep_data_hmd
<- rep_data_hmd |>
data filter(time == 2022) |>
select(-popn) |>
pivot_wider(names_from = sex, values_from = deaths) |>
mutate(diff = Female - Male)
ggplot(data, aes(x = age_mid(age), y = diff)) +
facet_wrap(vars(.replicate)) +
geom_point(size = 0.2)
<- mod_hmd |>
lifeexp_hmd augment() |>
lifeexp(mx = .fitted,
by = c(time, sex))
lifeexp_hmd#> # A tibble: 50 × 3
#> time sex ex
#> <int> <fct> <rdbl<1000>>
#> 1 1998 Female 82 (81, 82)
#> 2 1998 Male 78 (77, 78)
#> 3 1999 Female 82 (81, 82)
#> 4 1999 Male 78 (78, 78)
#> 5 2000 Female 82 (82, 82)
#> 6 2000 Male 78 (78, 79)
#> 7 2001 Female 82 (82, 83)
#> 8 2001 Male 79 (78, 79)
#> 9 2002 Female 82 (82, 83)
#> 10 2002 Male 79 (78, 79)
#> # ℹ 40 more rows
<- mod_hmd |>
lifeexp_hmd augment() |>
lifeexp(mx = .fitted,
by = c(time, sex))
lifeexp_hmd#> # A tibble: 50 × 3
#> time sex ex
#> <int> <fct> <rdbl<1000>>
#> 1 1998 Female 82 (81, 82)
#> 2 1998 Male 78 (77, 78)
#> 3 1999 Female 82 (81, 82)
#> 4 1999 Male 78 (78, 78)
#> 5 2000 Female 82 (82, 82)
#> 6 2000 Male 78 (78, 79)
#> 7 2001 Female 82 (82, 83)
#> 8 2001 Male 79 (78, 79)
#> 9 2002 Female 82 (82, 83)
#> 10 2002 Male 79 (78, 79)
#> # ℹ 40 more rows