Semi-Parametric Distribution Demo

Alexios Galanos

2024-08-22

This provides a quick demo to illustrate the semi-parametric distribution. This distribution estimates the tails of the data using the Generalized Pareto distribution (gpd) whilst the interior is estimated using a kernel density. The tails and interior are then smoothly joined together to form the semi-parametric distribution.

For this demo, we’ll generate a random sample from the Generalized Hyperbolic Skew Student distribution and then model the top and bottom 1% of the data using the Generalized Pareto distribution. Choosing the correct threshold when applying the Generalized Pareto distribution is an important step to ensure that the asymptotic distribution fits the data well and the parameter estimates are stable. There are a number of methods to test for this, and numerous packages in R which can help. The mev package, which is used for estimating the GPD (except in the case of the probability weighted moments approach) provides threshold stability plots and other diagnostics for aid in choosing the right threshold.

There are a total of 4 parameters, 2 for each tail, estimated independently. As such, the standard errors are based on a block diagonal variance-covariance matrix.

Code
library(tsdistributions)
library(data.table)
library(knitr)
# simulate data
set.seed(200)
n <- 6000
sim <- rghst(n, mu = 0, sigma = 1, skew = -0.6, shape = 8.01)
spec <- spd_modelspec(sim, lower = 0.01, upper = 0.99, kernel_type = 'normal')
mod <- estimate(spec, method = "pwm")
print(summary(mod))
#> SPD Model Summary
#> 
#> Coefficients:
#>                    Estimate Std. Error t value Pr(>|t|)    
#> [lower tail] scale  0.63612    0.12620   5.041 4.64e-07 ***
#> [lower tail] shape -0.07680    0.15348  -0.500    0.617    
#> [upper tail] scale  0.50206    0.09886   5.078 3.81e-07 ***
#> [upper tail] shape -0.03801    0.15033  -0.253    0.800    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> N: 6000,  dof: 4,  
#> LogLik: -8338,  AIC:  -16670,  BIC: -16640

We’ll investigate the goodness of fit of the distribution using some visuals:

Code
lower_treshold <- mod$gpd$lower$threshold
upper_treshold <- mod$gpd$upper$threshold
oldpar <- par(mfrow = c(1,1))
par(mfrow = c(3,1), mar = c(3,3,3,3))
hist(sim, breaks = 300, probability = TRUE, xlab = "r", 
     col = "steelblue", border = "whitesmoke",main = "PDF")
box()
abline(v = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(v = upper_treshold, col = 'mediumpurple3', lty = 2)
lines(sort(sim), dspd(sort(sim), mod), col = "darkorange",lwd = 2)

plot(sort(sim), (1:length(sim)/length(sim)), 
     ylim = c(0, 1), pch = 19, 
     cex = 0.5, ylab = "p", xlab = "q", main = "CDF")
grid()
lines(sort(sim), pspd(sort(sim), mod), col = "darkorange",lwd = 2)

abline(v = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(v = upper_treshold, col = 'mediumpurple3', lty = 2)

plot(sort(sim), qspd(ppoints(length(sim)), mod), ylab = "Model Simulated", xlab = "Observed", main = "Q-Q")
abline(0, 1, col = "darkorange")
grid()
abline(v = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(v = upper_treshold, col = 'mediumpurple3', lty = 2)
abline(h = lower_treshold, col = 'mediumpurple3', lty = 2)
abline(h = upper_treshold, col = 'mediumpurple3', lty = 2)

Code
par(oldpar)

The vertical purple lines show the region modeled by the kernel density, representing the thresholds beyond which the GPD is used.

To calculate expectations from the estimated object we can use quadrature integration, as shown below, with calculations for the mean, variance, skewness, kurtosis, value at risk (1%) and expected shortfall (1%). We choose reasonably large but finite values for the numerical integration limits.

Code
f <- function(x) x * dspd(x, mod)
mu <- integrate(f, -50, 50)$value
f <- function(x) x^2 * dspd(x, mod)
variance <- integrate(f, -50, 50)$value
f <- function(x) (x - mu)^3 * dspd(x, mod)
skewness <- integrate(f, -50, 50)$value/(variance^(3/2))
f <- function(x) (x - mu)^4 * dspd(x, mod)
kurtosis <- integrate(f, -50, 50)$value/(variance^2)
value_at_risk <- qspd(0.01, mod)
f <- function(x) x * dspd(x, mod)
expected_shortfall <- integrate(f, -50, value_at_risk)$value/0.01

tab <- data.table(stat = c("mean","variance","skewness","kurtosis","VaR(1%)","ES(1%)"),
                  spd = c(mu, variance, skewness, kurtosis, value_at_risk, expected_shortfall),
                  observed = c(mean(sim), var(sim), mean((sim - mean(sim))^3)/(sd(sim)^3), 
                               mean((sim - mean(sim))^4)/(sd(sim)^4),quantile(sim, 0.01),
                               mean(sim[sim <= quantile(sim, 0.01)])))
kable(tab, digits = 2)
stat spd observed
mean -0.01 -0.02
variance 1.03 0.98
skewness -0.21 -0.28
kurtosis 4.16 4.28
VaR(1%) -2.79 -2.79
ES(1%) -3.28 -3.38

The results appear promising and close to the observed values. Note that even though we simulated the data from a known distribution, in practice we usually don’t know which distribution the data came from. Therefore, the semi-parametric distribution provides a good trade-off between fully parametric and non-parametric approaches. The caveat is that we still need to be careful in choosing the thresholds. There is no free lunch!