Automatic Structural Time Series Model

autostm (Automatic Structural Time Series Model) is designed to automatically detect the appropriate decomposition for a univariate time series into trend, seasonal, and cycle components using a state space approach. The package also has the functionality perform structural interpolation: i.e. interpolated a quarterly series into a monthly series using either end of period, period average, or period sum methods. The unobserved components are estimated with the Kalman filter and all model parameters are estimated using maximum likelihood. The Kalman filter and smoother are implemented in Rcpp so it is reasonably fast. This package is designed to handle time series of any frequency (standard or not), series with multiple seasonalities, seasonalities with fractional periodicities, missing data imputation, and irregularly spaced dates (i.e. daily data with missing data due to weekends and holidays, etc.).

With ease-of-use in mind, all the user has to do is provide a two column data frame: one column with dates and one column with the univariate time series. Everything else is handled by the algorithm including data frequency detection (observations per year), trend specification, cycle specification, seasonal specification, missing data imputation, anomaly detection, structural break detection, and whether or not to log transform the data. The user can manually set the preferred decomposition if desired, however. All components are assumed to be stochastic, which allows for time-varying trend, cycle, and seasonalities, unless the user specifies otherwise using the det_trend, det_drift, det_seas, det_cycle, and det_obs arguments in stsm_estimate. The algorithm can handle secondly, minutely, hourly, daily, monthly, and quarterly frequencies as well as daily data that is weekday only. It is also capable of handling non-standard frequencies (i.e. frequencies that are non of the above). To build an interpolation model, all the user has to do is set interpolate and interpolate_method. The user only needs to call stsm_estimate to fit the model and stsm_forecast to filter and forecast the data. Additionally, the user can call stsm_detect_anomalies and stsm_detect_breaks to perform anomaly and structural break detection respectively.

State Space Model

The model is based on the decomposition

\[ Y_t = T_t + C_t + S_t + B^o X^o_t + e_t \]

where \(Y_t\) is a univariate time series, \(T_t\) is the trend component, \(C_t\) is the cycle component, \(S_t\) is the seasonal component, \(X^o_t\) is optional exogenous data in the observation equation, and \(e_t \sim N(0, \sigma_e^2)\) is the observation error. The seasonal and cycle components are assumed to be of a trigonometric form, while the trend is assumed to be one of three types: a random walk, a random walk with an AR(1) drift, or a double random walk (random walk with a random walk drift).

The state space model is written as \[ Y_t = A + H \beta + B^o X^o_t + e_t, e_t \sim N(0, R)\\ \beta_t = D + F \beta_{t-1} + B^s X^s_t + u_t, u_t \sim N(0, Q) \]

where the first equation is the observation equation and the second equation is the transition equation. \(A\) is the observation intercept, \(H\) is the observation matrix, \(e_t \sim N(0, R)\) is the observation error, and \(R\) is the observation error-covariance matrix. \(\beta_t\) is the vector of the state components, \(D\) is the state intercept matrix, \(F\) is the transition matrix, \(X^s_t\) is optional exogenous data in the state equation, \(v_t \sim N(0, Q_{s_t})\) is the state error, and \(Q\) is the state error-covariance matrix.

Trend Models

If trend is “random-walk” the trend model is specified as the I(1) process \[ T_t = T_{t-1} + e_t, e_t \sim N(0, \sigma_t^2) \]

If trend is “random-walk-drift”, the trend model is specified as the I(1) process \[ T_t = D_{t-1} + T_{t-1} + e_t, e_t \sim N(0, \sigma_t^2) \\ D_t = d + \phi_d D_{t-1} + n_t, n_t \sim N(0, \sigma_d^2) \] where \(D_t\) is the drift.

If trend is “double-random-walk”, the trend model is specified as the I(2) process \[ T_t = D_{t-1} + T_{t-1} + e_t, e_t \sim N(0, \sigma_t^2) \\ D_t = D_{t-1} + n_t, n_t \sim N(0, \sigma_d^2) \]

Cycle Model

The cycle is modeled using either the trigonometric process

\[ \begin{bmatrix} c_t \\ c_t^{*} \end{bmatrix} = \phi_c \begin{bmatrix} cos(\lambda) & sin(\lambda) \\ -sin(\lambda) & cos(\lambda) \end{bmatrix} \begin{bmatrix} c_{t-1} \\ c_{t-1}^{*} \end{bmatrix} + \begin{bmatrix} u_t \\ u_t^{*} \end{bmatrix}, u_t, u_t^{*} \sim N(0, \sigma_c^2) \] where \(\lambda\) is the frequency of the cycle and \(\phi_c \in{(0,1)}\) to make the cycle stationary for forecasting purposes, or the ARMA(2,1) process

\[ \begin{bmatrix} c_t \\ c_{t-1} \\ \vdots \\ c_{t-p+2} \\ c_{t-p+1} \\ e_t \\ e_{t-1} \\ \vdots \\ e_{t-q+2} \\ e_{t-q+1} \end{bmatrix} = \begin{bmatrix} \phi_{c,1} & \phi_{c,2} & \ldots & \phi_{c,p-1} & \phi_{c,p} & \theta_{c,1} & \theta_{c,2} & \ldots & \theta_{c,q-1} & \theta_{c,q} \\ 1 & 0 & \ldots & 0 & 0 & \ldots & \ldots & \ldots & \ldots & 0 \\ 0 & \ddots & & \vdots & \vdots & \ddots & & \ddots & & \vdots \\ \vdots & & \ddots & \vdots & \vdots & & \ddots& & \ddots & \vdots \\ 0 & \ldots & \dots & 1 & 0 & \ldots & \ldots & \ldots & \dots & 0 \\ 0 & \ldots & \ldots & \ldots & 0 & \ldots & \ldots & \ldots & \ldots & 0 \\ \vdots & \ddots & & & \vdots & 1 & 0 & \ldots & 0 & \vdots \\ \vdots & & \ddots & & \vdots & 0 & \ddots & & \vdots & \vdots \\ \vdots & & & \ddots & \vdots & \vdots & & \ddots & \vdots & \vdots \\ 0 & \ldots & \ldots & \ldots & 0 & 0 & \ldots & \ldots & 1 & 0\\ \end{bmatrix} \begin{bmatrix} c_{t-1} \\ c_{t-2} \\ \vdots \\ c_{t-p+1} \\ c_{t-p} \\ e_{t-1} \\ e_{t-2} \\ \vdots \\ e_{t-q+1} \\e_{t-q} \end{bmatrix} + \begin{bmatrix} e_t \\ 0 \\ \vdots \\ \vdots \\ 0 \\ e_t \\ 0 \\ \vdots \\ \vdots \\0 \end{bmatrix}, e_t \sim N(0, \sigma_c^2) \] where \(p\) is the autoregressive lag length and \(q\) is the moving average lag length. The ARMA(p,q) specification can have real or complex roots, while the trigonometric specification will only have complex roots. The trigonometric specification is a special case of an ARMA(2,1) specification.

If the roots are complex, the dynamics will be damped oscillations about the trend, which may not always be desirable. If the user would like to remove oscillations from the forecast that are caused by the complex roots, the user can specify, dampen_cycle = TRUE in stsm_forecast. This option will smooth the forecast of the cycle into the trend using a fitted sigmoid curve.

Seasonal Model

Seasonality is modeled using a Fourier series, which is the sum of sin-cos waves with each wave representing a seasonal pattern. Total seasonality is \[ s_t = \sum_{j = 1}^{h} s_t^j \]

where each season \(s_t^j\) is modeled using the trigonometric specification below

\[ \begin{bmatrix} s_t^j \\ s_t^{j,*} \end{bmatrix} = \begin{bmatrix} cos(\lambda_j) & sin(\lambda_j) \\ -sin(\lambda_j) & cos(\lambda_j) \end{bmatrix} \begin{bmatrix} s_{t-1}^j \\ s_{t-1}^{j,*} \end{bmatrix} + \begin{bmatrix} w_t^j \\ w_t^{j,*} \end{bmatrix}, w_t^j \sim N(0, \sigma_j^2) \] where \(s_t^{j,*}\) exists by construction and is just an auxiliary variable, \(\lambda_j = \frac{2\pi j}{f}\) is the frequency of season \(j\), \(j\) represents the number of periods per year (i.e. \(j = 52\) represents weekly seasonality at 52 periods per year), and \(f\) is the frequency of the data (i.e. the number of observations per year). The seasons to include in the model are automatically detected in the algorithm using wavelet analysis. Each \(\lambda_j\) is predetermined from wavelet analysis and are not estimated parameters.

Model Specification

The automatic model specification algorithm follows the procedure below. Rather than using brute force selection by checking every possible model combination and selecting the model that minimizes a selection criteria like AIC, which can be computationally intensive and time consuming, the procedure uses a series a statistical tests to determine model specification. However, the user is free to write their own brute force model selection routine from the functions provided in this package as stsm_estimate returns a table with the model’s log likelihood, AIC, AICc, and BIC.

Frequency Detection

The first step is to detect the frequency of the data by examining the sequence of dates provided. If the day difference in subsequent dates is closest to

  • \(1\), then a frequency of 365.25 is used for daily data
  • \(7\), then a frequency of 52.17857 is used for weekly data
  • \(30\), then a frequency of 12 is used for monthly data
  • \(90\), then a frequency of 4 is used for quarterly data
  • \(365\), then a frequency of 1 used for yearly data

It can also detect sub-daily data if the day difference in subsequent dates is closest to - \(\frac{1}{24}\), then a frequency of 8760 is used for hourly data - \(\frac{1}{60 \cdot 24}\), then a frequency of 525600 is used for minutely data - \(\frac{1}{60 \cdot 60 \cdot 24}\), then a frequency of 31536000 is used for secondly data

The numeric frequency reflects the number of periods in one year. Once the frequency is detected, a sequence of continuous dates is constructed to ensure evenly spaced data. If an observation is missing for any of the dates in the constructed sequence, it is treated as a missing value. However, if the dates provided contain only weekdays and the data is of daily frequency for example, the frequency is changed to \(365.25 \cdot \frac{5}{7} = 260.8929\) to reflect the number of weekdays per year. The same concept is applied to sub-daily data with weekday observations only as well. If none of the above frequencies are detected, then the frequency is set to the number of rows in the data to allow the algorithm to search for seasonal patterns. A flag standard_freq will be set to FALSE in the final output.

Seasonal and Cycle Adjustment

Before any of the model specification tests can be performed, the data must be seasonally and cycle adjusted. This procedure is performed by decomposing the time series into a loess trend and a seasonal-cycle remainder. The loess function in the stats package requires that there are no missing values in the data, so any missing values are computed using the Kalman filter with a local trend using the stats package’s StructTS and KalmanSmooth functions. If you know the missing data pattern in your data (i.e missing data should be 0), it’s best to fill in the missing data using your knowledge before allowing this method to fill in the rest. A smoother cycle is also estimated from the seasonal-cycle component using a smoothing spline from the smooth.spline function of the stats package. Once the seasonal and cycle periods are known, the seasonal-cycle is more accurately decomposed into the seasonal and cycle components by first decomposing the seasonal-cycle using the mstl function from the forecast package then regressing the seasonal component on a Fourier series composed of the seasonal periods.

Seasonality Detection

The first step is wavelet analysis for seasonal detection, which is done using a series of harmonic regressions in parallel on the detrended data. The detrended data may also be rescaled by the trend if the Cox-Stuart deviation test suggests that the detrended data does not have a constant amplitude. The regression is performed individually for each seasonal harmonic using the equation

\[ \tilde{y_t} = a_j sin(2\pi\frac{j}{f}) + b_j cos(2\pi\frac{j}{f}) \]

for harmonics \(j\) that correspond to the spectrum of minutely, hourly, workday hours, half daily, daily, weekend, weekday, weekly, monthly, quarterly, semi-yearly, and yearly seasonal frequencies where applicable. However, since this limits the number of frequencies tested, a linear space of 100 items is created between each harmonic so that the space captures 1% of the distance between the original two harmonics. If a nonstandard frequency is detected, the harmonics are set to \(j = \lfloor \frac{f}{2} \rfloor\), but only taking 1000 equally spaced harmonics for harmonics that have at least 3 cycles in the data. This strategy limits the number of harmonics tested for speed, while also trying to capture some a priori important seasonalities based on the frequency of the data.

At each iteration, an F-test is performed using the waldtest function from the lmtest package to measure the significance of the harmonic in explaining the seasonality, understanding that this will not be robust to heteroskedasticity or autocorrelation. This accuracy is sacrificed for speed at this step. The frequencies (defined as \(\frac{f}{j}\)) are then matched to the spectrum of seasonalities that correspond to the a priori important seasonal spectrum if the frequency of the harmonic is within one delta of the frequency of this predefined spectrum. This delta is equal to the unique difference between the frequencies of the harmonics.

Lastly, backward stepwise regression procedure using an F-test for the significance of each seasonality (i.e. joint significance of \(a_j\) and \(b_j\) is used to build the final seasonality specification starting with only the harmonics that were statistically significant from the first step. At each stage of the of the stepwise procedure, an F-test for the statistical significance of each seasonality is performed and any harmonics that are not statistically significant are removed. The process is performed iteratively until all remaining harmonics are statistically significant. The F-tests for this procedure are calculated using HAC standard errors to be robust to heteroskedasticity and autocorrelation to reduce the chance of selecting spurious seasonalities.

Cycle Detection

Next, if cycle = NULL, wavelet analysis is performed to detect a trigonometric form for the cycle. If one is not detected, it is set to an ARMA(p,q) specification if the model prior for the cycle is determined to be stationary. If cycle = "trig", wavelet analysis is performed to detect a trigonometric form, but if none is detected, it will default to no cycle. If cycle = "arma", then wavelet analysis will not be performed and the cycle will be set the ARMA(p,q) form. In addition to setting cycle = "arma", the user can set the parameter arma = c(p = p, q = q) to specify a specific ARMA order, or if left as the default with p and q set to NA, then the order will be auto-selected.

Wavelet analysis for cycle detection is done using a series of harmonic regressions in parallel on the detrended data. The detrended data may also be rescaled by the trend if the Cox-Stuart deviation test suggests that the detrended data does not have a constant amplitude. The regression is performed individually for each harmonic using the equation

\[ \tilde{y_t} = a_j sin(2\pi\frac{j}{f}) + b_j cos(2\pi\frac{j}{f}) \]

for harmonics \(j = 0.01\) to \(j = 0.99\) for every 0.01 and truncated for harmonics that correspond to periods \(\in [ 2.5f, T]\) and harmonics < \(1/2\) where \(T\) is the length of the data.

At each iteration, an F-test is performed using the waldtest function from the lmtest package to measure the significance of the harmonic to explain the seasonality, understanding that this will not be robust to heteroskedasticity or autocorrelation. This accuracy is sacrificed for speed at this step. Once all harmonics have been tested, the algorithm looks for a global maximum from the power spectrum measured with the F-statistic that is statistically significant.

Lastly, a final regression is performed with only the harmonic that was selected from the first step. From this regression, the harmonic is kept only if the cycle is statistically significant (i.e. \(a_j\) and \(b_j\) are jointly significant) as measured by the F-test for the final harmonic regression. The F-test for this step is calculated using HAC standard errors to be robust to heteroskedasticity and autocorrelation to reduce the chance of selecting spurious seasonalities.

Multiplicative vs Additive Detection

The third step in the model construction is to determine if the model should be based on a multiplicative of additive model. However, after the seasonal and cycle steps are completed, a new model prior is created based on the seasonal and cycle periods detected. For a multiplicative model, the data must all be positive and satisfy one of two tests. The model is restricted to multiplicative of additive decomposition, rather than allowing the full range of Box-Cox transformations, so that the decomposition is easy to compare to the original time series. Additive makes the decomposition comparative in levels, multiplicative makes the decomposition comparative in percentages relative to the original time series, but Box-Cox transformations don’t have an easy interpretation like this. However, the user can Box-Cox transform the data prior to estimation and specify multilicative = FALSE if that method is preferred.

Trend Test

First, is a test of a linear vs exponential trend. This test compares the log likelihood of 1) a linear regression of the seasonal and cycle adjusted data vs a linear trend, and 2) a linear regression of the logged seasonal and cycle adjusted data vs a linear trend using the petest from the lmtest package with heteroskedastic and autocorrelated standard errors using the sandwich package’s vcovHAC function. A log trend model is selected if and only if the linear model is rejected and the log model is not rejected.

Seasonal-Cycle Amplitude Test

Next, is a test for constant seasonal and cycle amplitudes using the Cox-Stuart deviation test on the detrended (and de-seasonalized or de-cycled) data. Then, if the seasonal component is determined to have changing amplitude, a multiplicative model is used. However, since this test is sensitive to outliers, outliers in the seasonal and cycle series are detected and replaced using the forecast’s package tsclean function without replacing missing values before the test is performed.

Trend Detection

The final step is to determine the specification of the trend after adjusting for the cycle and seasonality detected from the prior steps. This mainly boils down to determining if the data is integrated of order 0, 1, or 2 (I(0), I(1), or I(2)). To determine the number of differences to make the data stationary, the ADF, PP, and KPSS tests are performed on the seasonal and cycle adjusted data using the forecast package’s ndiffs functions on the seasonal adjusted time series. However, since this test is sensitive to outliers, outliers in the seasonal and cycle adjusted series are detected and replaced using the forecast’s package tsclean function without replacing missing values before the test is performed.

The number of differences selected for the trend specification is the average number of differences across the tests, rounded to the nearest digit. Next, the Cox-Stuart trend test is preformed on the seasonal and cycle adjusted data to determine if the data is trending. If the data is trending, the order of integration is set to at least 1.

If the order of integration is set to 1 at this point, the algorithm will then check for structural breaks in the drift using the breakpoints function from the strucchange package as long as the minimum prior drift is below 0. If the average value of the prior drift changes sign (from positive to negative, or negative to positive) between break segments, then the order of integration is set to 2.

If the data is determined to be

  • I(2) then the “double-random-walk” specification is used
  • I(1) with a trend, then the “random-walk-drift” specification is used
  • I(1) without a trend, then the “random-walk” specification is used
  • I(0) and no trend, then the “random-walk” trend specification with a deterministic trend is used to model a constant mean with an autoregressive component

The Kalman Filter

To estimate the unobserved components (\(T_t, C_t, S_t\), and \(D_t\)), the Kalman filter is utilized. The Kalman filter is a forward two-stage procedure that is the optimal linear filter based on the multivariate normal distribution. The “forwardness” of the filter means that it uses only data up to time \(t\) to make an inference on the unobserved components at time \(t\) and does peak into the future to make that inference.

Prediction Stage

The first stage is the prediction stage, which makes predictions of the state components based on information up to time \(t-1\). This stage is made up of four equations, the first is the prediction of the state components based on its time series properties

\[ B_{t|t-1} = D + F \beta_{t-1|t-1} + B^s X^s_t \]

Second, is the prediction of the covariance matrix of the state components

\[ P_{t|t-1} = F P_{t-1|t-1} F^{\prime} + Q \]

Third, is the prediction error of the time series of interest

\[ \eta_{t|t-1} = Y_t - (A + H \beta_{t|t-1} + B^o X^o_t) \]

And finally, we have the variance of the prediction error

\[ f_{t|t-1} = H P_{t|t-1} H^{\prime} + R \]

Updating Stage

The second stage is the updating stage, which makes predictions based on all information up to time \(t\). It consists of three equations. The first equation is the prediction of the state components based on the the full information set

\[ \beta_{t|t} = B_{t|t-1} + K_t \eta_{t|t-1} \] where \(K_t\) is the Kalman gain, which determines the optimal weight to give new information in making predictions about \(\beta_t\) and is the second equation

\[ K_t = P_{t|t-1} H^{\prime} f_{t|t-1}^{-1} \] The last equation is the updating of the covariance matrix of the unobserved components

\[ P_{t|t} = P_{t|t-1} - K_t H P_{t|t-1} \] The seven equations above, make up the full Kalman filter routine. If \(Y_t\) is missing for any observation, then

\[ B_{t|t} = B_{t|t-1} \\ P_{t|t} = P_{t|t-1} \\ K_t = 0 \\ f_{t|t-1} = \infty \]

Kalman Smoothing

Once the Kalman filter is applied to the data, a smoothing procedure can be applied in the backward direction to make a better inference of the state components based on the entire data set. Unlike the filter, the smoother does peak into the future to make an inference of the state components at time \(t\). This procedure consists of only two equations.

The first equation updates the prediction of the state components based on all the available information

\[ \beta_{t|T} = \beta_{t|t} + P_{t|t} F^{\prime} P_{t|t}^{-1} (\beta_{t+1|T} - \beta_{t+1|t}) \]

The second equation updates the covariance matrix of the state components based on all the available information

\[ P_{t|T} = P_{t|t} + P_{t|t} F^{\prime} P_{t+1|t}^{-1} (P_{t+1|T} - P_{t+1|t}) P_{t+1|t}^{-1 \prime} F P_{t|t}^{\prime} \]

Maximum Likelihood Estimation

The model above is estimated using MLE with box constraints. The log likelihood is given by

\[ ln(L(\theta)) = -\frac{1}{2} \sum_{t=1}^T \ln(|f_{t|t-1}|) - \frac{1}{2}\sum_{t=1}^T \eta_{t|t-1}^{\prime} f_{t|t-1}^{-1} \eta_{t|t-1} \]

Initial Parameter Values

Model Prior

Initial parameter values are essential for finding a good model when using maximum likelihood estimation. To this end, the algorithm uses initial parameter values derived from a decomposition using the mstl function, which acts like the model prior and is represented by

\[ Y_t^* = T_t^* + C_t^* + S_t^* + e_t^* \]

This function will decompose the time series into trend and seasonality. The seasonality is then further split into the seasonal harmonics by fitted a linear regression on the seasonal Fourier series and extracting the fitted components. Finally, the trend is further split into a smoother trend and the cycle by fitting a loess trend to the mstl trend using the loess function. The drift is then derived by taking the first difference of the trend.

Trend and Drift Initial Values

These parameters are set initial to match to the theoretical variance of the model depending on the trend type. If the trend is set to “random-walk” then \(\sigma_t = sd(\Delta T_t^{*})\). If the trend is set to “double-random-walk” then \(\sigma_t = \sigma_d = sd(\Delta T_t^{*})/sqrt(2)\). If the trend is set to “random-walk-drift”, an AR(1) model is fit the drift prior, \(\sigma_d\) is set to the standard deviation of the model errors, \(\phi_d\) is the set to the estimated AR(1) parameter, and \(\sigma_t\) is set to \(\sqrt{var(\Delta T_t^*) - \frac{\sigma_d^2}{1 - \phi_d^2}}\) the theoretical variance if the trend is I(1).

Cycle Initial Values

If the cycle model is set to the ARMA(p,q) specification or a cycle is not detected, \(p\) and \(q\) are determined using the forecast package’s auto.arima function on the prior cycle. The estimated AR and MA components are then used as the initial parameters. If the cycle is model is set to the trigonometric specification, \(\lambda\) is set to be \(\frac{2\pi}{c}\) where \(c\) is the estimated cycle period. \(\phi_c\) is set to the maximum eigenvalue of an ARMA(2,0,1) process fit to the cycle prior as this is related to the speed of convergence towards the trend as \(\phi_c\) is in an AR(1) process. Finally, \(\sigma_c\) is set to the standard deviation of the model errors.

Seasonality Initial Values

If seasonality is included, each \(\sigma_j\) is set to \(sd(\Delta S_t^*)/sqrt(h)\) where h is the number of seasonal harmonics so \(var(\Delta S_t^*) = \sum_j \sigma_j^2\)

Remainder

Lastly, \(\sigma_e\) is set the the standard deviation of the remainder of the model prior, \(sd(e_t^*)\).

Unobserved Component Initial Values

The initial values of the unobserved components must also be initialized. These values are not optimized but set in advance to aid in speed. The only value set to optimize is one value for the diagonal of the covariance matrix \(P_{0|0}\), which is set to the identity matrix. The initial values for the unobserved components are taken to be the first non-NA value for each series from the prior.

Box Constraints

The box constraints act like priors and provide bounds on parameter estimates to ensure a reasonable model. The constraints include stationarity constraints for the trigonometric cycle, \(0 < \phi_c < 1\), autoregressive cycle \(0 < \sum \phi_{c,i} < 1\), moving average component \(0 < \sum \theta_{c,i} < 1\), drift \(-1 < \phi_d < 1\), as well as \(\sigma_x > 0\) for all variance parameters. Trend smoothness constraints

\[ \sigma_t + \sigma_d < \sigma_e \\ \sigma_t + \sigma_d < \sigma_c \\ \sigma_t + \sigma_d < \sum_j \sigma_j \]

are also included to ensure that the trend accounts for the least variability among all the components. It can be relaxed by setting unconstrained = TRUE in stsm_estimate.

Interpolation Models

If an interpolation model is specified, the model is redefined based on the values set in interpolate and interpolate_method. For all interpolation models,

\[ I_t = T_{t-1} + S_{t-1} + C_{t-1} \] where \(I_t\) is the interpolated series and a deterministic observation equation is used by setting \(\sigma_e = 0\). The only other that changes is the specification of the observation equation, which is determined by the interpolation method

  • End of period, then \(Y_t = I_t\)
  • Period average, then \(Y_t = \frac{1}{N}(I_t + \ldots + I_{t-N+1})\)
  • Period sum, then \(Y_t = I_t + \ldots + I_{t-N+1}\)

where \(N\) is the period to interpolate over. For example, if the frequency of the data is quarterly and the user would like to interpolate the data to a monthly frequency, then \(N = 3\).

The interpolation frequency is set by specifying interpolate to “quarterly”, “monthly”, “weekly”, or “daily” as long as that frequency is higher than the frequency of the data. Although, it is recommended to only interpolate one frequency higher than the frequency of the data.

Interpolation methods are defined by setting interpolate_method to “eop” for end of period, “avg” for period average, and “sum” for period sum.

Interpolation errors will depend on the volatility of the time series as well as model misspecification. However, given the model, the end of period method will be more susceptible to errors due to volatility than either the average or sum methods.

Anomaly Detection

Anomaly detection uses the estimated structural model and the recursive nature of the Kalman filter to detect observations that are outside a given threshold of the model’s predicted values for time \(t\) given the data and estimates of the unobserved components at time \(t-1\). The threshold is determined by the forecast error variance of the estimated structural model. Anomalies that are detected can then be replaced with the modeled predicted values or threshold if the user wants to replace outliers.

It is important to note that anomalies are detected by comparison to the model, which is based on normally distributed innovations. Thus, the interpretation of an anomaly here is a value that has an “x” percent chance of occurring based on the data being generated from normal innovations, where the “x” percent is determined by the threshold. Normal distributions may not be appropriate for all time series, but this method of anomaly detection could be used to tell the user about the validity of the normality assumption: i.e. if more than 1% of the sample is detected as an anomaly with a 99% threshold, then normality may not be the best assumption for innovations for that particular series.

stsm_detect_anomalies will only plot if anomalies are detected and plot = TRUE.

The model makes recursive predictions for the j-th step ahead using

\[ \hat{Y_{t+j}} = H F^j \beta_{t} \] since \(A = 0\) and \(D = 0\) for this model. The j-step ahead forecast error variance is given by

\[ FEV = \left(\sum_{i=0}^{j-1} H F^i Q F^{i\prime} H^{\prime}\right) + R \] The error variance is based on the assumption of linearity and normality, as the Kalman filter is a linear Gaussian filter, so the confidence intervals should be understood to be a lower bound as the decomposition is a simplification of an arguable more complex process. The uncertainty lost in that abstraction to linearity is not quantifiable, and uncertainty will be lost in the simplification from a stochastic process with excess kurtosis or skewness to normality.

Structural Break Detection

Structural breaks are detected using the breakpoints function from the strucchange package. It attempts to detect structural breaks in the trend, cycle, and seasonalities based on the estimated structural model. These tests can show how well the stochastic nature of the model picks up on the time varying nature of the time series.

If the data has an upward (or downward trend), the algorithm will look for breaks in trend growth. Otherwise, it will look for breaks in the level of the trend. If the growth (or level) is not constant across the entire time series, then a break should be detected.

Structural breaks in the seasonality and cycle are detected by looking for changes in the mean absolute value of the seasonal series (an analog for the amplitude). To check from breaks in the amplitude, a simple test on the mean absolute value is performed as this is related to the amplitude via the relationship \(Avg(|x_t|) = \frac{2}{\pi}Amp\). If the amplitude is not stable across the entire series, then a break should be detected.

Seasonality is defined by a constant periodicity, but economic cycles are rarely constant in their periodicity. So, a second structural break test is used on the cycle to check for changes in the periodicity. To check for breaks in the periodicity, the cycle is modeled using the trigonometric structure described above (\(C_t = \phi_c cos(\lambda_t) C_{t-1} + \phi_c sin(\lambda_t) C_{t-1}^{*} = a C_{t-1} + b C_{t-1}^{*}\)) and looking for instabilities in the coefficients over the time sample. The periodicity is identified from the linear regression using the relationship \(period = \frac{2\pi}{tan^{-1}(\frac{b}{a})}\). If the coefficients of this model are not stable across the entire series, then a break should be detected.

This procedure can be time consuming, even with high performance parallel computing, so if the user is only interested in structural breaks for a particular component, the user can specify the components arguments of stsm_detect_breaks to be any combination of “trend”, “cycle”, and “seasonal”. stsm_detect_breaks will only plot the breaks that are detected if plot = TRUE.

Example Usage

Below is an example of how to use this package:

Simulated Data Decomposition

library(ggplot2)
library(gridExtra)
library(autostsm)
library(lubridate)
library(data.table)

set.seed(1024)

#Daily data
freq = 365.25

#Build the trend and drift
t = c()
m = c()
t[1] = 100
m[1] = 1
sig_e = 0.1
sig_t = 1
sig_m = 0.1
sig_s = 0.01
for(i in 2:3000){
  m[i] = 0.05 + 0.75*m[i-1] + rnorm(1, 0, sig_m)
  t[i] = t[i-1] + m[i-1] + rnorm(1, 0, sig_t)
}

#Build the seasonality including yearly and weekly 
s365 = sin(2*pi/freq*(1:length(t))) + rnorm(length(t), 0, sig_s)
s365 = (s365 - min(s365))/diff(range(s365))*(1.125 - 0.865) + 0.865
s7 = sin(2*pi/7*(1:length(t))) + rnorm(length(t), 0, sig_s)
s7 = (s7 - min(s7))/diff(range(s7))*(1.125 - 0.865) + 0.865
s = s365 + s7
s = (s - min(s))/diff(range(s))*(1.25 - 0.75) + 0.75

#Build the cyclicality using every 3 years periodicity
c = sin(2*pi*0.33/freq*(1:length(t))) + rnorm(length(t), 0, sig_s)
c = (c - min(c))/diff(range(c))*(1.25 - 0.75) + 0.75

#Build the data using a multiplicative model
ts = data.table(date = as.Date("2016-01-01") + 1:length(t), 
                y = t*c*s*exp(rnorm(length(t), 0, sig_e)), 
                trend = t, seasonal = s, seasonal7 = s7, 
                seasonal365 = s365, cycle = c)

#Create some missing values
ts[sample(1:nrow(ts), round(0.05*nrow(ts))), "y" := NA]

#View the data
g1 = ggplot(melt(ts, id.vars = "date", measure.vars = c("y", "trend"))) + 
  labs(title = "Observed vs Trend") + 
  geom_line(aes(x = date, y = value, group = variable, color = variable)) + 
  scale_color_viridis_d() + 
  theme_minimal() + guides(color = guide_legend(title = NULL)) +
  theme(legend.position = "bottom")
g2 = ggplot(melt(ts, id.vars = "date", measure.vars = c("cycle"))) + 
  labs(title = "Cycle") + 
  geom_line(aes(x = date, y = value, group = variable, color = variable)) + 
  scale_color_viridis_d() + 
  theme_minimal() + guides(color = guide_legend(title = NULL)) +
  theme(legend.position = "bottom")
g3 = ggplot(melt(ts, id.vars = "date", measure.vars = colnames(ts)[grepl("seasonal", colnames(ts))])) + 
  labs(title = "Seasonal") +
  geom_line(aes(x = date, y = value, group = variable, color = variable)) + 
  scale_color_viridis_d() + 
  theme_minimal() + guides(color = guide_legend(title = NULL)) +
  theme(legend.position = "bottom")
grid.arrange(g1, g2, g3, layout_matrix = rbind(c(1, 1), c(2, 3)))

#Estimate the model
stsm = stsm_estimate(ts[, c("date", "y"), with = FALSE], verbose = TRUE)

#Forecast and plot the results
stsm_fc = stsm_forecast(stsm, y = ts[, c("date", "y"), with = FALSE], n.ahead = floor(stsm$freq)*3, plot = TRUE)

#Detect anomalies
stsm_fc = merge(stsm_fc, 
                stsm_detect_anomalies(stsm, y = ts[, c("date", "y"), with = FALSE], plot = TRUE), 
                by = "date", all = TRUE)

#Detect structural breaks
stsm_fc = merge(stsm_fc, 
                stsm_detect_breaks(stsm, y = ts[, c("date", "y"), with = FALSE], plot = TRUE, show_progress = TRUE), 
                by = "date", all = TRUE)

Using Exogenous Data

#Define the exogenous data
exo_obs = data.table(date = ts$date, Xo = rnorm(nrow(ts)))
exo_state = data.table(date = ts$date, Xs_trend = rnorm(nrow(ts)), 
                       Xs_seasonal = rep(0, nrow(ts)))
exo_state[seq(1, .N, by = 7), "Xs_seasonal" := 1]

#Estimate the model
stsm = stsm_estimate(ts[, c("date", "y"), with = FALSE], 
                     exo_obs = exo_obs, exo_state = exo_state, 
                     state_eqns = c("trend ~ Xs_trend - 1", "seasonal ~ Xs_seasonal - 1"), 
                     verbose = TRUE)

#Check the exogenous parameter estimates
stsm_ssm(model = stsm)[c("betaO", "betaS")]

#Filter and plot the results
stsm_fc = stsm_filter(stsm, y = ts[, c("date", "y"), with = FALSE], 
                      exo_obs = exo_obs, exo_state = exo_state, plot = TRUE)

#Detect anomalies
stsm_fc = merge(stsm_fc, 
                stsm_detect_anomalies(stsm, y = ts[, c("date", "y"), with = FALSE], 
                                       exo_obs = exo_obs, exo_state = exo_state, plot = TRUE), 
                by = "date", all = TRUE)

#Detect structural breaks
stsm_fc = merge(stsm_fc, 
                stsm_detect_breaks(stsm, y = ts[, c("date", "y"), with = FALSE], 
                                   exo_obs = exo_obs, exo_state = exo_state, 
                                   plot = TRUE, show_progress = TRUE), 
                by = "date", all = TRUE)

Real Data Decompositions

##### Unemployment rate examples #####
#Not seasonally adjusted
data("UNRATENSA", package = "autostsm") #From FRED
UNRATENSA = data.table(UNRATENSA, keep.rownames = TRUE)
colnames(UNRATENSA) = c("date", "y")
UNRATENSA[, "date" := as.Date(date)]
UNRATENSA[, "y" := as.numeric(y)]
stsm = stsm_estimate(UNRATENSA, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = UNRATENSA, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_anomalies(stsm, y = UNRATENSA, plot = TRUE), 
                by = "date", all = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_breaks(stsm, y = UNRATENSA, plot = TRUE, show_progress = TRUE), 
                by = "date", all = TRUE)

#Seasonally adjusted
data("UNRATE", package = "autostsm") #From FRED
UNRATE = data.table(UNRATE, keep.rownames = TRUE)
colnames(UNRATE) = c("date", "y")
UNRATE[, "date" := as.Date(date)]
UNRATE[, "y" := as.numeric(y)]
stsm = stsm_estimate(UNRATE, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = UNRATE, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_anomalies(stsm, y = UNRATE, plot = TRUE), 
                by = "date", all = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_breaks(stsm, y = UNRATE, plot = TRUE, show_progress = TRUE), 
                by = "date", all = TRUE)

##### GDP examples #####
#Not seasonally adjusted
data("NA000334Q", package = "autostsm") #From FRED
NA000334Q = data.table(NA000334Q, keep.rownames = TRUE)
colnames(NA000334Q) = c("date", "y")
NA000334Q[, "date" := as.Date(date)]
NA000334Q[, "y" := as.numeric(y)]
NA000334Q = NA000334Q[date >= "1990-01-01", ]
stsm = stsm_estimate(NA000334Q, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = NA000334Q, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_anomalies(stsm, y = NA000334Q, plot = TRUE), 
                by = "date", all = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_breaks(stsm, y = NA000334Q, plot = TRUE, show_progress = TRUE), 
                by = "date", all = TRUE)

#Seasonally adjusted
data("GDP", package = "autostsm") #From FRED
GDP = data.table(GDP, keep.rownames = TRUE)
colnames(GDP) = c("date", "y")
GDP[, "date" := as.Date(date)]
GDP[, "y" := as.numeric(y)]
GDP = GDP[date >= "1990-01-01", ]
stsm = stsm_estimate(GDP, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = GDP, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_anomalies(stsm, y = GDP, plot = TRUE), 
                by = "date", all = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_breaks(stsm, y = GDP, plot = TRUE, show_progress = TRUE), 
                by = "date", all = TRUE)

##### S&P 500 example #####
data("SP500", package = "autostsm") #From FRED
SP500 = data.table(SP500, keep.rownames = TRUE)
colnames(SP500) = c("date", "y")
SP500[, "date" := as.Date(date)]
SP500[, "y" := as.numeric(y)]
stsm = stsm_estimate(SP500, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = SP500, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_anomalies(stsm, y = SP500, plot = TRUE), 
                by = "date", all = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_breaks(stsm, y = SP500, plot = TRUE, show_progress = TRUE), 
                by = "date", all = TRUE)

##### 5 Year Treasury Yield example #####
data("DGS5", package = "autostsm") #From FRED
DGS5 = data.table(DGS5, keep.rownames = TRUE)
colnames(DGS5) = c("date", "y")
DGS5[, "date" := as.Date(date)]
DGS5[, "y" := as.numeric(y)]
stsm = stsm_estimate(DGS5, verbose = TRUE)
stsm_fc = stsm_forecast(stsm, y = DGS5, n.ahead = floor(stsm$freq)*3, plot = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_anomalies(stsm, y = DGS5, plot = TRUE), 
                by = "date", all = TRUE)
stsm_fc = merge(stsm_fc, 
                stsm_detect_breaks(stsm, y = DGS5, plot = TRUE, show_progress = TRUE), 
                by = "date", all = TRUE)

Simulated Data Interpolation

library(ggplot2)
library(autostsm)
library(lubridate)
library(data.table)

#Rebuild the data
set.seed(1024)

#Daily data
freq = 365.25

#Build the trend and drift
t = c()
m = c()
t[1] = 100
m[1] = 1
sig_e = 0.1
sig_t = 1
sig_m = 0.1
sig_s = 0.01
for(i in 2:3000){
  m[i] = 0.05 + 0.75*m[i-1] + rnorm(1, 0, sig_m)
  t[i] = t[i-1] + m[i-1] + rnorm(1, 0, sig_t)
}

#Build the seasonality including yearly and weekly 
s365 = sin(2*pi/freq*(1:length(t))) + rnorm(length(t), 0, sig_s)
s365 = (s365 - min(s365))/diff(range(s365))*(1.125 - 0.865) + 0.865
s7 = sin(2*pi/7*(1:length(t))) + rnorm(length(t), 0, sig_s)
s7 = (s7 - min(s7))/diff(range(s7))*(1.125 - 0.865) + 0.865
s = s365 + s7
s = (s - min(s))/diff(range(s))*(1.25 - 0.75) + 0.75

#Build the cyclicality using every 3 years periodicity
c = sin(2*pi*0.33/freq*(1:length(t))) + rnorm(length(t), 0, sig_s)
c = (c - min(c))/diff(range(c))*(1.25 - 0.75) + 0.75

#Build the data using a multiplicative model
ts = data.table(date = as.Date("2016-01-01") + 1:length(t), 
                y = t*c*s*exp(rnorm(length(t), 0, sig_e)))

#Convert the data to monthly by using the end of period
ts[, "mnth" := floor_date(date, "month")]
ts[, "rn" := .N:1, by = "mnth"]
ts = ts[rn == 1, c("mnth", "y"), with = FALSE]
colnames(ts)[colnames(ts) == "mnth"] = "date"

#Get the quarter and set the date to be the end of the quarter
ts[, "qtr" := ceiling_date(date, "quarter") %m-% months(1)]
ts[, "y_avg" := frollmean(y, n = 3, align = "right")]
ts[, "y_sum" := frollsum(y, n = 3, align = "right")]

#We already know the data has a multiplicative structure with yearly seasonality and 
#the frequency below will be set to quarterly, 
#so we will set seasons = 4 and multiplicative to TRUE to help the model with identification
#since the data set goes from 3000 daily observations to 99 monthly observations and then to
#33 quarterly observations

#End of period
ts[, "rn" := .N:1, by = "qtr"]
ts.eop = ts[rn == 1, ][, "rn" := NULL]
stsm = stsm_estimate(y = ts.eop[, c("qtr", "y"), with = FALSE], seasons = 4, multiplicative = TRUE,
                     interpolate = "monthly", interpolate_method = "eop", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = ts.eop[, c("qtr", "y"), with = FALSE], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")], 
                ts[, c("date", "y")], by = "date", all.x = TRUE, all.y = TRUE)
eop.err = cbind(method = "eop", stsm_fc[is.na(observed), .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) + 
  geom_line(aes(x = date, y = value, group = variable, color = variable)) + 
  scale_color_viridis_d() + 
  theme_minimal() + guides(color = guide_legend(title = NULL)) +
  theme(legend.position = "bottom")

#Period average
ts.avg = ts[, .(y = mean(y, na.rm = TRUE)), by = "qtr"]
stsm = stsm_estimate(y = ts.avg[, c("qtr", "y"), with = FALSE], seasons = 4, multiplicative = TRUE,
                     interpolate = "monthly", interpolate_method = "avg", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = ts.avg[, c("qtr", "y"), with = FALSE], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")], 
                ts[, c("date", "y", "y_avg")], by = "date", all.x = TRUE, all.y = TRUE)
if(stsm$multiplicative == TRUE){
  stsm_fc[, "interpolated_rollup" := exp(frollmean(log(interpolated), n = 3, align = "right"))]
}else{
  stsm_fc[, "interpolated_rollup" := frollmean(interpolated, n = 3, align = "right")]
}
avg.err = cbind(method = "avg", 
                stsm_fc[, .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))], 
                stsm_fc[is.na(observed), .(mape_rollup = mean(abs(y_avg - interpolated_rollup)/(1 + y_avg)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) + 
  geom_line(aes(x = date, y = value, group = variable, color = variable)) + 
  scale_color_viridis_d() + 
  theme_minimal() + guides(color = guide_legend(title = NULL)) +
  theme(legend.position = "bottom")

#Period sum
ts.sum = ts[, .(y = sum(y, na.rm = TRUE)), by = "qtr"]
stsm = stsm_estimate(y = ts.sum[, c("qtr", "y"), with = FALSE], seasons = 4, multiplicative = FALSE,
                     interpolate = "monthly", interpolate_method = "sum", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = ts.sum[, c("qtr", "y"), with = FALSE], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")], 
                ts[, c("date", "y", "y_sum")], by = "date", all.x = TRUE, all.y = TRUE)
if(stsm$multiplicative == TRUE){
  stsm_fc[, "interpolated_rollup" := exp(frollsum(log(interpolated), n = 3, align = "right"))]
}else{
  stsm_fc[, "interpolated_rollup" := frollsum(interpolated, n = 3, align = "right")]
}
sum.err = cbind(method = "sum", 
                stsm_fc[, .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))], 
                stsm_fc[is.na(observed), .(mape_rollup = mean(abs(y_sum - interpolated_rollup)/(1 + y_sum)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) + 
  geom_line(aes(x = date, y = value, group = variable, color = variable)) + 
  scale_color_viridis_d() + 
  theme_minimal() + guides(color = guide_legend(title = NULL)) +
  theme(legend.position = "bottom")

#Errors depend on the volatility of the series
#the end of period method is more susceptible to the volatility of the series than either
#the average or sum methods
rbind(eop.err, avg.err, sum.err, use.names = TRUE, fill = TRUE)

Real Data Interpolation

library(ggplot2)
library(autostsm)
library(lubridate)
library(data.table)

##### Unemployment rate examples #####
#Not seasonally adjusted
data("UNRATENSA", package = "autostsm") #From FRED
UNRATENSA = data.table(UNRATENSA, keep.rownames = TRUE)
colnames(UNRATENSA) = c("date", "y")
UNRATENSA[, "date" := as.Date(date)]
UNRATENSA[, "y" := as.numeric(y)]
UNRATENSA[, "qtr" := ceiling_date(date, "quarter") %m-% months(1)]
UNRATENSA[, "y_avg" := frollmean(y, n = 3, align = "right")]

#Cycle taken from previous example
stsm = stsm_estimate(y = UNRATENSA[, .(y = mean(y, na.rm = TRUE)), by = "qtr"], 
                     seasons = 4, cycle = 156/3,
                     interpolate = "monthly", interpolate_method = "avg", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = UNRATENSA[, .(y = mean(y, na.rm = TRUE)), by = "qtr"], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")], 
                UNRATENSA[, c("date", "y", "y_avg")], by = "date", all.x = TRUE, all.y = TRUE)
if(stsm$multiplicative == TRUE){
  stsm_fc[, "interpolated_rollup" := exp(frollmean(log(interpolated), n = 3, align = "right"))]
}else{
  stsm_fc[, "interpolated_rollup" := frollmean(interpolated, n = 3, align = "right")]
}
cbind(stsm_fc[, .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))], 
      stsm_fc[is.na(observed), .(mape_rollup = mean(abs(y_avg - interpolated_rollup)/(1 + y_avg)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) + 
  geom_line(aes(x = date, y = value, group = variable, color = variable)) + 
  scale_color_viridis_d() + 
  theme_minimal() + guides(color = guide_legend(title = NULL)) +
  theme(legend.position = "bottom")

#Seasonally adjusted
data("UNRATE", package = "autostsm") #From FRED
UNRATE = data.table(UNRATE, keep.rownames = TRUE)
colnames(UNRATE) = c("date", "y")
UNRATE[, "date" := as.Date(date)]
UNRATE[, "y" := as.numeric(y)]
UNRATE[, "qtr" := ceiling_date(date, "quarter") %m-% months(1)]
UNRATE[, "y_avg" := frollmean(y, n = 3, align = "right")]

#Cycle taken from previous example
stsm = stsm_estimate(y = UNRATE[, .(y = mean(y, na.rm = TRUE)), by = "qtr"], 
                     seasons = FALSE, cycle = 156/3,
                     interpolate = "monthly", interpolate_method = "avg", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = UNRATE[, .(y = mean(y, na.rm = TRUE)), by = "qtr"], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")], 
                UNRATE[, c("date", "y", "y_avg")], by = "date", all.x = TRUE, all.y = TRUE)
if(stsm$multiplicative == TRUE){
  stsm_fc[, "interpolated_rollup" := exp(frollmean(log(interpolated), n = 3, align = "right"))]
}else{
  stsm_fc[, "interpolated_rollup" := frollmean(interpolated, n = 3, align = "right")]
}
cbind(stsm_fc[, .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))], 
      stsm_fc[is.na(observed), .(mape_rollup = mean(abs(y_avg - interpolated_rollup)/(1 + y_avg)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) + 
  geom_line(aes(x = date, y = value, group = variable, color = variable)) + 
  scale_color_viridis_d() + 
  theme_minimal() + guides(color = guide_legend(title = NULL)) +
  theme(legend.position = "bottom")

#5 Year Treasury Yield
data("DGS5", package = "autostsm") #From FRED
DGS5 = data.table(DGS5, keep.rownames = TRUE)
colnames(DGS5) = c("date", "y")
DGS5[, "date" := as.Date(date)]
DGS5[, "y" := as.numeric(y)]
DGS5[, "qtr" := ceiling_date(date, "quarter") %m-% months(1)]
DGS5[, "y_avg" := frollmean(y, n = 3, align = "right")]

#Cycle taken from previous example
stsm = stsm_estimate(y = DGS5[, .(y = mean(y, na.rm = TRUE)), by = "qtr"], 
                     seasons = FALSE, cycle = 112/3,
                     interpolate = "monthly", interpolate_method = "avg", verbose = TRUE)
stsm_fc = stsm_filter(stsm, y = DGS5[, .(y = mean(y, na.rm = TRUE)), by = "qtr"], plot = TRUE)
stsm_fc = merge(stsm_fc[, c("date", "observed", "interpolated")], 
                DGS5[, c("date", "y", "y_avg")], by = "date", all.x = TRUE, all.y = TRUE)
if(stsm$multiplicative == TRUE){
  stsm_fc[, "interpolated_rollup" := exp(frollmean(log(interpolated), n = 3, align = "right"))]
}else{
  stsm_fc[, "interpolated_rollup" := frollmean(interpolated, n = 3, align = "right")]
}
cbind(stsm_fc[, .(mape = mean(abs(y - interpolated)/(1 + y)*100, na.rm = TRUE))], 
      stsm_fc[is.na(observed), .(mape_rollup = mean(abs(y_avg - interpolated_rollup)/(1 + y_avg)*100, na.rm = TRUE))])
ggplot(melt(stsm_fc, id.vars = "date", measure.vars = c("y", "interpolated"))[!is.na(value), ]) + 
  geom_line(aes(x = date, y = value, group = variable, color = variable)) + 
  scale_color_viridis_d() + 
  theme_minimal() + guides(color = guide_legend(title = NULL)) +
  theme(legend.position = "bottom")