**pspatreg** is a package that fits penalized spline
(PS) semiparametric static and dynamic spatial autoregressive models via
Restricted (or Residual) Maximum Likelihood (REML) and Maximum
Likelihood (ML). This approach combines penalized regression spline
methods (Paul HC Eilers, Marx, and Durbán
2015) with standard spatial autoregressive models (such as SAR,
SEM and SDM). These types of models are thoroughly discussed in Mı́nguez, Basile, and Durbán (2020); see also
Montero, Mı́nguez, and Durbán (2012), Basile et al. (2014), and Hoshino (2018).

These models are very flexible since they make it possible to include
within the same specification: *i*) spatial autoregressive terms
(i.e. spatial lags of dependent and independent variables as well as
spatial error terms) to capture spatial interaction or network effects;
*ii*) time lags of the dependent variable to capture persistence
effects; *iii*) parametric and nonparametric (smooth) terms to
identify nonlinear relationships between the response variable and the
covariates; *iv*) a spatio-temporal trend, i.e. a smooth
interaction between the spatial coordinates and the time trend, to
capture site-specific nonlinear time trends.

The proposed method also allows the user to apply an ANOVA decomposition of the spatio-temporal trend into several components (spatial and temporal main effects, and second- and third-order interactions between them), which gives further insights into the dynamics of the data. Thus, we use the acronym PS-ANOVA-SAR (SEM, SDM, SLX) for the new data generating process (DGP) proposed. The use of nested B-spline bases for the interaction components of the spatio-temporal trend contributes to the efficiency of the fitting procedure without compromising the goodness of fit of the model. Finally, we also consider an extension of the PS-ANOVA-SAR (SEM, SDM, SLX) including the time lag of the dependent variable (dynamic spatial model) and/or a first-order time series autoregressive term process (AR1) in the noise to accommodate residual serial correlation.

In a very general form, the semiparametric spatial autoregressive model reads as:

\[ y_{it}=\alpha y_{it-1}+\rho \sum_{j=1}^N w_{ij,N} y_{jt} + \pi \sum_{j=1}^N w_{ij,N} y_{jt-1}+ \\ \sum_{k=1}^K \beta^*_k x^*_{k,it} + \sum_{k=1}^K \gamma^*_k \sum_{j=1}^N w_{ij,N} x^*_{k,jt}+ \sum_{\delta=1}^\Delta g_\delta(x_{\delta, it}) + \sum_{\delta=1}^\Delta h_\delta\left(\sum_{j=1}^N w_{ij,N} x_{\delta,jt}\right) + \\ \widetilde{ f}(s_{1i},s_{2i},\tau_t)+\epsilon_{it} \]

\[\epsilon_{it}=\theta \sum_{j=1}^N w_{ij,N}\epsilon_{jt}+\phi \epsilon_{it-1}+u_{it}\] \[u_{it} \sim i.i.d.(0,\sigma^2_u)\]

where \((y_{it},x^*_{1,it},...,x^*_{K,it},x_{1,it},...,,x_{\Delta,it})\) are data observed for a sample of spatial units \(i\) (\(i=1,...,N\)) in each time period \(t=1,...,T\). The terms \((s_{1i},s_{2i})\) indicate the spatial coordinates (latitude and longitude) of the spatial unit \(i\) (either a spatial point or the centroid of a spatial polygon), and \(W_{ij}\) is an element of a row-standardized spatial weights matrix. The terms included into the model are:

\(y_{it-1}\) = the time lag of \(y_{it}\);

\(\sum_{j=1}^N w_{ij,N} y_{it}\) = the contemporaneous spatial lag of \(y_{it}\);

\(\sum_{j=1}^N w_{ij,N} y_{it-1}\) = the time lag of the spatial lag of \(y_{it}\);

\(\alpha\), \(\rho\), \(\pi\), \(\beta^*_k\), and \(\gamma^*_k\) = fixed parameters;

\(\sum_{k=1}^K \beta^*_k x^*_{k,it}\) and \(\sum_{k=1}^K \gamma^*_k \sum_{j=1}^N w_{ij,N} x^*_{k,jt}\) = parametric linear terms of some covariates \(x_{k,it}\) and of their spatial lags;

\(g_\delta(.)\) and \(h_\gamma(.)\) = nonparametric smooth functions of other covariates and of their spatial lags (they can also accommodate varying coefficient terms, smooth interaction between covariates, factor-by-curve intercations, and so on);

\(\widetilde{ f}(s_{1i},s_{2i},\tau_t)\) = an unknown nonparametric spatio-temporal trend;

\(\epsilon_{it}\) = an idiosyncratic error term that can follow a spatial autoregressive process \(\epsilon_{it}=\theta \sum_{j=1}^N w_{ij,N}\epsilon_{it}+u_{it}\) with \(u_{it}\sim N(0,\sigma^2)\) (SEM), a time series autoregressive AR(1) process, i.e., \(\epsilon_{it}=\phi \epsilon_{it-1}+u_{it}\) with \(u_{it}\sim N(0,\sigma^2)\), or both.

Obviously, this very general specification is hardly suitable in real data applications. However, it is worth noticing that it nests most of the spatial additive models which can be used in practice. For example, a more suitable semiparametric static model reads as:

\[y_{it}=\rho \sum_{j=1}^N w_{ij,N} y_{jt} + \sum_{k=1}^K \beta^*_k x^*_{k,it} + \sum_{\delta=1}^\Delta g_\delta(x_{\delta, it}) + \widetilde{ f}(s_{1i},s_{2i},\tau_t)+\epsilon_{it}\]

\[\epsilon_{it}=\phi \epsilon_{it-1}+u_{it}\] \[u_{it} \sim i.i.d.(0,\sigma^2_u)\] This semiparametric SAR model turns out to be extremely useful to capture interactive spatial and temporal unobserved heterogeneity when the last one is smoothly distributed over space and time (Mı́nguez, Basile, and Durbán 2020). The dynamic extension (including \(y_{it-1}\) and \(\sum_{j=1}^N w_{ij,N} y_{it-1}\)) is also very promising and merits further theoretical investigation. Finally, the following semiparametric SAR model is very useful for modeling cross-setional spatial data taking into account nonlinearities, spatial dependence and spatial heterogeneity:

\[y_{i}=\rho \sum_{j=1}^N w_{ij,N} y_{j} + \sum_{k=1}^K \beta^*_k x^*_{k,i} + \sum_{\delta=1}^\Delta g_\delta(x_{\delta, i}) + \widetilde{ f}(s_{1i},s_{2i})+\epsilon_{i}\]

\[\epsilon_{i}\sim i.i.d.(0,\sigma^2_\epsilon)\] In many situations the spatio-temporal trend to be estimated can be complex, and the use of a single multidimensional smooth function may not be flexible enough to capture the structure in the data. To solve this problem, an ANOVA-type decomposition of \(\widetilde{ f}(s_{1i},s_{2i},\tau_t)\) can be used, where spatial and temporal main effects, and second- and third-order interactions between them can be identified:

\[\widetilde{ f}(s_{1i},s_{2i},\tau_t)=f_1(s_{1i})+f_2(s_{2i})+f_{\tau}(\tau_t)+ \\ f_{1,2}(s_{1i},s_{2i})+f_{1,\tau}(s_{1i},\tau_t)+f_{2,\tau}+(s_{2i},\tau_t)+f_{1,2,\tau}(s_{1i},s_{2i},\tau_t)\]

First, the geoadditive terms given by \(f_1(s_{1i}),f_2(s_{2i}),f_{1,2}(s_{1i},s_{2i})\) work as control functions to filter the spatial trend out of the residuals, and transfer it to the mean response in a model specification. Thus, they make it possible to capture the shape of the spatial distribution of \(y_{it}\), conditional on the determinants included in the model. These control functions also isolate stochastic spatial dependence in the residuals, that is, spatially autocorrelated unobserved heterogeneity. Thus, they can be regarded as an alternative to the use of individual regional dummies to capture unobserved heterogeneity, as long as such heterogeneity is smoothly distributed over space. Regional dummies peak at significantly higher and lower levels of the mean response variable. If these peaks are smoothly distributed over a two-dimensional surface (i.e., if unobserved heterogeneity is spatially autocorrelated), the smooth spatial trend is able to capture them.

Second, the smooth time trend, \(f_{\tau}(\tau_t)\), and the smooth interactions between space and time - \(f_{1,\tau}(s_{1i},\tau_t),f_{2,\tau},(s_{2i},\tau_t),f_{1,2,\tau}(s_{1i},s_{2i},\tau_t)\) - work as control functions to capture the heterogeneous effect of common shocks. Thus, conditional on a smooth distribution of the spatio-temporal heterogeneity, the PS-ANOVA-SAR (SDM, SEM, SLX) model works as an alternative to the models proposed by Bai and Li (2013), Shi and Lee (2018), Pesaran and Tosetti (2011), Bailey, Holly, and Pesaran (2016) and Vega and Elhorst (2016) based on extensions of common factor models to accommodate both strong cross-sectional dependence (through the estimation of the spatio-temporal trend) and weak cross-sectional dependence (through the estimation of spatial autoregressive parameters).

Furthermore, this framework is also flexible enough to control for the linear and nonlinear functional relationships between the dependent variable and the covariates as well as the heterogeneous effects of these regressors across space. The model inherits all the good properties of penalized regression splines, such as coping with missing observations by appropriately weighting them, and straightforward interpolation of the smooth functions.

All the non-parametric terms are modeled using Penalized-splines (P-Splines, P. H. Eilers and Marx (1996)). This methodology assumes that the unknown function to be estimated is smooth, and can be represented as a linear combination of basis functions,

\[f(x)=\sum_j \theta_j B_j(x),\] a popular election is the use of cubic B-spline basis (De Boor 1977). In the case of multidimensional functions, the basis is calculated as tensor products of the marginal basis functions in each dimension. The smoohness of the curve/surface is controlled by adding to the likelihood a penalty tunned by a smoothing parameter, \(\lambda\) (the number of smoothing parameters used is equal to the dimension of the function to be estimated). The penalty can take many forms, depending of the prior knowledge on the curve/surface. The most common choice is to use second order differences on adjacent B-spline coefficients (see Mı́nguez, Basile, and Durbán 2020). In the case of a univariate function, \(f(x)\), the penalty is:

\[ Penalty= \lambda\sum_j \left ( \theta_j-2\theta_{j-1}+\theta_{j-2}\right )^2.\]

As we mentioned above, the model is estimated via REML or ML. In order to be able to estimate simultaneously all parameters in the model (including the smothing parameters) we make use of the fact that a P-spline can be reparameterized as a mixed model, and so, all the methodology debeloped in this context can be use for estimation and inference. Furthermore, the mixed model setting allows us to impose the necessary constraints so that all terms in the model are identifiable, as well as, to add spatial (or other) random effects if they are necessary.

To speed up computations, we use a modification of the SOP (Separation of Anisotropic Penalties) algorithm derived by Rodriguez-Alvarez et al. (2015) (for variance components estimation). Also, the use of nested B-spline bases (Lee, Durban, and Eilers 2013) for the interaction components of the spatio-temporal trend contributes to the efficiency of the fitting procedure without compromising the goodness of fit of the model. See Mı́nguez, Basile, and Durbán (2020) for more details.

This package is available in GitHub (https://github.com/rominsal/pspatreg) and can be
installed in the usual way.^{1}

The package is accompanied by other two vignettes
**B_Examples_pspatreg_CS_data.html** and
**C_Examples_pspatreg_Panel_data.html**, introducing the
application of `pspatreg`

to cross-sectional and panel data.
Here, we introduce some basic general information about the package.

`pspatfit()`

The main function in the `pspatreg`

package is
`pspatfit()`

, which estimates spatio-temporal pernalized
spline spatial regression models using either the Restricted Maximum
Likelihood (REML) method or the Maximum Likelihood (ML) method. In its
generic form `pspatfit()`

appears as:

`pspatfit(formula, data, na.action, listw = NULL,type = "sim", method = "eigen", Durbin = NULL, zero.policy = NULL,`

`interval = NULL,trs = NULL, cor = "none", dynamic = FALSE, control = list())`

The function `pspatfit()`

returns a list of quantities of
class `pspat`

, including coefficients of the parametric terms
and their standard errors, estimated coefficients corresponding to
random effects in mixed model and their standard errors, equivalent
degrees of freedom, residuals, fitted values, etc. A wide range of
standard methods is also available for the `pspat`

objects,
including `print()`

, `summary()`

,
`coef()`

, `vcov()`

, `anova()`

,
`fitted()`

, `residuals()`

, and
`plot()`

.

`formula`

The argument `formula`

within the function
`pspatfit()`

is formula similar to GAM specification
including parametric and non-parametric terms. Parametric covariates are
included in the usual way and non-parametric p-spline smooth terms are
specified using `pspl(.)`

and `pspat(.)`

for the
non-parametric covariates and spatial or spatio-temporal trends,
respectively. For example

```
formula <- y ~ x1 + x2 + pspl(x3, nknots = 15) + pspl(x4, nknots = 20) +
pspt(long, lat, year, nknots = c(18,18,8),
psanova = TRUE,
nest_sp1 = c(1, 2, 3),
nest_sp2 = c(1, 2, 3),
nest_time = c(1, 2, 2))
```

In the example above, the model includes two parametric terms, two
nonparametric terms, and a spatio-temporal trend (with long and lat as
spatial coordinates, and year as temporal coordinate). The dimension of
the basis function both in `pspl(.)`

and `pspt(.)`

is defined by `nknots`

. This term should not be less than the
dimension of the null space of the penalty for the term (see
`null.space.dimension`

and `choose.k`

from package
`mgcv`

to know how to choose `nknots`

). The
default number of `nknots`

in `pspl(.)`

is 10, but
in this example we have chosen 15 `nknots`

for
`g_1(x_3)`

and 20 `nknots`

for
`g_2(x_4)`

. The default number of `nknots`

in
`pspt(.)`

is `c(10,10,5)`

, but we have chosen
`c(18,18,8)`

.

In this example we also adopt an ANOVA decomposition of the
spatio-temporal trend (choosing `psanova = TRUE`

). Each
effect has its own degree of smoothing allowing a greater flexibility
for the spatio-temporal trend. Calculating up to third-order
interactions can be computationally expensive. To address this problem,
we can select subgroups of interaction effects for the second- and
third-order effects. To define these subgroups, we use three parameters
available in `pspt()`

: `nest_sp1`

,
`nest_sp2`

, and `nest_time`

. These parameters
indicate the divisors of the `nknots`

parameters. For
example, if we set `nest_sp1 = c(1,2,3)`

, we will have all
knots for the `s_1`

effect, 18/2 for each second-order
effects with `s_1`

, and 18/3 nots for the third order effect
with `s_1`

.^{2}

If we want to set any main effect to `0`

, we must set the
parameters `f1_main`

, `f2_main`

or
`ft_main`

to `FALSE`

, The default is
`TRUE`

. We can also exclude second- or third-order effects
setting `f12_int`

, `f1t_int`

,
`f2t_int`

, `f12t_int`

to `FALSE`

.

`Type`

Using the argument `Type`

we can choose different spatial
model specifications: `"sar"`

, `"sem"`

,
`"sdm"`

, `"sdem"`

, `"sarar"`

, or
`"slx"`

. When creating a `"slx"`

,
`"sdem"`

or `"sdm"`

model, we need to include the
formula of the durbin part in the Durbin parameter.

The argument `data`

must contain all the variables
included in parametric and non-parametric terms of the model. If a
`pspat(.)`

term is included in `formula`

, the data
must contain the spatial and temporal coordinates specified in
`pspat(.)`

. In this case, the coordinates must be ordered
choosing time as fast index and spatial coordinates as slow indexes.

Both `data.frame`

and `sf`

class objects can be
used as `data`

inputs.^{3} `sf`

objects are recommended
since they allow the user to map spatial trends. In our demos we use two
datasets in `sf`

version.

Plotting the estimated non-parametric smooth terms represents an
important step in semiparametric regression analyses. First, the
function `fit_terms()`

computes estimated non-parametric
smooth terms. Then, the functions `plot_sp2d()`

and
`plot_sp3d()`

are used to plot and map spatial and
spatio-temporal trends, respectively, while `plot_sptime()`

is used to plot the time trend for PS-ANOVA models in 3d; finally, and
`plot_terms()`

is used to plot smooth non-parametric
terms.

In the case of a semiparametric model without the spatial lag of the
dependent variable (PS model), if all regressors are manipulated
independently of the errors, \(\widehat{g}_\delta(x_{\delta, it})\) can be
interpreted as the conditional expectation of \(y\) given \(x_{\delta}\) (net of the effect of the
other regressors). @blu.pow.03 use the term Average Structural Function
(ASF) with reference to these functions. Instead, in PS-SAR, PS-SDM or
in PS-SARAR model, when \(\rho\) is
different from zero, the estimated smooth functions cannot be
interpreted as ASF. Taking advantage of the results obtained for
parametric SAR, we can compute the total smooth effect (total–ASF) of
\(x_{\delta}\) as:

\[\widehat{g}_{\delta}^{T}\left(x_{\delta}\right)=\Sigma_{q}
\left[\textbf{I}_{n}-\widehat{\rho}\textbf{W}_{n}\right]^{-1}_{ij}
b_{\delta q}(x_{\delta})\widehat{\beta}_{\delta q}\]

where \(b_{\delta q}(x_{\delta})\) are P-spline basis functions, and \(\widehat{\beta}_{\delta q}\) the corresponding estimated parameters.

We can also compute direct and indirect (or spillover) impacts of smooth terms in the PS-SAR case as:

\[\widehat{g}_{\delta}^{D}\left(x_{\delta}\right)=\Sigma_{q} \left[\textbf{I}_{n}-\widehat{\rho}\textbf{W}_{n}\right]^{-1}_{ii} b_{\delta q}(x_{k})\widehat{\beta}_{\delta q}\]

\[\widehat{g}_{\delta}^{I}\left(x_{\delta}\right)=\widehat{g}_{\delta}^{T}\left(x_{\delta}\right)-\widehat{g}_{\delta}^{D}\left(x_{\delta}\right)\]

Similar expressions can be provided for the direct, indirect and total impacts of the PS-SDM.

The function `impactspar()`

computes direct, indirect and
total impacts for continuous parametric covariates using the standard
procedure for their computation (LeSage and Pace
2009).

The function `impactsnopar()`

computes direct, indirect
and total impacts functions for continuous non-parametric covariates,
while the function `plot_impactsnopar()`

is used to plot
these impacts functions. It is worth noticing that total, direct and
indirect impacts are never smooth over the domain of the variable \(x_\delta\) due to the presence of the
spatial multiplier matrix in the algorithm for their computation.
Indeed, a wiggly profile of direct, indirect and total impacts would
appear even if the model were linear. Therefore, in the spirit of the
semiparametric approach, we included the possibility of applying a
spline smoother to obtain smooth curves (using the argument
`smooth=TRUE`

in the function
`plot_impactsnopar()`

).

Bai, J., and K. Li. 2013. “Spatial Panel Data Models with Common
Shocks.” MPRA Paper 52786. University of Munich, Germany. https://ideas.repec.org/p/pra/mprapa/52786.html.

Bailey, Natalia, Sean Holly, and M. Hashem Pesaran. 2016. “A
Two-Stage Approach to Spatio-Temporal Analysis with Strong and Weak
Cross-Sectional Dependence.” *Journal of Applied
Econometrics* 31 (1): 249–80.

Basile, Roberto, María Durbán, Román Mínguez, Jose María Montero, and
Jesús Mur. 2014. “Modeling Regional Economic Dynamics: Spatial
Dependence, Spatial Heterogeneity and Nonlinearities.”
*Journal of Economic Dynamics and Control* 48: 229–45. https://doi.org/10.1016/j.jedc.2014.06.011.

De Boor, C. 1977. “Package for Calculating with
B-Splines.” *Journal of Numerical Analysis*
14: 441–72.

Eilers, P. H., and B. D. Marx. 1996. “Flexible
Smoothing with B-Splines and
Penalties.” *Statistical Science* 11: 89–121.

Eilers, Paul HC, Brian D Marx, and Maria Durbán. 2015. “Twenty
Years of p-Splines.” *SORT-Statistics and Operations Research
Transactions* 39 (2): 149–86.

Hoshino, T. 2018. “Semiparametric Spatial Autoregressive Models
with Endogenous Regressors: With an Application to Crime Data.”
*Journal of Business & Economic Statistics* 36: 160–72.

Lee, D. J., M. Durban, and P. Eilers. 2013. “Efficient
Two-Dimensional Smoothing with P-Spline ANOVA
Mixed Models and Nested Bases.” *Computational Statistics
& Data Analysis* 61: 22–37.

LeSage, J., and K. Pace. 2009. *Introduction to Spatial
Econometrics*. Boca Raton: CRC Press.

Mı́nguez, Román, Roberto Basile, and Marı́a Durbán. 2020. “An
Alternative Semiparametric Model for Spatial Panel Data.”
*Statistical Methods & Applications* 29 (4): 669–708. https://doi.org/10.1007/s10260-019-00492-8.

Montero, J, R Mı́nguez, and M Durbán. 2012. “SAR
Models with Nonparametric Spatial Trends. A
P-Spline Approach.” *Estadı́stica
Española* 54 (177): 89–111.

Pesaran, M Hashem, and Elisa Tosetti. 2011. “Large Panels with
Common Factors and Spatial Correlation.” *Journal of
Econometrics* 161 (2): 182–202.

Rodriguez-Alvarez, M. X., T. Kneib, M. Durban, D. J. Lee, and P. Eilers.
2015. “Fast Smoothing Parameter Separation in Multidimensional
Generalized P-Splines: The SAP
Algorithm.” *Statistics and Computing* 25 (5): 941–57.

Shi, Wei, and Lung-fei Lee. 2018. “A Spatial Panel Data Model with
Time Varying Endogenous Weights Matrices and Common Factors.”
*Regional Science and Urban Economics* 72: 6–34.

Vega, Solmaria Halleck, and J Paul Elhorst. 2016. “A Regional
Unemployment Model Simultaneously Accounting for Serial Dynamics,
Spatial Dependence and Common Factors.” *Regional Science and
Urban Economics* 60: 85–95.

To install any R package from GitHub you need to have previously installed

`devtools`

package from CRAN. Then execute the commands`library(devtools)`

, to load`devtools`

, and install`github("rominsal/pspatreg")`

to install`pspatreg`

package:`devtools::install_github("rominsal/pspatreg")`

.↩︎In most empirical cases, the main effects are more flexible than interaction effects and therefore the number of knots in B-Spline bases for interaction effects do not need to be as large as the number of knots for the main effects (Lee, Durban, and Eilers 2013).↩︎

`sf`

means simple features of spatial vector objects. The geographic vector data model is based on points located within a coordinate reference system (CRS). Points can represent self-standing features (e.g., the location of a house) or they can be linked together to form more complex geometries such as lines and polygons. Most point geometries contain only two dimensions`x`

and`y`

(3-dimensional CRSs contain an additional`z`

value, typically representing height above sea level).`sf`

objects provide both a*geometry*information, describing where on Earth the feature is located, and*attributes*information, describing other properties (like the population of the region, the unemployment rate, etc.).`data.frame`

objects store only attributes information.↩︎