# Introduction

Missing data are a common issue in many fields of empirical research. An popular approach to handling missing data is the method of multiple imputation (MI). Multiple imputation involves replacing missing values by a number of imputations, creating multiple imputed datasets. Each completed dataset is then analysed as usual, and estimates and standard errors are combined across imputations using rules developed by Rubin.

The most popular approach to imputation uses parametric models for the missing variables given the observed. Multiple imputation gives valid inferences provided that the missing data satisfy the so called missing at random (MAR) assumption and that the imputation models used are correctly specified.

## Joint model and FCS multiple imputation

When multiple variables are affected by missingness, the traditional approach to imputation is to specify a joint (or multivariate model) for the partially observed variables. One of the earliest examples of this was MI using the multivariate normal model. Rather than specifying a joint model directly, a popular alternative is the fully conditional specification (FCS), or chained equations approach. In FCS MI separate conditional models are specified for each partially observed variable. In each of these conditional models, by default all of the variables serve as predictors. For an overview of the FCS MI and an implementation of it in R, see van Buuren and Groothuis-Oudshoorn.

## Imputation model compatibility

When missing values are imputed from a misspecified model, in general invalid inferences will result. One way in which misspecification can occur is when the imputation and substantive (analysis) model of interest are incompatible. Loosely speaking, this means there exists no joint model which contains the imputation model and the substantive model as the corresponding conditionals. In this case, as described by Bartlett et al (2015), assuming that the substantive model is correctly specified, unless the imputation and substantive models can be made compatible by imposing a restriction on the imputation model, incompatibility implies the imputation model is misspecified.

Such incompatibility between the imputation model used to impute a partially observed covariate and the substantive/outcome model can arise for example when the latter includes interactions or non-linear effects of variables. A further example is when the substantive model is a Cox proportional hazards model for a censored time to event outcome. In these cases, it may be difficult or impossible to specify an imputation model for a covariate which is compatible with the model for the outcome (the substantive model) using standard imputation models as available in existing packages.

# Substantive Model Compatible Fully Conditional Specification multiple imputation

The substantive model compatible modification of FCS MI (SMC-FCS), proposed by Bartlett et al (2015), ensures that each partially observed variable is imputed from an imputation model which is compatible with a user specified model for the outcome (which is typically the substantive model of interest, although see below regarding auxiliary variables). As described in further detail in the linked paper, for each partially observed variable, e.g. x1, in SMC-FCS a model is specified for the conditional distribution of x1 given the other partially observed variables x2,x3,..,xp and fully observed covariates z. This, together with the specified substantive model (a model for the outcome y) defines an imputation model for x1 which is guaranteed to be compatible with this specified substantive model.

## Sampling from the imputation distribution

Unfortunately, the resulting imputation model for each partially observed variable generally does not belong to a standard parametric family, complicating the imputation of missing values. To overcome this, smcfcs uses the method of rejection sampling, which is more computationally intensive than direct sampling methods.

## Statistical properties

SMC-FCS ensures compatibility between each partially observed covariate’s imputation model with the substantive model. However, when there is more than one partially observed variable, it does not guarantee that the corresponding different imputation models are mutually compatible. Consequently, as described further by Bartlett et al (2015), only in special cases does SMC-FCS generate imputations from a well defined Bayesian joint model. Nonetheless, by ensuring compatibility between each partially observed variable’s imputation model and the substantive model, it arguably overcomes (compared to standard FCS MI) the type of model incompatibility which is most likely to adversely affect inferences.

## When SMC-FCS may be preferable to FCS/MICE

In certain situations it may be advantageous to use SMC-FCS rather than traditional FCS MI. Important examples, as mentioned previously, include situations where the substantive (outcome) model includes interactions or non-linear effects of some of the covariates, or where the outcome model is itself non-linear, such as a Cox proportional hazards model. See Bartlett et al (2015) for simulation results comparing the two approaches in these situations.

# The smcfcs package

The smcfcs function in the smcfcs package implements the SMC-FCS procedure. Currently linear, logistic and Cox proportional hazards substantive models. Competing risks outcome data can also be accommodated, with a Cox proportional hazards model used to model each cause specific hazard function. Partially observed variables can be imputed using normal linear regression, logistic regression (for binary variables), proportional odds regression (sometimes known as ordinal logistic regression, suitable for ordered categorical variables), multinomial logistic regression (for unordered categorical variables), and Poisson regression (for count variables). In the following we describe some of the important aspects of using smcfcs by way of an example data frame.

## Example - linear regression substantive model with quadratic covariate effects

To illustrate the package, we use the simple example data frame ex_linquad, which is included with the package. This data frame was simulated for n=1000 independent rows. For each row, variables y,x,z,v were intended to be collected, but there are missing values in x. The values have been made artificially missing, with the probability of missingness dependent on (the fully observed) y variable. Below the first 10 rows of the data frame are shown:

library(smcfcs)
ex_linquad[1:10,]
##             y          z          x        xsq           v
## 1  -0.3404639 -1.2053334 -1.2070657 1.45700772 -2.18088437
## 2   2.1699185  0.3014667  0.2774292 0.07696698  0.17779805
## 3   2.0293128 -1.5391452  1.0844412 1.17601267  0.97370618
## 4   6.6311247  0.6353707 -2.3456977 5.50229771 -1.15350311
## 5   3.9096291  0.7029518  0.4291247 0.18414800 -1.22676124
## 6  -0.5019313 -1.9058829         NA         NA -0.53958740
## 7   0.5816303  0.9389214         NA         NA -2.31497909
## 8   1.0236009 -0.2244921         NA         NA -0.03351108
## 9  -1.2942170 -0.6738168         NA         NA -1.01040885
## 10  1.9041271  0.4457874 -0.8900378 0.79216734 -2.72923160

As shown, the xsq variable is equal to the square of the x variable. Since the latter has missing values, so does the former. We now impute the missing values in x and xsq, compatibly with a substantive model for the outcome y which is specified as a linear regression, with z, x and xsq as covariates:

set.seed(123)
#impute missing values in x, compatibly with quadratic substantive model
imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq",method=c("","","norm","x^2",""))
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation  1"
## [1] "Imputing:  x  using  z  plus outcome"
## [1] "Imputation  2"
## [1] "Imputation  3"
## [1] "Imputation  4"
## [1] "Imputation  5"

As demonstrated here, the minimal arguments to pass to smcfcs are the data frame to be used, the substantive model type, the substantive model formula, and a method vector. The substantive model type specifies whether the model for the outcome is linear, logistic or Cox regression, or a competing risks analysis (see documentation). The smformula specifies the linear predictor of the substantive/outcome model. Here we specified that the outcome y is assumed to follow a linear regression model, with z, x and xsq as predictors.

Lastly, we passed a vector of strings as the method argument. This specifies, for each column in the data frame, the method to use for imputation. As in the example, empty strings should be passed for those columns which are fully observed and thus are not to be imputed. For x we specify norm, in order to impute using a normal linear regression model. See the help for smcfcs for the syntax for other imputation model types. For xsq we specify "x^2" as the imputation method. This instructs smcfcs to impute xsq by simply squaring the imputed values of x. Such a specification could also be used with the mice package, which implements standard FCS MI. Note however that here, through specifying the substantive model as including an effect of xsq, smcfcs is imputing the missing values in x which allows for a quadratic effect on y.

Having generated the imputed datasets, we can now fit our substantive model of interest. Here we make use of the mitools package to fit our substantive model to each imputed dataset, collect the results, and combine them using Rubin’s rules:

# fit substantive model
library(mitools)
impobj <- imputationList(imps$impDatasets) models <- with(impobj, lm(y~z+x+xsq)) summary(MIcombine(models)) ## Multiple imputation results: ## with(impobj, lm(y ~ z + x + xsq)) ## MIcombine.default(models) ## results se (lower upper) missInfo ## (Intercept) 0.9334715 0.04630014 0.8403438 1.026599 32 % ## z 1.0052498 0.03733622 0.9308187 1.079681 26 % ## x 0.9849376 0.03375213 0.9183668 1.051508 15 % ## xsq 1.0387792 0.02430037 0.9902690 1.087289 27 % Here the data were simulated such that the coefficients of z, x and xsq are all 1. The estimates we have obtained are (reassuringly) close to these true parameter values. To illustrate the dangers of imputing a covariate using an imputation model which is not compatible with the substantive model, we now re-impute x, but this time imputing compatibly with a model for y which does not allow for the quadratic effect: #impute missing values in x, compatibly with model for y which omits the quadratic effect imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x",method=c("","","norm","x^2","")) ## [1] "Outcome variable(s): y" ## [1] "Passive variables: xsq" ## [1] "Partially obs. variables: x" ## [1] "Fully obs. substantive model variables: z" ## [1] "Imputation 1" ## [1] "Imputing: x using z plus outcome" ## [1] "Imputation 2" ## [1] "Imputation 3" ## [1] "Imputation 4" ## [1] "Imputation 5" ## Warning in smcfcs.core(originaldata, smtype, smformula, method, ## predictorMatrix, : Rejection sampling failed 3 times (across all variables, ## iterations, and imputations). You may want to increase the rejection sampling ## limit. smcfcs has issued some warnings about rejection sampling. We discuss this later in this vignette, but here we will continue and proceed to fit a model for y which includes both x and xsq (plus z) as covariates: # fit substantive model impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y~z+x+xsq))
summary(MIcombine(models))
## Multiple imputation results:
##       with(impobj, lm(y ~ z + x + xsq))
##       MIcombine.default(models)
##               results         se    (lower    upper) missInfo
## (Intercept) 1.2518297 0.07721520 1.0926282 1.4110313     45 %
## z           1.0923872 0.05635147 0.9799551 1.2048193     26 %
## x           0.9277074 0.06518336 0.7859114 1.0695035     63 %
## xsq         0.6093138 0.05299860 0.4845629 0.7340648     80 %

Now we have an estimate of the coefficient of xsq of 0.60, which is considerably smaller than the true value 1 used to simulate the data. This bias is due to the imputation model we have just used for x being misspecified. In particular, it was misspecified due to the fact it wrongly assumed a linear dependence of y on x, rather than allowing a quadratic dependence.

## Imputing using auxiliary variables with smcfcs

One of the strengths of multiple imputation in general is the possibility to use variables in imputation models which are subsequently not involved in the substantive model. This may be useful in order to condition or adjust for variables which are predictive of missingness, but which are not used in the substantive model of interest. Moreover, adjusting for auxiliary variables which are strongly correlated with one or more variables which are being imputed improves efficiency.

When using smcfcs to impute missing covariates, auxiliary variables v can be included by adding them as an additional covariate in the substantive model, as passed using the smformula argument. Here we are imputing x compatibly with a certain specification of model for the outcome. Our substantive model of interest is then a simpler model which omits v. For example, in the quadratic example dataset, we can add the auxiliary variable v using:

#impute, including v as a covariate in the substantive/outcome model
imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq+v",method=c("","","norm","x^2",""))
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z,v"
## [1] "Imputation  1"
## [1] "Imputing:  x  using  z,v  plus outcome"
## [1] "Imputation  2"
## [1] "Imputation  3"
## [1] "Imputation  4"
## [1] "Imputation  5"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 1 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
# fit substantive model, which omits v
impobj <- imputationList(imps$impDatasets) models <- with(impobj, lm(y~z+x+xsq)) summary(MIcombine(models)) ## Multiple imputation results: ## with(impobj, lm(y ~ z + x + xsq)) ## MIcombine.default(models) ## results se (lower upper) missInfo ## (Intercept) 0.9295283 0.04234163 0.8457342 1.013322 19 % ## z 1.0209851 0.03536731 0.9511068 1.090863 17 % ## x 0.9762357 0.03556863 0.9054434 1.047028 24 % ## xsq 1.0403799 0.02367377 0.9933561 1.087404 23 % For outcome models other than linear regression, this approach is not entirely justifiable due to the lack of collapsibility of non-linear models. For example, if a Cox model is assumed for a failure time given variables x and v, the hazard function given only x (i.e. omitting v from the model) is no longer a Cox model. Further research is warranted to explore how this might affect the resulting inferences. It is also possible to include the auxiliary variable v without adding it to the outcome model (as given in the smformula argument), through specification of the predictorMatrix argument. Doing so conditions on v, but assumes that the outcome is independent of v, conditional on whatever covariates are specified in smformula. This should thus only be used when the latter assumption is justified. When it is, inferences will in general be more efficient. To make this assumption when imputing x in the ex_linquad data, we define a predictorMatrix which will specify that x be imputed using both z and v, but we omit v from the smformula argument: predMatrix <- array(0, dim=c(ncol(ex_linquad),ncol(ex_linquad))) predMatrix[3,] <- c(0,1,0,0,1) imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq",method=c("","","norm","x^2",""),predictorMatrix=predMatrix) ## [1] "Outcome variable(s): y" ## [1] "Passive variables: xsq" ## [1] "Partially obs. variables: x" ## [1] "Fully obs. substantive model variables: z" ## [1] "Imputation 1" ## [1] "Imputing: x using z,v plus outcome" ## [1] "Imputation 2" ## [1] "Imputation 3" ## [1] "Imputation 4" ## [1] "Imputation 5" ## Warning in smcfcs.core(originaldata, smtype, smformula, method, ## predictorMatrix, : Rejection sampling failed 1 times (across all variables, ## iterations, and imputations). You may want to increase the rejection sampling ## limit. impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y~z+x+xsq))
summary(MIcombine(models))
## Multiple imputation results:
##       with(impobj, lm(y ~ z + x + xsq))
##       MIcombine.default(models)
##               results         se    (lower   upper) missInfo
## (Intercept) 0.9424871 0.04305650 0.8572108 1.027763     20 %
## z           1.0174988 0.03674626 0.9445484 1.090449     22 %
## x           0.9935095 0.03267809 0.9294032 1.057616      6 %
## xsq         1.0340561 0.02323316 0.9882080 1.079904     16 %

## Rejection sampling warnings

Sometimes when running smcfcs you may receive warnings that the rejection sampling that smcfcs uses has failed to draw from the required distribution on a couple of occasions. Upon receiving this warning, it is generally good idea to re-run smcfcs, specifying a value for rjlimit which is larger than the default, until the warning is no longer issued. Having said that, when only a small number of warnings are issued, it may be fine to ignore the warnings, especially when the dataset is large.

## Assessing convergence

Like standard chained equations or FCS imputation, the SMC-FCS algorithm must be run for a sufficient number of iterations for the process to converge to its stationary distribution. The default number of iterations used is 10, but this may not be sufficient in any given dataset and model specification To assess convergence, the object returned by smcfcs includes an object called smCoefIter. This matrix contains the parameter estimates of the substantive model, and is indexed by imputation number, parameter number, and iteration number. To assess convergence, one can call smcfcs with m=1 and numit suitably chosen (e.g. numit=100). The values in the resulting smCoefIter matrix can then be plotted to assess convergence. To illustrate, we re-run the imputation model used previously with the example data, but asking for only m=1 imputation to be generated, and with 100 iterations.

# impute once with a larger number of iterations than the default 10
imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq",method=c("","","norm","x^2",""),predictorMatrix=predMatrix,m=1,numit=100)
## [1] "Outcome variable(s): y"
## [1] "Passive variables: xsq"
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation  1"
## [1] "Imputing:  x  using  z,v  plus outcome"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 3 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.
# plot estimates of the fourth parameter of the substantive model against iteration number
plot(imps\$smCoefIter[1,4,])

The plot shows that the process appears to converge rapidly, such that the default choice of numit=10 is probably fine here. Of course we should also examine the corresponding plots for the other parameters of the substantive model, since convergence may require more than 10 iterations for some of these.

## References

Bartlett JW, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Statistical Methods in Medical Research, 2015; 24(4):462-487

van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 2011; 45(3)