The `smcfcs` package

The smcfcs function in the smcfcs package implements the SMC-FCS procedure. Currently it supports linear, logistic and Cox proportional hazards substantive models. Competing risks outcome data can also be accommodated, with a Cox proportional hazards model used to model each cause-specific hazard function. Partially observed variables can be imputed using normal linear regression, logistic regression (for binary variables), proportional odds regression (sometimes known as ordinal logistic regression, suitable for ordered categorical variables), multinomial logistic regression (for unordered categorical variables), and Poisson regression (for count variables). In the following we describe some of the important aspects of using smcfcs by way of an example data frame.

Example - linear regression substantive model with quadratic covariate effects

To illustrate the package, we use the simple example data frame ex_linquad, which is included with the package. This data frame was simulated for n=1000 independent rows. For each row, the variables y, z, x and v were intended to be collected, but there are missing values in x. The values have been made artificially missing, with the probability of missingness dependent on the (fully observed) variable y. The first 10 rows of the data frame are shown below:

library(smcfcs)
ex_linquad[1:10, ]

##             y          z          x           v
## 1  -0.3404639 -1.2053334 -1.2070657 -2.18088437
## 2   2.1699185  0.3014667  0.2774292  0.17779805
## 3   2.0293128 -1.5391452  1.0844412  0.97370618
## 4   6.6311247  0.6353707 -2.3456977 -1.15350311
## 5   3.9096291  0.7029518  0.4291247 -1.22676124
## 6  -0.5019313 -1.9058829         NA -0.53958740
## 7   0.5816303  0.9389214         NA -2.31497909
## 8   1.0236009 -0.2244921         NA -0.03351108
## 9  -1.2942170 -0.6738168         NA -1.01040885
## 10  1.9041271  0.4457874 -0.8900378 -2.72923160
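
To make the missingness mechanism concrete, the following is a minimal sketch of how a data frame with this structure could be simulated. It is not the code used to generate ex_linquad. The covariate distributions, the intercept, the residual variance and the missingness model are all assumptions chosen for illustration; only the unit coefficients of z, x and x^2 (stated later in this section) and the dependence of missingness on y come from this vignette.

```r
# Illustrative simulation only; NOT the generating code for ex_linquad.
set.seed(2024)
n <- 1000
z <- rnorm(n)
v <- rnorm(n)                        # assumed distribution for the auxiliary variable
x <- 0.5 * z + 0.5 * v + rnorm(n)    # assumed covariate model for x
# quadratic outcome model: coefficients of z, x and x^2 equal to 1 (intercept and error SD assumed)
y <- 1 + z + x + x^2 + rnorm(n)

# make x missing with a probability that depends only on the fully observed y
p_miss <- plogis(-1 + 0.3 * y)       # assumed missingness model
x_obs <- ifelse(runif(n) < p_miss, NA_real_, x)

sim <- data.frame(y = y, z = z, x = x_obs, v = v)
colSums(is.na(sim))                  # number of missing values per column
```

Because missingness here depends only on an observed variable, the data are missing at random, which is the setting in which multiple imputation is usually applied.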

We now impute the missing values in x, compatibly with a substantive model for the outcome y which is specified as a linear regression, with z, x and I(x^2) (the square of x) as covariates:

set.seed(123)
# impute missing values in x, compatibly with quadratic substantive model
imps <- smcfcs(originaldata=ex_linquad, smtype = "lm", smformula = "y~z+x+I(x^2)", method = c("", "", "norm", ""))

## [1] "Outcome variable(s): y"
## [1] "Passive variables: "
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation  1"
## [1] "Imputing:  x  using  z  plus outcome"
## [1] "Imputation  2"
## [1] "Imputation  3"
## [1] "Imputation  4"
## [1] "Imputation  5"

As demonstrated here, the minimal arguments to pass to smcfcs are the data frame to be used, the substantive model type, the substantive model formula, and a method vector. The substantive model type specifies the type of model; see the package documentation for the current range of options. The smformula specifies the linear predictor of the substantive/outcome model. Here we specified that the outcome y is assumed to follow a linear regression model, with z, x and I(x^2) as predictors.

Lastly, we passed a vector of strings as the method argument. This specifies, for each column in the data frame, the method to use for imputation. As in the example, empty strings should be passed for those columns which are fully observed and thus are not to be imputed. For x we specify norm, in order to impute using a normal linear regression model. See the help for smcfcs for the syntax for other imputation model types.
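
Because the method vector is positional, it is easy to misalign it with the columns of a wider data frame. The sketch below builds the vector from column names instead. The helper make_method is hypothetical (it is not part of smcfcs) and the small data frame dat is a stand-in with the same column order as ex_linquad.

```r
# Hypothetical helper, not part of smcfcs: build the method vector by column name.
make_method <- function(data, methods) {
  # every named variable must be a column of the data frame
  stopifnot(all(names(methods) %in% names(data)))
  # start with "" (do not impute) for every column
  out <- setNames(rep("", ncol(data)), names(data))
  out[names(methods)] <- methods
  unname(out)
}

# stand-in data frame with the same column order as ex_linquad
dat <- data.frame(y = c(1, 2, 3), z = c(0.5, 0.1, 0.3),
                  x = c(NA, 1, 2), v = c(1, 1, 0))
meth <- make_method(dat, c(x = "norm"))
meth
```

The resulting vector can be passed as the method argument in the same way as the hand-written c("", "", "norm", "") used above.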

Having generated the imputed datasets, we can now fit our substantive model of interest. Here we make use of the mitools package to fit our substantive model to each imputed dataset, collect the results, and combine them using Rubin’s rules:

# fit substantive model
library(mitools)
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ z + x + I(x^2)))
summary(MIcombine(models))

## Multiple imputation results:
##       with(impobj, lm(y ~ z + x + I(x^2)))
##       MIcombine.default(models)
##               results         se    (lower   upper) missInfo
## (Intercept) 0.9395207 0.04059690 0.8596269 1.019414     12 %
## z           1.0039949 0.03740834 0.9291711 1.078819     28 %
## x           0.9830796 0.04001903 0.9008448 1.065314     43 %
## I(x^2)      1.0471721 0.02232266 1.0032196 1.091125     13 %
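
For readers unfamiliar with Rubin's rules, the following sketch shows the pooling of a single coefficient by hand. The function rubin_pool is hypothetical and the input numbers are made up for illustration (they are not output from the analysis above). MIcombine additionally reports confidence intervals and the fraction of missing information, which this sketch omits.

```r
# Hypothetical helper: pool one coefficient across m imputed datasets.
rubin_pool <- function(est, se) {
  m <- length(est)
  stopifnot(m > 1, length(se) == m)
  qbar <- mean(est)                 # pooled point estimate
  ubar <- mean(se^2)                # within-imputation variance
  b <- var(est)                     # between-imputation variance
  total <- ubar + (1 + 1 / m) * b   # total variance
  c(estimate = qbar, se = sqrt(total))
}

# made-up estimates and standard errors from m = 3 imputations
pooled <- rubin_pool(est = c(1.0, 1.2, 0.8), se = c(0.1, 0.1, 0.1))
pooled
```

The between-imputation term is what inflates the standard error to reflect uncertainty due to the missing values.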

Here the data were simulated such that the coefficients of z, x and I(x^2) are all 1. The estimates we have obtained are (reassuringly) close to these true parameter values. To illustrate the dangers of imputing a covariate using an imputation model which is not compatible with the substantive model, we now re-impute x, but this time imputing compatibly with a model for y which does not allow for the quadratic effect:

# impute missing values in x, compatibly with model for y which omits the quadratic effect
imps <- smcfcs(ex_linquad, smtype = "lm", smformula = "y~z+x", method = c("", "", "norm", ""))

## [1] "Outcome variable(s): y"
## [1] "Passive variables: "
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation  1"
## [1] "Imputing:  x  using  z  plus outcome"
## [1] "Imputation  2"
## [1] "Imputation  3"
## [1] "Imputation  4"
## [1] "Imputation  5"

We now proceed to fit a model for y which includes both x and I(x^2) (plus z) as covariates:

# fit substantive model
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ z + x + I(x^2)))
summary(MIcombine(models))

## Multiple imputation results:
##       with(impobj, lm(y ~ z + x + I(x^2)))
##       MIcombine.default(models)
##               results         se    (lower    upper) missInfo
## (Intercept) 1.2518026 0.09465547 1.0452980 1.4583072     64 %
## z           1.1192199 0.06633381 0.9819356 1.2565042     46 %
## x           0.8221856 0.09109848 0.6040357 1.0403355     83 %
## I(x^2)      0.5541223 0.07326702 0.3690690 0.7391757     90 %

Now we have an estimate of the coefficient of I(x^2) of 0.55, which is considerably smaller than the true value of 1 used to simulate the data. This bias arises because the imputation model we have just used for x was misspecified: it wrongly assumed a linear dependence of y on x, rather than allowing a quadratic dependence.
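
The source of the incompatibility can be seen by writing out the conditional distribution of x implied by the quadratic substantive model. The notation below (the coefficients β, the residual variance σ² and the covariate model f(x | z)) is introduced here for illustration and is not taken from the package documentation.

```latex
f(x \mid y, z) \;\propto\; f(y \mid x, z)\, f(x \mid z)
  \;\propto\; \exp\!\left\{ -\frac{\left(y - \beta_0 - \beta_z z - \beta_x x - \beta_{x^2} x^2\right)^2}{2\sigma^2} \right\} f(x \mid z)
```

When \beta_{x^2} \neq 0 the exponent contains terms up to x^4, so even if f(x | z) is normal, f(x | y, z) is not a normal distribution with mean linear in y and z. Only when \beta_{x^2} = 0 (and f(x | z) is a normal linear model) does the exponent become quadratic in x, giving a normal conditional distribution. Imputing compatibly with "y~z+x" therefore corresponds to the second case, which conflicts with the quadratic model fitted afterwards.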

Imputing using auxiliary variables with `smcfcs`

One of the strengths of multiple imputation in general is the possibility to use variables in imputation models which are subsequently not involved in the substantive model. This may be useful in order to condition or adjust for variables which are predictive of missingness, but which are not used in the substantive model of interest. Moreover, adjusting for auxiliary variables which are strongly correlated with one or more variables which are being imputed improves efficiency.

When using smcfcs to impute missing covariates, an auxiliary variable such as v can be included by adding it as an additional covariate in the substantive model, as passed using the smformula argument. In this case we impute x compatibly with an outcome model that includes v, whereas our substantive model of interest is a simpler model which omits v. For example, in the quadratic example dataset, we can add the auxiliary variable v using:

# impute, including v as a covariate in the substantive/outcome model
imps <- smcfcs(ex_linquad, smtype = "lm", smformula = "y~z+x+I(x^2)+v", method = c("", "", "norm", ""))

## [1] "Outcome variable(s): y"
## [1] "Passive variables: "
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z,v"
## [1] "Imputation  1"
## [1] "Imputing:  x  using  z,v  plus outcome"
## [1] "Imputation  2"
## [1] "Imputation  3"
## [1] "Imputation  4"
## [1] "Imputation  5"

# fit substantive model, which omits v
impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ z + x + I(x^2)))
summary(MIcombine(models))

## Multiple imputation results:
##       with(impobj, lm(y ~ z + x + I(x^2)))
##       MIcombine.default(models)
##               results         se    (lower   upper) missInfo
## (Intercept) 0.9510822 0.04345467 0.8647044 1.037460     23 %
## z           1.0198395 0.03925886 0.9404729 1.099206     35 %
## x           1.0010631 0.04003075 0.9189433 1.083183     42 %
## I(x^2)      1.0352702 0.02536115 0.9841539 1.086387     33 %

For outcome models other than linear regression, this approach is not entirely justifiable due to the lack of collapsibility of non-linear models. For example, if a Cox model is assumed for a failure time given variables x and v, the hazard function given only x (i.e. omitting v from the model) is no longer a Cox model. Further research is warranted to explore how this might affect the resulting inferences.

It is also possible to include the auxiliary variable v without adding it to the outcome model (as given in the smformula argument), through specification of the predictorMatrix argument. Doing so conditions on v, but assumes that the outcome is independent of v, conditional on whatever covariates are specified in smformula. This should thus only be used when the latter assumption is justified. When it is, inferences will in general be more efficient. To make this assumption when imputing x in the ex_linquad data, we define a predictorMatrix which will specify that x be imputed using both z and v, but we omit v from the smformula argument:

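# rows and columns of predMatrix follow the column order of ex_linquad: y, z, x, v
predMatrix <- array(0, dim = c(ncol(ex_linquad), ncol(ex_linquad)))
# row 3 corresponds to x: impute x using z (column 2) and v (column 4)
predMatrix[3, ] <- c(0, 1, 0, 1)
imps <- smcfcs(ex_linquad, smtype = "lm", smformula = "y~z+x+I(x^2)", method = c("", "", "norm", ""), predictorMatrix = predMatrix)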

## [1] "Outcome variable(s): y"
## [1] "Passive variables: "
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation  1"
## [1] "Imputing:  x  using  z,v  plus outcome"
## [1] "Imputation  2"
## [1] "Imputation  3"
## [1] "Imputation  4"
## [1] "Imputation  5"

impobj <- imputationList(imps$impDatasets)
models <- with(impobj, lm(y ~ z + x + I(x^2)))
summary(MIcombine(models))

## Multiple imputation results:
##       with(impobj, lm(y ~ z + x + I(x^2)))
##       MIcombine.default(models)
##               results         se    (lower   upper) missInfo
## (Intercept) 0.9406931 0.04413897 0.8525920 1.028794     27 %
## z           1.0178161 0.04348032 0.9271345 1.108498     49 %
## x           0.9958780 0.03277003 0.9315101 1.060246      9 %
## I(x^2)      1.0368207 0.02337028 0.9905462 1.083095     20 %
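
As with the method vector, filling in the predictor matrix by position is error-prone when there are many columns. The sketch below fills it in by variable name. The helper make_pred_matrix is hypothetical (not part of smcfcs), and dat is a small stand-in with the same column order as ex_linquad.

```r
# Hypothetical helper, not part of smcfcs: build a predictor matrix by name.
# `predictors` is a named list: names are variables to impute,
# values are the covariates used to impute them.
make_pred_matrix <- function(data, predictors) {
  vars <- names(data)
  pm <- matrix(0, nrow = length(vars), ncol = length(vars),
               dimnames = list(vars, vars))
  for (target in names(predictors)) {
    stopifnot(target %in% vars, all(predictors[[target]] %in% vars))
    pm[target, predictors[[target]]] <- 1
  }
  unname(pm)  # return a plain matrix, like the hand-built array above
}

# stand-in data frame with the same column order as ex_linquad
dat <- data.frame(y = c(1, 2), z = c(0, 1), x = c(NA, 1), v = c(1, 0))
predMatrix <- make_pred_matrix(dat, list(x = c("z", "v")))
predMatrix[3, ]
```

The third row of the result matches the c(0, 1, 0, 1) row specified by hand earlier, so the matrix could be passed through the predictorMatrix argument in the same way.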

Rejection sampling warnings

Sometimes when running smcfcs you may receive a warning that the rejection sampling that smcfcs uses has failed to draw from the required distribution a number of times. Upon receiving this warning, it is generally a good idea to re-run smcfcs, specifying a value for rjlimit which is larger than the default, until the warning is no longer issued. That said, when the number of failures reported is small, it may be fine to ignore the warning, especially when the dataset is large.
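
To give a sense of what the limit controls, here is a minimal sketch of a capped rejection sampler for a single missing x under a linear substantive model with a quadratic term. It illustrates the general idea only and is not the package's internal code: the function draw_x_rejection, its arguments, and the choice to return the last proposal when the cap is reached are all assumptions of this sketch.

```r
# Sketch of rejection sampling with an attempt limit (not smcfcs internals).
# Proposal: x* ~ N(gamma[1] + gamma[2] * z, tau^2), an assumed covariate model.
# Accept with probability exp(-(y - mu)^2 / (2 * sigma^2)), where mu is the
# substantive model mean evaluated at x*; this ratio is bounded above by 1.
draw_x_rejection <- function(y, z, beta, sigma, gamma, tau, limit = 1000) {
  x_star <- NA_real_
  for (attempt in seq_len(limit)) {
    x_star <- rnorm(1, mean = gamma[1] + gamma[2] * z, sd = tau)
    mu <- beta[1] + beta[2] * z + beta[3] * x_star + beta[4] * x_star^2
    if (runif(1) <= exp(-(y - mu)^2 / (2 * sigma^2))) {
      return(list(x = x_star, failed = FALSE))
    }
  }
  # cap reached without an acceptance: flag the failure
  list(x = x_star, failed = TRUE)
}

set.seed(42)
res <- draw_x_rejection(y = 2, z = 0, beta = c(1, 1, 1, 1), sigma = 1,
                        gamma = c(0, 0.5), tau = 1)
res$failed
```

When the observed y is far from the values the substantive model predicts for most proposals, the acceptance probability is small and the cap is more likely to be hit, which is the situation in which raising rjlimit in the call to smcfcs helps.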

Assessing convergence

Like standard chained equations or FCS imputation, the SMC-FCS algorithm must be run for a sufficient number of iterations for the process to converge to its stationary distribution. The default number of iterations is 10, but this may not be sufficient for a given dataset and model specification. To assess convergence, the object returned by smcfcs includes a component called smCoefIter. This three-dimensional array contains the parameter estimates of the substantive model, indexed by imputation number, parameter number, and iteration number. One can call smcfcs with m=1 and numit suitably chosen (e.g. numit=100), and then plot the values in the resulting smCoefIter array against iteration number. To illustrate, we re-run the imputation model used previously with the example data, but asking for only m=1 imputation to be generated, and with 100 iterations.

# impute once with a larger number of iterations than the default 10
imps <- smcfcs(ex_linquad, smtype = "lm", smformula = "y~z+x+I(x^2)", method = c("", "", "norm", ""), predictorMatrix = predMatrix, m = 1, numit = 100)

## [1] "Outcome variable(s): y"
## [1] "Passive variables: "
## [1] "Partially obs. variables: x"
## [1] "Fully obs. substantive model variables: z"
## [1] "Imputation  1"
## [1] "Imputing:  x  using  z,v  plus outcome"

## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 6 times (across all variables,
## iterations, and imputations). You may want to increase the rejection sampling
## limit.

# plot estimates of the parameters of the substantive model against iteration number
plot(imps)

The plot shows that the process appears to converge rapidly, such that the default choice of numit=10 is probably fine here.
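
If more control over the plot is wanted, the smCoefIter array can also be plotted directly. The sketch below assumes only the indexing described above (imputation, parameter, iteration). The array coef_iter is a randomly generated stand-in with an arbitrary number of parameters, not output from smcfcs, and trace_for_imputation is a hypothetical helper; in practice imps$smCoefIter would be passed in its place.

```r
# Stand-in for imps$smCoefIter: 1 imputation, 4 parameters, 100 iterations.
# These values are random placeholders, NOT real smcfcs output.
set.seed(1)
coef_iter <- array(rnorm(1 * 4 * 100, mean = 1, sd = 0.05), dim = c(1, 4, 100))

# Hypothetical helper: iteration-by-parameter matrix for one imputation.
trace_for_imputation <- function(coef_iter, imp = 1) {
  d <- dim(coef_iter)
  stopifnot(length(d) == 3, imp >= 1, imp <= d[1])
  # matrix() guards against the array dropping to a vector when d[2] == 1
  t(matrix(coef_iter[imp, , ], nrow = d[2]))
}

tr <- trace_for_imputation(coef_iter)
# one line per parameter, plotted against iteration number
matplot(tr, type = "l", lty = 1, xlab = "Iteration", ylab = "Coefficient")
```

Traces that drift steadily over the iterations would suggest that a larger numit is needed, whereas traces that fluctuate around a stable level from early on are consistent with convergence.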

References

Bartlett JW, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Statistical Methods in Medical Research, 2015; 24(4):462-487

van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 2011; 45(3)

smcfcs

Jonathan Bartlett

Introduction

Joint model and FCS multiple imputation

Imputation model compatibility

Substantive Model Compatible Fully Conditional Specification multiple imputation

Sampling from the imputation distribution

Statistical properties

When SMC-FCS may be preferable to FCS/MICE

The `smcfcs` package

Example - linear regression substantive model with quadratic covariate effects

Imputing using auxiliary variables with `smcfcs`

Rejection sampling warnings

Assessing convergence

References
