gFormulaMI - G-formula for Causal Inference via Multiple Imputation

Jonathan Bartlett


In this vignette we introduce the functionality of the gFormulaMI. The package implements an approach to constructing a G-formula estimator using multiple imputation methods and software, as described by Bartlett et al (2023).

First we use the simulated dataset provided in the package to demonstrate how to impute potential outcomes using gFormulaImpute. We then show how the resulting imputed datasets can be analysed using the syntheticPool function. Lastly, we illustrate how the functions can be used when the dataset has missing values to begin with (i.e. regular missing data).

Imputing using gFormulaImpute

First, we load the package


The first few rows of the simulated dataset that ships with the package look like

#>           l0 a0          l1 a1         l2 a2          y
#> 1 -0.6332580  1  0.59545963  1  2.7644551  1  3.4091077
#> 2  0.1182498  0  0.62444135  0 -0.9677095  1 -0.8586307
#> 3  0.4833364  1 -0.53836723  0 -0.7088896  0 -0.4378790
#> 4  0.6775931  0  0.57759241  1  2.0477487  1  3.8760967
#> 5  0.4278554  1  1.50188037  1  3.4155576  1  4.4027209
#> 6  0.6437421  1 -0.01322485  1  2.2178911  1  4.1302417

Here the l variables correspond to a confounder measured at baseline (l0) and two follow-up time points (l1 and l2). The time-varying binary treatment variable is stored in a0, a1 and a2. The final outcome variable is y.

Next we use gFormulaImpute to impute potential outcomes under two treatment regimes of interest. To do this, gFormulaImpute uses the mice multiple imputation function from the mice package. If passed a regular data frame, as here, gFormulaImpute expects (and requires) there to be no missing values in the data frame. To run, we have to tell gFormulaImpute which variables correspond to the time-varying treatment and the treatment regime(s) that we want it to create imputations for:

#impute synthetic datasets under two regimes of interest
imps <- gFormulaImpute(data=simDataFullyObs,M=10,trtVars=c("a0","a1","a2"),
#> [1] "Input data is a regular data frame."
#> [1] "Variables imputed using:"
#>     l0     a0     l1     a1     l2     a2      y regime 
#> "norm"     "" "norm"     "" "norm"     "" "norm"     "" 
#> [1] "Predictor matrix is set to:"
#>        l0 a0 l1 a1 l2 a2 y regime
#> l0      0  0  0  0  0  0 0      0
#> a0      1  0  0  0  0  0 0      0
#> l1      1  1  0  0  0  0 0      0
#> a1      1  1  1  0  0  0 0      0
#> l2      1  1  1  1  0  0 0      0
#> a2      1  1  1  1  1  0 0      0
#> y       1  1  1  1  1  1 0      0
#> regime  1  1  1  1  1  1 1      0

Here we have called gFormulaImpute requesting 10 imputations to be generated for two regimes of interest, corresponding to no treatment at each time point (0,0,0) and treatment at each time point (1,1,1).

Imputation model specification

The printed output above tells us what imputation methods have been used to impute (simulate) the time-varying confounders and outcome. By default, gFormulaImpute specifies to impute numeric variables using normal linear regression, which is why here l0, l1, l2 and y are being imputed using method norm. If there had been binary factors as time-varying confounders, these would be imputed using logistic regression. Factor time-varying confounders which are unordered are imputed using polytomous regression and a proportional odds model is used for ordered factors. If you want to modify these defaults, you can pass a method vector to gFormulaImpute, which specifies which of the imputation methods available in the mice package to use when imputing each column of the data frame.

The output also shows the predictorMatrix argument used when calling the mice imputation function. This shows, in a given row, which other variables (indicated by a 1) will be used as covariates in the imputation model for the variable labelled in that row. Thus here, l1 is being imputed using l0 and a0. For variables which are not being imputed (as indicated by an empty entry in the vector of imputation methods printed), the corresponding row in the predictor matrix has no effect on anything. In particular, the treatment variables a0, a1 and a2 are not being imputed based on a regression model.

You will notice that the printed predictorMatrix has 1s in the lower diagonal and 0s in the upper diagonal. This is because g-formula and hence gFormulaImpute, imputes sequentially forwards in time, using the past (i.e. to the left) variables as covariates. Because of this, the data frame you pass to gFormulaImpute must have the variables ordered in time as in the example above, so that the correct covariates are used in each model. If you wish to modify the predictor matrix used, you can pass a custom one using the predictorMatrix argument of gFormulaImpute.

Note that unlike in the usual missing data situation, no iteration is required when gFormulaImpute calls the mice imputation function, and thus when gFormulaImpute calls mice it sets its maxit argument to 1.

Number of imputations

In our call above we asked gFormulaImpute to create 10 imputations. Later when we analyse the imputations, the special pooling rules we will apply can, with a small number of imputations, produce negative variances for parameter estimates. To avoid this, we recommend using at least 25 imputations in general (we used 10 here for the sake of speed).

Number of individuals to simulate

In our call above we did not specify the nSim argument. This argument specifies how many individuals to simulate longitudinal histories for in each of the synthetic imputed datasets. Since we did not specify a value, the default is to simulate the same number as the number of rows in the data frame passed to gFormulaImpute (here 5,000). If more than one treatment regime is specified, nSim rows are simulated for each regime.

Analysing the imputations

gFormulaImpute creates a set of synthetic imputed datasets. That is, the imputed datasets created contain imputed (simulated) longitudinal histories based on the fitted models, under the treatment regimes specified. The original rows in the data frame passed to gFormulaImpute do not appear in the imputed datasets returned to us. The first few rows of the first imputed dataset are:

#>           l0 a0         l1 a1            l2 a2          y regime
#> 1  0.6177218  0  0.6397720  0 -0.0003761256  0  0.5996190      1
#> 2  0.2044563  0 -0.6073790  0 -1.7856679957  0 -1.7246571      1
#> 3  0.4327859  0 -0.3925652  0 -0.7555212806  0 -1.5840459      1
#> 4  0.6059553  0  2.4764117  0  0.7312018102  0  0.6342567      1
#> 5  0.2985589  0  0.9223022  0  0.0734605439  0 -1.1511625      1
#> 6 -0.1763264  0  0.6122369  0  0.8838681514  0  2.0858243      1

We see that in the first rows of the first imputed dataset, a0, a1 and a2 are always zero, because the first treatment regime we specified to impute for was (0,0,0). We also have an additional variable, called regime. This factor variable records which of the specified treatment regimes each row corresponds to. Thus here, in the first few rows of the data frame, the regime is 1.

To analyse the imputed datasets we first run our desired analysis of the imputed longitudinal histories. Here we will simply compare the mean of the final outcome variable y between the two regimes we have imputed for. To do this, we fit a linear model with y as the dependent variable and regime as covariate:

fits <- with(imps, lm(y~factor(regime)))

Because the imputed datasets we have analysed are fully synthetic, we cannot (or rather should not) pool our estimates using Rubin’s standard combination rules (e.g. as implemented in pool in the mice package). Doing so results in variances which are larger than they should be. Instead we use the synthetic imputation combination rules developed by Raghunathan et al (2003), implemented in the syntheticPool function:

#>                    Estimate       Within     Between       Total       df
#> (Intercept)     -0.02071539 0.0008125695 0.001763004 0.001126735 3.038045
#> factor(regime)2  2.96593502 0.0016251389 0.004203106 0.002998278 3.784951
#>                   95% CI L   95% CI U            p
#> (Intercept)     -0.1267874 0.08535658 5.803156e-01
#> factor(regime)2  2.8104386 3.12143146 1.304839e-06

Here the intercept corresponds to the mean of y under our first regime (0,0,0). The factor(regime)2 coefficient corresponds to difference in the mean of y between the second regime and the first. Here we see that the second regime has a mean outcome about 3 higher than the first. As well as the point estimate, we see the within, between and total imputation variances. Here we can see that the total variances are less than the between, which never happens with Rubin’s regular pooling rules. This is because in the synthetic pooling rules of Raghunthan et al (2003), the total variance is the between minus the within imputation variance (plus a correction for the finite number of imputations performed).

Datasets with missing values

Often the longitudinal dataset we have has some missing values. In this context, one approach is to use multiple imputation to impute these missing values using imputation software, assuming missing at random. Having imputed these, we can then pass the imputed datasets to gFormulaImpute to perform the synthetic imputation step.

To illustrate these steps, let’s now create a new dataset from the simulated one, making some values missing:

simDataMissing <- simDataFullyObs
simDataMissing$l0[runif(nrow(simDataMissing))<0.2] <- NA
simDataMissing$l1[runif(nrow(simDataMissing))<0.2] <- NA
simDataMissing$l2[runif(nrow(simDataMissing))<0.2] <- NA
simDataMissing$y[runif(nrow(simDataMissing))<0.2] <- NA

Here we have simply made around 20% of values missing in l0, l1, l2 and y, completely randomly.

Next, we impute the simDataMissing data frame using the mice function:

simDataMissingImps <- mice::mice(simDataMissing,m=10,printFlag=FALSE)

In a real substantive analysis we should take more care to think about how we specify the imputation models. For more discussion of this point, see Bartlett et al (2023). In particular, note that the default behaviour of the mice function is to impute numeric variables using the predictive mean matching method, rather than normal linear regression (as in the gFormulaImpute function).

We can now pass our multiply imputed datasets to gFormulaImpute to create the synthetic imputations as before, but this time let’s try and generate 20 synthetic imputated datasets:

imps2 <- gFormulaImpute(data=simDataMissingImps,M=20,trtVars=c("a0","a1","a2"),
#> [1] "Input data is a mice created multiple imputation object."
#> [1] "Value passed to M being ignored."
#> [1] "Number of synthetic imputations to be generated set to 10 as in mids object passed to gFormulaImpute."
#> [1] "Variables imputed using:"
#>     l0     a0     l1     a1     l2     a2      y regime 
#> "norm"     "" "norm"     "" "norm"     "" "norm"     "" 
#> [1] "Predictor matrix is set to:"
#>        l0 a0 l1 a1 l2 a2 y regime
#> l0      0  0  0  0  0  0 0      0
#> a0      1  0  0  0  0  0 0      0
#> l1      1  1  0  0  0  0 0      0
#> a1      1  1  1  0  0  0 0      0
#> l2      1  1  1  1  0  0 0      0
#> a2      1  1  1  1  1  0 0      0
#> y       1  1  1  1  1  1 0      0
#> regime  1  1  1  1  1  1 1      0

The output looks much the same as the first time we called gFormulaImpute, apart from the first few lines of output. gFormulaImpute can see that there are only 10 imputations of the original dataset, and it refuses to generate a different number of imputations to the number in the mids multiple imputation dataset object we have passed to it. The imps2 object thus contains 10 imputations, which can be analysed using syntheticPool in the same way as before.


Bartlett JW, Olarte Parra C, Daniel RM. G-formula for causal inference via multiple imputation. 2023 arXiv 2301.12026

Raghunathan TE, Reiter JP, Rubin DB. 2003. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1), p.1-16.