[R] SEM validation: Cross-Validation vs. Bootstrapping

Paul Miller pjmiller_57 at yahoo.com
Thu Nov 1 18:24:43 CET 2012


Hello All,

Recently, I was asked to help out with an SEM cross-validation analysis. Initially, the project was based on "sample-splitting", where half of the cases were randomly assigned to a training sample and half to a testing sample. Attempts to replicate a model developed in the training sample using the testing sample were not entirely successful. A number of parameter estimates were substantially different, and these were subsequently shown to be significantly different in multiple-group analyses using cross-group constraints and a chi-square difference test.

There is a discussion starting on page 90 of Frank Harrell's book Regression Modeling Strategies that seems to shed light on why this might happen. In essence, the results are largely a matter of the luck of the draw: choose one random seed when splitting the sample and the results cross-validate; choose another and they might not.

The book then goes on to suggest some improvements on data splitting, the most promising of which appears to be bootstrapping. In the book, this typically involves fitting, say, a regression model to one's entire dataset, refitting the model in a series of bootstrap datasets, and then applying each bootstrap model to the original data in order to estimate the optimism in an index such as R2 or MSE.
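For the plain regression case, a minimal sketch of what that optimism bootstrap might look like in R, using the rms package and made-up data and variable names, would be something like:

library(rms)

## made-up data, just to have something runnable
d <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))

## fit the model to the full dataset; x = TRUE, y = TRUE are needed by validate()
fit <- ols(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)

## validate() refits the model in B bootstrap samples, evaluates each bootstrap
## fit on the original data, and reports optimism-corrected indexes (R-square, MSE, etc.)
validate(fit, B = 200)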

Our SEM would likely require something slightly different. That is, we would need to develop a model based on the entire sample, run the same model on a series of bootstrap datasets, obtain the average (as well as the SD and 95% CI) for each of the model parameters across the bootstrap samples, and then compare that with what we got running the model on the original sample. Some of my other books show something like this for regression (e.g., An R Companion to Applied Regression, page 187; The R Book, page 418).
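In lavaan, assuming the model syntax and data were already in hand (the names 'model' and 'dat' below are just placeholders), I imagine something along these lines would do it:

library(lavaan)

## 'model' is the lavaan model syntax and 'dat' the full dataset
fit <- sem(model, data = dat)

## refit the same model in R bootstrap samples and collect the free
## parameter estimates from each fit (one row per bootstrap sample)
boot <- bootstrapLavaan(fit, R = 1000, FUN = coef)

## mean, SD, and percentile 95% CI of each parameter across the bootstrap
## samples, next to the estimates from the original sample
cbind(original  = coef(fit),
      boot.mean = colMeans(boot),
      boot.sd   = apply(boot, 2, sd),
      t(apply(boot, 2, quantile, probs = c(0.025, 0.975))))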

So now having provided quite a bit of background, let me ask a few questions:

1. Is there any general agreement that the approach I've suggested is the way to go? Are there others besides Dr. Harrell that I could cite in pursuing this approach?

2. Does anyone know of some substantial published applications of this approach using SEM?

3. Would any of the available R packages for SEM (e.g., lavaan, sem, OpenMx) be particularly straightforward to use for the bootstrapping? Thus far, the SEM has been done using MPLUS. I've not tried SEM in R yet, but would be interested in giving it a shot. The SEM itself is relatively straightforward: four latent variables, one with 7 indicators and the others with 4 indicators each, plus a couple of indirect paths involving mediation. The data are pretty non-normal, though, and there is a lot of missingness that might need to be dealt with using multiple imputation. A rough sketch of how the model might be specified in lavaan is below.
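Just to make the structure concrete, here is roughly what I have in mind in lavaan syntax; the indicator names (y1-y7, a1-a4, b1-b4, c1-c4) and the particular structural paths are placeholders, not the real variables:

library(lavaan)

model <- '
  F1 =~ y1 + y2 + y3 + y4 + y5 + y6 + y7
  F2 =~ a1 + a2 + a3 + a4
  F3 =~ b1 + b2 + b3 + b4
  F4 =~ c1 + c2 + c3 + c4

  # a simple mediation: F1 -> F3 -> F4, plus a direct effect
  F3 ~ a*F1 + F2
  F4 ~ b*F3 + c*F1
  indirect := a*b
'

## estimator = "MLR" gives robust SEs and test statistics for non-normal data;
## missing = "fiml" handles the missingness without multiple imputation.
## For bootstrapping, bootstrapLavaan() as sketched above could be used
## (or se = "bootstrap" with plain ML).
fit <- sem(model, data = dat, estimator = "MLR", missing = "fiml")
summary(fit, fit.measures = TRUE, standardized = TRUE)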

Thanks,

Paul 


