[R] validate (rms package) using step instead of fastbw

Fri Feb 12 18:11:37 CET 2010

Frank, let me make sure I understand:

On Fri, Feb 12, 2010 at 5:57 PM, Frank E Harrell Jr
<f.harrell at vanderbilt.edu> wrote:
> Ramon Diaz-Uriarte wrote:
>>
>> Dear Frank,
>>
>> Thanks a lot for your response. And apologies for the question,
>> because the answer was obviously in the help.
>>
>> As for the caveats on selection: yes, thanks. I think I am actually
>> closely following your book (e.g., pp. 249 to 253), and one of the
>> points I am trying to make to my colleagues is that by doing variable
>> selection, we are actually getting a worse model (as evidenced by the
>> bias-corrected AUC, which is smaller if attempting variable
>> selection).
>>
>>
>> Best,
>>
>> R.
>
> Thanks Ramon.
>
> Bias-corrected measures need to be penalized for all variable selection
> steps and for univariable screening.  When the penalization is complete, you
> usually see worse model performance as compared with full model fits, as you
> wrote.
>

I thought that by calling validate on the original (non-screened)
model with "bw = TRUE", the bias-corrected statistics already include
that penalization. After all, in each bootstrap iteration the selection
process is carried out using only the bootstrap sample, and the
selected model is then tested on data other than that sample (the
original data with the default method = "boot", if I read the help
correctly). So my understanding was that by using the Dxy in the
"index.corrected" column I had accounted for the screening involved in
the variable selection.
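
To make the question concrete, here is a minimal sketch of the kind of
call I mean. It assumes the bwt data frame created by example(birthwt)
in MASS, as in the example quoted further down. B = 200 and the seed are
arbitrary choices, and type = "individual" is passed through validate
to fastbw, as you suggested.

```r
library(MASS)  # birthwt data
library(rms)   # lrm, validate, fastbw

## Assumption: example(birthwt) creates the data frame 'bwt'
example(birthwt, echo = FALSE)

## validate needs the design matrix and response stored in the fit
bwt.lrm <- lrm(low ~ ., data = bwt, x = TRUE, y = TRUE)

set.seed(1)  # arbitrary seed, only for reproducibility
## bw = TRUE repeats the backward selection in every bootstrap sample
val <- validate(bwt.lrm, B = 200, bw = TRUE, rule = "aic",
                type = "individual", pr = FALSE)

## AUC (C index) = 0.5 + Dxy / 2
auc.corrected <- 0.5 + val["Dxy", "index.corrected"] / 2
auc.corrected
```

If this is right, auc.corrected is the number I would compare against
the same quantity obtained with bw = FALSE on the full model.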

Thanks,

R.

> Cheers
> Frank
>
>>
>> On Fri, Feb 12, 2010 at 3:13 PM, Frank E Harrell Jr
>> <f.harrell at vanderbilt.edu> wrote:
>>>
>>> Ramon Diaz-Uriarte wrote:
>>>>
>>>> Dear All,
>>>>
>>>> For logistic regression models: is it possible to use validate (rms
>>>> package) to compute bias-corrected AUC, but have variable selection
>>>> with AIC use step (or stepAIC, from MASS), instead of fastbw?
>>>>
>>>>
>>>> More details:
>>>>
>>>> I've been using the validate function (in the rms package, by Frank
>>>> Harrell) to obtain, among other things, bootstrap bias-corrected
>>>> estimates of the AUC, when variable selection is carried out (using
>>>> AIC as criterion). validate calls predab.resample, which in turn calls
>>>> fastbw (also in rms; originally from the Design package, by Harrell).
>>>> fastbw "Performs a
>>>> slightly inefficient but numerically stable version of fast backward
>>>> elimination on factors, using a method based on Lawless and Singhal
>>>> (1978). This method uses the fitted complete model (...)". However, I
>>>> am finding that the models returned by fastbw are much smaller than
>>>> those returned by stepAIC or step (a simple example is shown below),
>>>> probably because of the approximation and using the complete model.
>>>>
>>>> I'd like to use step instead of fastbw. I think this can be done by
>>>> hacking predab.resample in a couple of places but I am wondering if
>>>> this is a bad idea (why?) or if I am reinventing the wheel.
>>>>
>>>>
>>>> Best,
>>>>
>>>> R.
>>>>
>>>>
>>>> P.S. Simple example of fastbw compared to step:
>>>>
>>>> library(MASS) ## for stepAIC and bwt data
>>>> example(birthwt)
>>>> library(rms)
>>>>
>>>> bwt.glm <- glm(low ~ ., family = binomial, data = bwt)
>>>> bwt.lrm <- lrm(low ~ ., data = bwt)
>>>>
>>>> step(bwt.glm)
>>>> ## same as stepAIC(bwt.glm)
>>>>
>>>> fastbw(bwt.lrm)
>>>
>>> Hi Ramon,
>>>
>>> By default fastbw uses type='residual' to compute test statistics on all
>>> deleted variables combined.  Use type='individual' to get the behavior in
>>> step.  In your example fastbw(..., type='ind') gives the same model as
>>> step() and comes surprisingly close to estimating the MLEs without
>>> refitting.  Of course you refit the reduced model to get MLEs.  Both true
>>> and approximate MLEs are biased by the variable selection so beware.
>>> type= can be passed from calibrate or validate to fastbw.
>>>
>>> Note that none of the statistics computed by step or fastbw were designed
>>> to be used with more than two completely pre-specified models.  Variable
>>> selection is hazardous both to inference and to prediction.  There is no
>>> free lunch; we are torturing data to confess its own sins.
>>>
>>> Frank
>>>
>>> --
>>> Frank E Harrell Jr   Professor and Chairman        School of Medicine
>>>                    Department of Biostatistics   Vanderbilt University
>>>
>>
>>
>

-- 
Ramon Diaz-Uriarte
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz
Phone: +34-91-732-8000 ext. 3019