[R] memory and bootstrapping

Tim Hesterberg timhesterberg at gmail.com
Fri May 6 06:16:42 CEST 2011


(Regarding bootstrapping logistic regression.)

If the number of rows with Y=1 is small, it doesn't matter that n is huge;
the rarer outcome is what governs the effective sample size.

If both the number of successes and the number of failures are huge,
then as Ripley notes you can use asymptotic CIs.  The mean difference
in predicted probabilities
is a nonlinear function of the coefficients, so you can use the delta method
to get standard errors.
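
Here's a minimal sketch of that delta-method computation (illustrative
only, not from the original post; 'fit' is the fitted glm, and 'nd1'
and 'nd2' are hypothetical names for the two newdata frames):

g <- function(beta) {
  ## mean difference in predicted probabilities, as a function of beta
  X1 <- model.matrix(delete.response(terms(fit)), nd1)
  X2 <- model.matrix(delete.response(terms(fit)), nd2)
  mean(plogis(X1 %*% beta)) - mean(plogis(X2 %*% beta))
}
b <- coef(fit)
## numerical gradient by central differences (numDeriv::grad also works)
grad <- sapply(seq_along(b), function(j) {
  h <- 1e-6 * max(1, abs(b[j]))
  bp <- bm <- b
  bp[j] <- b[j] + h
  bm[j] <- b[j] - h
  (g(bp) - g(bm)) / (2 * h)
})
se <- sqrt(drop(t(grad) %*% vcov(fit) %*% grad))
g(b) + c(-1, 1) * qnorm(0.975) * se   # asymptotic 95% CI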

In general, if you're not sure about normality and bias, you can use
the bootstrap to estimate how close to normal the sampling distribution
is.  The results may surprise you.  For example, for a one-sample mean,
if the population has skewness = 2 (like an exponential distribution),
you need n=5000 before the CLT is reasonably accurate (actual 
one-sided non-coverage probabilities within 10% of the nominal, for a
95% interval).
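
As a quick illustration (simulated data, not from the original post),
you can look at the bootstrap distribution of the mean directly:

set.seed(1)
x <- rexp(100)                   # population skewness = 2
boot.means <- replicate(10000, mean(sample(x, replace = TRUE)))
hist(boot.means, breaks = 50)    # still visibly right-skewed at n = 100
qqnorm(boot.means); qqline(boot.means)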

Finally, you can speed up bootstrapping glms by using starting
values based on the coefficients estimated from the original data.
Also, you can compute the model matrix once and resample rows of it
along with y, rather than rebuilding the model matrix in each replicate.
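
Here's a rough sketch of both speedups with the boot package ('fit'
again stands for the glm fitted to the original data):

library(boot)
X  <- model.matrix(fit)    # computed once, reused in every replicate
y  <- fit$y
b0 <- coef(fit)            # original coefficients as starting values

stat <- function(d, i) {
  f <- glm.fit(X[i, ], y[i], family = binomial(), start = b0)
  coef(f)                  # or any statistic based on the coefficients
}
bres <- boot(data = seq_along(y), statistic = stat, R = 2000)

With warm starts each refit typically converges in a couple of IWLS
iterations instead of the usual handful.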

Tim Hesterberg

>The only reason the boot package will take more memory for 2000
>replications than 10 is that it needs to store the results.  That is
>not to say that on a 32-bit OS the fragmentation will not get worse,
>but that is unlikely to be a significant factor.
>
>As for the methodology: 'boot' is support software for a book, so
>please consult it (and not secondary sources).  From your brief
>description it looks to me as if you should be using studentized CIs.
>
>130,000 cases is a lot, and running the experiment on a 1% sample
>may well show that asymptotic CIs are good enough.
>
>On Thu, 5 May 2011, E Hofstadler wrote:
>
>> hello,
>>
>> the following questions will without doubt reveal some fundamental
>> ignorance, but hopefully you can still help me out.
>>
>> I'd like to bootstrap a coefficient gained on the basis of the
>> coefficients in a logistic regression model (the mean differences in
>> the predicted probabilities between two groups, where each predict()
>> operation uses as the newdata argument a dataframe of the same size
>> as the original dataframe). I've got 130,000 rows and 7 columns in my
>> dataframe. The glm-model uses all variables (as well as two 2-way
>> interactions).
>>
>> System:
>> - R-version: 2.12.2
>> - OS: Windows XP Pro, 32-bit
>> - 3.16Ghz intel dual core processor, 2.9GB RAM
>>
>> I'm using the boot package to arrive at the standard errors for this
>> difference, but even with only 10 replications, this takes quite a
>> long time: 216 seconds (perhaps this is partly also due to my
>> inefficiently programmed function underlying the boot call; I'm also
>> looking into that).
>>
>> I wanted to try out calculating a BCa bootstrap confidence
>> interval, which as I understand it requires a lot more replications
>> than normal-theory intervals. Drawing on John Fox's Appendix to his "An R
>> Companion to Applied Regression", I was thinking of trying out 2000
>> replications -- but this will take several hours to compute on my
>> system (which isn't in itself a major issue though).
>>
>> My Questions:
>> - let's say I try bootstrapping with 2000 replications. Can I be
>> certain that the memory available to R will be sufficient for this
>> operation?
>> - (this relates to statistics more generally): is it a good idea in
>> your opinion to try bca-bootstrapping, or can it be assumed that a
>> normal theory confidence interval will be a sufficiently good
>> approximation (letting me get away with, say, 500 replications)?
>>
>>
>> Best,
>> Esther
>
>--
>Brian D. Ripley,                  ripley at stats.ox.ac.uk
>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>University of Oxford,             Tel:  +44 1865 272861 (self)
>1 South Parks Road,                     +44 1865 272866 (PA)
>Oxford OX1 3TG, UK                Fax:  +44 1865 272595