[R] regsubsets() [leaps package] - please share some good examples of use

Thu Mar 12 15:12:59 CET 2009

If you run the example from ?biglm

data(trees)
ff<-log(Volume)~log(Girth)+log(Height)
chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)
summary(a)

you can then do

b <-regsubsets(a, method="forward")
summary(b)

to get the results of forward selection.  In general, the biglm fit is the `maximum' model for the forward selection: all the variables that you want to consider for inclusion.

      -thomas

On Thu, 12 Mar 2009, Tal Galili wrote:

> Hello dear R-help members,
>
> I recently became interested in using biglm with leaps, and found myself
> somewhat confused as to how to use the two together, in different settings.
>
> I couldn't find any example codes for the leaps() package (except for in the
> help file, and the examples there are not as rich as they could be).  That
> is why I turn to you in case you could share some good tips and examples of
> code on how to use the leaps package (especially the regsubsets command)
>
>
> The problem that drives me to ask this is: how to use the regsubsets()
> command to immulate a forward model selection procedure on a regressions
> problem ?
>
> I attach below a few direction dear Thomas has already wrote to me on the
> subject, and any help would be very welcomed:
>
> *me:*
> I feel I am missing a big something here, so please help me here -
> Let's say we have a dataset with an X matrix of 10 variables, and all we
> want to perform is forward variable selection with AIC, starting from
> the minimal model that includes the intercept only, and with the maximum
> model of all variable and their interaction up to the second order.
> In that range, we wish to find the best model, based on forward selection.
>
> *Thomas:*
> Use biglm() to fit the model with all main effects and all second order
> interactions.  This model will be the maximum model for selection.
>
> The minimum model, by default, is the model with only an intercept, so you
> don't need to specify anything.  If the minimum model is more complicated,
> the vector force.in specifies which terms are in the minimum model (a
> logical vector with TRUE for variables in the minimum model and FALSE for
> variables not in the minimum model).
>
> regsubsets() will give you the best model with one variable, the best with
> two variables, and so on. The object produced by summary() of the
> regsubsets() has a component $cp that gives Mallows' Cp for each of the best
> models. This is equivalent to AIC, or you can compute AIC from the residual
> sum of squares in the $rss component of the object.
>
> regsubsets() doesn't actually fit the models, it just works out the residual
> sum of squares. You need to take the output of regsubsets() and then fit
> which ever of the best models you want coefficients for.
> summary(regsubsets.object)$which is a logical matrix indicating which
> variables are in each of the best models.
> This may seem unnecessarily complicated, but regsubsets() was designed for
> situations where you want lots of best models rather than just one, since
> there are often lots of models that are about equally good.  That's the
> point of the
> plot() method, where you can look at hundreds of best models from 30 or so
> variables and see which variables are in most of the good models, and which
> variables tend to occur together or separately -- for example, if you have
> two related variables such as systolic blood pressure and diastolic blood
> pressure do they substitute for each other or do they tend to occur in the
> same model.
>
>
>
> Thanks all (and again - thanks Thomas for all your patient answers so far)
> Tal
>
>
>
> p.s: I already sent this e-mail once, but couldn't seem to see it on the
> list, so I resent it again - sorry if any of you got it twice.
>
>
>
> ----------------------------------------------
>
>
> My contact information:
> Tal Galili
> Phone number: 972-50-3373767
> FaceBook: Tal Galili
> My Blogs:
> www.talgalili.com
> www.biostatistics.co.il
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle