[R] Bestglm subset analysis

Thu Jun 30 03:17:32 CEST 2016

Hi Doug,
To expand a bit on what Bert has written, all the "best
subset/best model" procedures use random variation in the dataset to
produce a result. This means that your "best model" will almost
certainly include variables whose effect cannot be replicated.
Sometimes you can spot this when a variable is included that, on the
basis of current knowledge, shouldn't make any difference to the
response variable. More generally, you can identify such problems by
replication. Whenever you use an automated procedure like this, it's
up to you to provide evidence that the result is not peculiar to the
dataset, especially when many measures are taken on few cases.
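
One rough way to check this is to repeat the selection on bootstrap
resamples and count how often each variable is chosen. The sketch below
is only an illustration under stated assumptions: the data are simulated
(79 cases, 21 predictors, of which only x1 and x2 truly affect y), and
forward selection by AIC with step() from base R stands in for the
exhaustive bestglm search to keep the run short. All object names
(X, select_vars, inclusion) are made up for the example.

```r
set.seed(42)                        # arbitrary seed so the sketch is repeatable
n <- 79                             # number of cases, as in the question
p <- 21                             # number of candidate predictors

# Simulated predictors x1..x21 (an assumption; not the poster's data)
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", seq_len(p))

# Simulated response: only x1 and x2 have a real effect
X$y <- 1.0 * X$x1 + 0.8 * X$x2 + rnorm(n)

# Largest model the search is allowed to reach
full <- reformulate(paste0("x", seq_len(p)), response = "y")

# Forward selection by AIC on one dataset; returns the chosen variable names
select_vars <- function(d) {
  null_fit <- lm(y ~ 1, data = d)
  fit <- step(null_fit,
              scope = list(lower = ~ 1, upper = full),
              direction = "forward", trace = 0)
  attr(terms(fit), "term.labels")
}

# Repeat the selection on bootstrap resamples and tally inclusions
B <- 50
counts <- setNames(numeric(p), paste0("x", seq_len(p)))
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)  # resample cases with replacement
  chosen <- select_vars(X[idx, ])
  counts[chosen] <- counts[chosen] + 1
}

# Proportion of resamples in which each variable was selected
inclusion <- sort(counts / B, decreasing = TRUE)
print(round(inclusion, 2))
```

Variables with a real effect should be selected in nearly every
resample, while variables that enter only now and then are the ones
most likely to be peculiar to the dataset. This is a stability check
on the same data, not a substitute for replication on new cases.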

Jim

On Thu, Jun 30, 2016 at 4:24 AM, D Wolf via R-help <r-help at r-project.org> wrote:
> Hello All,
> I am working on a linear regression model and trying to find the best subset of variables for my dataset. I have 21 predictors, 1 response variable, and 79 observations. I need to find the best 5 or 6 predictors for my model. I've used leaps for lm() and I'm now trying bestglm for glm(). I'm following this webpage, which gives the code below. https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html
> My code:
> library(bestglm)
> library(base)
> lbw.for.bestglm <- within(df_Chl, {y <- df_Chl$Chloro })
> res.bestglm <- bestglm(Xy = lbw.for.bestglm, family = gaussian, IC = "AIC", method = "exhaustive")
> # get coefficients
> res.bestglm$BestModels
>
> Here is a sample of my results (I removed the 5th through 21st predictors for brevity).
> > res.bestglm$BestModels
>     R21   R31   R32   R41
> 1 FALSE FALSE FALSE FALSE
> 2 FALSE  TRUE FALSE FALSE
> 3 FALSE FALSE FALSE FALSE
> 4 FALSE  TRUE FALSE FALSE
> 5 FALSE  TRUE FALSE FALSE
>   Criterion
> 1  326.7327
> 2  326.9525
> 3  327.0659
> 4  327.0912
> 5  327.8208
>
> Is it correct to assume I should keep variables that are TRUE from 1 through 5? What do those five rows represent?
> I know the AIC criterion result should be as low as possible. Is it possible to discern a good result for any of the information criteria, such as AIC, LOOCV, BICg, etc.? If BIC returns lower Criterion results, does that mean I need to use the BIC subset instead of the subset from AIC?
> Thank You,
> Doug