[R] glmpath in R

Claire Wooton cwooton at geog.ubc.ca
Tue Apr 6 18:31:15 CEST 2010


Steve Lianoglou <mailinglist.honeypot <at> gmail.com> writes:

> 
> Hi Claire,
> 
> I'm replying and CC-ing to the R-help list to get more eyes on your
> question since others will likely have more/better advice, and perhaps
> someone else in the future will have a similar question, and might
> find this thread handy.
> 
> I've removed your specific research aim since that might be private
> information, but you can include that later if others find it
> necessary to know in order to help.
> 
> On Apr 5, 2010, at 5:44 PM, Claire Wooton wrote:
> 
> > Dear Steve,
> >
> > I came across your posting on the R-help mailing list concerning
> > finding the best lambda in a LASSO model, and I was wondering whether
> > you would be able to offer any advice based on your experience.
> >
> > I'm attempting to build a logistic regression model to explore
> > [REDACTED] and recently decided to build a LASSO model, having learned
> > of the problems with stepwise variable selection. While I've done a
> > fair amount of reading on the topic, I'm still a bit uncertain when it
> > comes to selecting an appropriate value for lambda when using the
> > glmpath package.
> >
> > Any advice you could offer would be much appreciated.
> 
> In general, what I've done is to use cross-validation to find this
> "best" value for lambda, which I'm defining as the value of lambda
> that gives me the model with the best "objective score" on my
> testing data.
> 
> The "objective score" is in quotes, because it can change given the
> problem. For instance, for normal regression, the best objective score
> could be the "lowest mean squared error" (or highest spearman rank) on
> my held out examples. In your case, for logistic regression, this
> could just be accuracy of the class labels.
> 
> So, I do the CV and get one value of lambda for each fold -- the value
> that returns the model with the best generalization properties on the
> held-out data. After doing the 10-fold CV (once, or many times), you
> could take the average value of lambda and use that for your
> 'downstream analysis' by building a model on all of your data with
> that value of lambda.
> 
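A bare-bones version of that fold-by-fold search might look something like
this (just a sketch: it assumes a numeric predictor matrix x and a 0/1
response y, and uses glmnet, whose lasso path is essentially the same as
glmpath's):

library(glmnet)

set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(x)))
best.lambda <- numeric(k)

for (i in 1:k) {
  train <- folds != i
  fit   <- glmnet(x[train, ], y[train], family = "binomial")
  # predicted class labels on the held-out fold, one column per lambda
  pred  <- predict(fit, newx = x[!train, , drop = FALSE], type = "class")
  acc   <- colMeans(pred == y[!train])
  best.lambda[i] <- fit$lambda[which.max(acc)]
}

# refit on all the data and read off coefficients at the averaged lambda
full <- glmnet(x, y, family = "binomial")
coef(full, s = mean(best.lambda))
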
> I'd also do some smoke tests to see how sensitive your model is w.r.t.
> the data it is given to train on. Do your best lambdas over each fold
> vary a lot? How different is the model between folds -- are the same
> predictor vars non-zero? What's their variance? Etc.
> 
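Continuing the sketch above, a quick way to look at those questions
(reusing x, y, k, folds and best.lambda) might be:

# proportion of folds in which each term has a non-zero coefficient
nonzero <- sapply(1:k, function(i) {
  train <- folds != i
  fit   <- glmnet(x[train, ], y[train], family = "binomial")
  as.matrix(coef(fit, s = best.lambda[i]))[, 1] != 0
})
rowMeans(nonzero)

# how much do the per-fold lambdas move around?
range(best.lambda)
sd(best.lambda)
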
> Also, what's your objective in building the model? Do you just want
> something with high predictive accuracy? Are you trying to draw
> conclusions on the model that you build -- like infer meaning from its
> coefs?
> 
> This should probably go in the beginning of the email, but it's better
> late than never:
> 
> I should add the disclaimer that I'm not a "real statistician," and
> I'm "calling uncle" in advance to the card carrying statisticians on
> this list that might argue that (i) this approach isn't principled
> enough, (ii) you shouldn't really take any statistical advice on a
> mailing list; and (iii) you'd be best off consulting a local
> statistician.
> 
> Does that answer your question? If not, could you elaborate more about
> what you're after?
> 
> Please don't forget to CC the R-help list on any further communication.
> 
> Thanks,
> -steve
> 
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
> 
> 
Hi Steve,

Thanks very much for your reply. My main objective in building the model is to
determine the relative strength of the variables in predicting my
presence/absence data. It's really an exploratory analysis: I'm interested in
whether the associations that have been observed out in the field come out in
the model. I'm also using rpart to build a classification tree to get a sense
of any interactions.
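
The rpart side is just along these lines (dat and presence are placeholders
for my data frame and presence/absence response):

library(rpart)

tree <- rpart(presence ~ ., data = dat, method = "class")
printcp(tree)                    # cross-validated error by tree size
plot(tree); text(tree, use.n = TRUE)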

I was planning to use cross-validation to identify both the value of lambda that
gives the minimum mean CV error and the largest value of lambda whose error is
within 1 SE of that minimum. I'm not entirely sure how to proceed in building the
full model using this value of lambda. At this point, do I simply use
predict.glmpath (or predict.glmnet), setting the value of "s" to lambda, and
return the coefficients? I plan to validate the chosen coefficient estimates
through a bootstrap analysis.
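
Roughly, what I have in mind with cv.glmnet (assuming a predictor matrix x and
a 0/1 response y) is something like:

library(glmnet)

cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class", nfolds = 10)

cvfit$lambda.min   # lambda giving minimum mean CV misclassification error
cvfit$lambda.1se   # largest lambda with error within 1 SE of that minimum

# coefficients of the full-data path at the chosen lambda
coef(cvfit$glmnet.fit, s = cvfit$lambda.1se)
# or equivalently:
predict(cvfit$glmnet.fit, type = "coefficients", s = cvfit$lambda.1se)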

Beyond conducting this "smoke test", I'm wondering how I should assess the
resulting model. Can I assess the fit and predictive accuracy of a glmnet object?
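
One simple check I could think of is to hold out part of the data, refit on the
rest, and score the held-out predictions at the chosen lambda -- a sketch,
reusing x, y and cvfit from above:

set.seed(2)
test <- sample(nrow(x), round(nrow(x) / 5))

fit  <- glmnet(x[-test, ], y[-test], family = "binomial")
phat <- predict(fit, newx = x[test, , drop = FALSE],
                s = cvfit$lambda.1se, type = "response")

# classification accuracy at a 0.5 cutoff
mean((phat > 0.5) == (y[test] == 1))
# an ROC curve / AUC (e.g. via the ROCR package) gives a threshold-free view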

Thanks again for your help. I am also planning on discussing my work with a
professor in statistics. I appreciate the insight, though, as I attempt to wrap
my head around these methods.

Cheers,

Claire


