[R] glmpath in R

Tue Apr 6 17:21:34 CEST 2010

Hi Claire,

I'm replying and CC-ing the R-help list to get more eyes on your
question, since others will likely have more/better advice, and
someone with a similar question in the future might find this thread
handy.

I've removed your specific research aim since that might be private
information, but you can include it later if others find it
necessary to know in order to help.

On Apr 5, 2010, at 5:44 PM, Claire Wooton wrote:

> Dear Steve,
>
> I came across your posting on the R-help mailing list concerning finding the best lambda in a LASSO-model, and I was wondering whether you would be able to offer any advice based on your experience.
>
> I'm attempting to build a logistic regression model to explore [REDACTED] and recently decided to build a LASSO-model, having learned of the problems with stepwise variable selection. While I've done a fair amount of reading on the topic, I'm still a bit uncertain when it comes to selecting an appropriate value for lambda when using the glmpath package.
>
> Any advice you could offer would be much appreciated.

In general, what I've done is to use cross validation to find this
"best" value for lambda, which I'm defining as the value of lambda
that gives me the model with the lowest "objective score" on my
testing data.

The "objective score" is in quotes because it can change depending on
the problem. For instance, for ordinary regression, the best objective
score could be the lowest mean squared error (or the highest Spearman
rank correlation) on my held-out examples. In your case, for logistic
regression, this could just be the accuracy of the predicted class
labels.
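
To make the two scores concrete, here is a minimal sketch in plain
Python (standard library only, so nothing here is a glmpath call). The
function names and the 0.5 threshold are assumptions made up for
illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values (lower is better)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, predicted_prob, threshold=0.5):
    """Fraction of held-out examples whose predicted class matches the true 0/1 label.

    The 0.5 cut-off is an assumption; a different threshold may suit unbalanced classes.
    """
    predicted_label = [1 if p >= threshold else 0 for p in predicted_prob]
    hits = sum(1 for t, l in zip(y_true, predicted_label) if t == l)
    return hits / len(y_true)


# Made-up held-out data, for illustration only.
mse_example = mean_squared_error([1.0, 2.0], [1.0, 4.0])
acc_example = accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
```

Note that the two scores point in opposite directions: you minimize the
squared error but maximize the accuracy, so whatever picks the "best"
lambda needs to know which one it was given.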

So, I do the CV and, for each fold, get the one value of lambda whose
model has the best generalization properties on that fold's held-out
data. After doing the 10-fold CV (once, or many times), I take the
average of those lambda values and use it for my 'downstream
analysis' by building a model on all of my data with that value of
lambda.
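
The selection step described above can be sketched roughly as follows.
This is plain Python and only covers the bookkeeping: it assumes you
have already fit a model per candidate lambda on each fold and recorded
a held-out score where higher is better (e.g. accuracy). The function
names and the numbers are hypothetical.

```python
from statistics import mean


def best_lambda_per_fold(fold_scores):
    """For each fold, return the lambda with the highest held-out score.

    fold_scores is a list with one dict per fold, mapping a candidate
    lambda to its score on that fold's held-out data. On ties, the
    lambda listed first in the dict wins.
    """
    return [max(scores, key=scores.get) for scores in fold_scores]


def average_best_lambda(fold_scores):
    """Average the per-fold winners to get one lambda for the final fit."""
    return mean(best_lambda_per_fold(fold_scores))


# Made-up scores for 3 folds and 3 candidate lambdas, for illustration only.
example_scores = [
    {0.01: 0.70, 0.1: 0.80, 1.0: 0.60},
    {0.01: 0.75, 0.1: 0.72, 1.0: 0.55},
    {0.01: 0.68, 0.1: 0.81, 1.0: 0.62},
]
winners = best_lambda_per_fold(example_scores)
final_lambda = average_best_lambda(example_scores)
```

The averaged value need not be one of the candidates you actually
tried, which is fine as long as the final model can be fit at an
arbitrary lambda.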

I'd also do some smoke tests to see how sensitive your model is w.r.t.
the data it is given to train on. Do the best lambdas vary a lot from
fold to fold? How different is the model between folds -- are the same
predictor vars non-zero? How much do their coefficients vary? Etc.
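
One rough way to run these checks, again in plain Python and assuming
you have saved the best lambda and the fitted coefficients from each
fold. The function names, the tolerance, and the numbers are all made
up for illustration.

```python
from statistics import mean, pvariance


def lambda_spread(best_lambdas):
    """Summarize how much the per-fold best lambdas move around."""
    return {
        "min": min(best_lambdas),
        "max": max(best_lambdas),
        "mean": mean(best_lambdas),
        "variance": pvariance(best_lambdas),
    }


def _predictor_names(fold_coefs):
    """All predictor names seen in any fold, in sorted order."""
    return sorted({name for coefs in fold_coefs for name in coefs})


def selection_frequency(fold_coefs, tol=1e-8):
    """Fraction of folds in which each predictor has a non-zero coefficient.

    fold_coefs is a list with one dict per fold, mapping predictor name
    to coefficient; a predictor missing from a fold counts as zero.
    """
    n_folds = len(fold_coefs)
    return {
        name: sum(1 for coefs in fold_coefs if abs(coefs.get(name, 0.0)) > tol) / n_folds
        for name in _predictor_names(fold_coefs)
    }


def coefficient_variance(fold_coefs):
    """Variance of each predictor's coefficient across folds."""
    return {
        name: pvariance([coefs.get(name, 0.0) for coefs in fold_coefs])
        for name in _predictor_names(fold_coefs)
    }


# Made-up per-fold results, for illustration only.
example_lambdas = [0.1, 0.01, 0.1]
example_coefs = [
    {"age": 0.5, "bmi": 0.0},
    {"age": 0.7, "bmi": 0.2},
    {"age": 0.6, "bmi": 0.0},
]
spread = lambda_spread(example_lambdas)
frequency = selection_frequency(example_coefs)
variances = coefficient_variance(example_coefs)
```

A predictor that is selected in every fold with a stable coefficient is
more believable than one that appears in only a fold or two.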

Also, what's your objective in building the model? Do you just want
something with high predictive accuracy? Are you trying to draw
conclusions on the model that you build -- like infer meaning from its
coefs?

This should probably go in the beginning of the email, but it's better
late than never:

I should add the disclaimer that I'm not a "real statistician," and
I'm "calling uncle" in advance to the card-carrying statisticians on
this list who might argue that (i) this approach isn't principled
enough; (ii) you shouldn't really take any statistical advice from a
mailing list; and (iii) you'd be best off consulting a local
statistician.

Does that answer your question? If not, could you elaborate on what
you're after?

Please don't forget to CC the R-help list on any further communication.

Thanks,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact