[R] glmpath in R

Tue Apr 6 22:57:32 CEST 2010

Claire Wooton wrote:
> Steve Lianoglou <mailinglist.honeypot <at> gmail.com> writes:
> 
>> Hi Claire,
>>
>> I'm replying and CC-ing to the R-help list to get more eyes on your
>> question since others will likely have more/better advice, and perhaps
>> someone else in the future will have a similar question, and might
>> find this thread handy.
>>
>> I've removed your specific research aim since that might be private
>> information, but you can include that later if others find it
>> necessary to know in order to help.
>>
>> On Apr 5, 2010, at 5:44 PM, Claire Wooton wrote:
>>
>>> Dear Steve,
>>>
>>> I came across your posting on the R-help mailing list concerning
>>> finding the best lambda in a LASSO-model, and I was wondering whether
>>> you would be able to offer any advice based on your experience.
>>>
>>> I'm attempting to build a logistic regression model to explore
>>> [REDACTED] and recently decided to build a LASSO-model, having learned
>>> of the problems with stepwise variable selection. While I've done a
>>> fair amount of reading on the topic, I'm still a bit uncertain when it
>>> comes to selecting an appropriate value for lambda when using the
>>> glmpath package.
>>>
>>> Any advice you could offer would be much appreciated.
>>
>> In general, what I've done is to use cross validation to find this
>> "best" value for lambda, which I'm defining as the value of lambda
>> that gives me the model with the lowest "objective score" on my
>> testing data.
>>
>> The "objective score" is in quotes, because it can change given the
>> problem. For instance, for normal regression, the best objective score
>> could be the lowest mean squared error (or the highest Spearman rank
>> correlation) on my held-out examples. In your case, for logistic
>> regression, this could just be the accuracy of the predicted class labels.
>>
>> So, I do the CV and, for each fold, get the value of lambda that
>> returns the model with the best generalization properties on the
>> held-out data. After doing the 10-fold CV (once, or many times), you
>> could take the average value of lambda and use that for your
>> 'downstream analysis' by building a model on all of your data with
>> that value of lambda.
>>
>> I'd also do some smoke tests to see how sensitive your model is w.r.t.
>> the data it is given to train on. Do your best lambdas over each fold
>> vary a lot? How different is the model between folds -- are the same
>> predictor vars non-zero? What's their variance? Etc.
>>
>> Also, what's your objective in building the model? Do you just want
>> something with high predictive accuracy? Are you trying to draw
>> conclusions on the model that you build -- like infer meaning from its
>> coefs?
>>
>> This should probably go in the beginning of the email, but it's better
>> late than never:
>>
>> I should add the disclaimer that I'm not a "real statistician," and
>> I'm "calling uncle" in advance to the card-carrying statisticians on
>> this list who might argue that (i) this approach isn't principled
>> enough, (ii) you shouldn't really take any statistical advice on a
>> mailing list; and (iii) you'd be best off consulting a local
>> statistician.
>>
>> Does that answer your question? If not, could you elaborate more about
>> what you're after?
>>
>> Please don't forget to CC the R-help list on any further communication.
>>
>> Thanks,
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>>
> Hi Steve,
> 
> Thanks very much for your reply. My main objective in building the model is to
> determine the relative strength of the variables in predicting my
> presence/absence data. It's really an exploratory method; I'm interested in
> whether the associations that have been observed out in the field come out in
> the model. I'm also using rpart to build a classification tree to get a sense of
> any interactions. 

rpart is not able to do that.  Apparent interactions from trees are more 
often than not spurious.  To see this, simulate a dataset where males 
have an age range of 10-90 and females have a range of 40-50.  You will see 
splits on age for males but not for females.  This has nothing to do 
with interactions.
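
The simulation described above could be sketched roughly as follows. This 
is only an illustration under assumed settings: the sample size, the seed, 
the variable names (sex, age, y) and the additive logistic model with one 
common age slope are choices made here for the example, not something 
taken from a real dataset.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)     # arbitrary seed, assumed for reproducibility
n <- 2000       # assumed sample size

# Males span ages 10-90; females only 40-50
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
age <- ifelse(sex == "male", runif(n, 10, 90), runif(n, 40, 50))

# Assumed true model: additive on the logit scale, with the same age slope
# for both sexes, so there is no sex-by-age interaction in the data
p <- plogis(-2.5 + 0.05 * age)
y <- factor(rbinom(n, 1, p), levels = c(0, 1),
            labels = c("absent", "present"))

d <- data.frame(y = y, sex = sex, age = age)

# Classification tree on both predictors
fit <- rpart(y ~ sex + age, data = d, method = "class")
print(fit)

# Observed age range within each sex, for comparison with the age cut points
age_ranges <- tapply(d$age, d$sex, range)
print(age_ranges)
```

Whatever tree is printed, age cut points outside 40-50 can only be informed 
by males, since no females exist there.  That is a consequence of the age 
ranges, not of a sex-by-age interaction, which the simulated model does 
not contain.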

Frank

> 
> I was planning to use cross-validation to identify the value of lambda that gives
> the minimum mean CV error, and the largest value of lambda such that the error is
> within 1 SE of the minimum. I'm not entirely sure how to proceed in building the full
> model using this value of lambda. At this point do I simply use predict.glmpath
> (or predict.glmnet) setting the value of "s" to lambda and return the
> coefficients? I plan to validate the chosen coefficient estimates through a
> bootstrap analysis. 
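
For the glmnet route mentioned above, one possible sketch is shown here on 
simulated data.  It assumes the glmnet package is installed; the data, the 
seed and the names x, y, cvfit, b_min, b_1se and p_hat are made up for 
illustration.  cv.glmnet reports the two candidate values as lambda.min 
and lambda.1se, and coef or predict with s set to one of them returns 
results from the fit on all of the data.

```r
library(glmnet)  # assumed to be installed

set.seed(42)     # arbitrary seed; CV folds are random
n <- 200
p <- 10

# Simulated predictors and a binary response that depends on v1 and v2 only
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

# 10-fold cross-validation over the lasso path, scored by binomial deviance
cvfit <- cv.glmnet(x, y, family = "binomial",
                   type.measure = "deviance", nfolds = 10)

cvfit$lambda.min  # lambda with the minimum mean CV error
cvfit$lambda.1se  # largest lambda with error within 1 SE of that minimum

# Coefficients of the full-data fit at each candidate lambda
b_min <- coef(cvfit, s = "lambda.min")
b_1se <- coef(cvfit, s = "lambda.1se")
print(b_1se)

# Fitted probabilities at the 1-SE lambda (in-sample)
p_hat <- predict(cvfit, newx = x, s = "lambda.1se", type = "response")
```

These fitted probabilities are computed on the training data, so they 
overstate predictive accuracy; cvfit$cvm holds the cross-validated 
deviance for each lambda on the path.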
> 
> Beyond conducting this "smoke test", I'm wondering how I should assess the
> resulting model. Can I assess the fit and predictive accuracy of a glmnet object?
> 
> Thanks again for your help. I am also planning on discussing my work with a
> professor in statistics. I appreciate the insight though as I attempt to wrap my
> head around these methods. 
> 
> Cheers,
> 
> Claire
> 

-- 
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University