[R] Confidence intervals and polynomial fits

Ben Haller rhelp at sticksoftware.com
Fri May 6 22:16:12 CEST 2011


On May 6, 2011, at 1:58 PM, Prof Brian Ripley wrote:

> On Fri, 6 May 2011, Bert Gunter wrote:
> 
>> FWIW:
>> 
>> Fitting higher order polynomials (say > 2) is almost always a bad idea.
>> 
>> See e.g. the Hastie, Tibshirani, et al. book on "Statistical
>> Learning" for a detailed explanation of why. The Wikipedia entry on
>> "smoothing splines" also contains a brief explanation, I believe.
>> 
>> Your ~0 P values for the coefficients also suggest problems/confusion
>> (!) -- perhaps you need to consider something along the lines of
>> "functional data analysis"  for your analysis.
>> 
>> Having no knowledge of your issues, these remarks are entirely
>> speculative and may well be wrong. So feel free to dismiss.
>> Nevertheless, you may find it useful to consult your local
>> statistician for help.
> 
> That is the main piece of advice I would have given.  But if you must DIY, consider the merits of orthogonal polynomials.  Computing individual confidence intervals for highly correlated coefficients is very dubious practice.  Without the example the posting guide asked for, we can only guess if that is what is happening.

  Thanks to both of you.  Yes, I am admittedly out of my depth; the statistician I would normally ask is on sabbatical, and I'm a bit at sea.  Of course McGill has a whole department of mathematics and statistics; I guess I ought to try to make a friend over there (I'm in the biology department).  Anyhow, I've just downloaded the Hastie et al. book and will try to figure out whether my use of higher order polynomials is incorrect in my situation.  Eliminating those would certainly solve my problem with the confidence intervals.  :->
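  One concrete thing I thought I might try, once I've done the reading, is a simple nested-model comparison to see whether the higher-order terms actually earn their keep.  A rough sketch with made-up data (my real variables and degrees differ, of course):

set.seed(1)
x <- runif(1000)
y <- 2 + 3*x - 1.5*x^2 + rnorm(1000, sd = 0.5)  # toy data; the true curve here is quadratic

fit2 <- lm(y ~ poly(x, 2))   # quadratic fit, orthogonal polynomial basis
fit3 <- lm(y ~ poly(x, 3))   # cubic fit
anova(fit2, fit3)            # F-test: does the cubic term actually improve the fit?

  Does that seem like a sensible first check, or is it exactly the sort of thing the book will warn me off of?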

  I was figuring that the ~0 P-values for the coefficients were just the result of my having 300,000 data points: with that much data, the regression procedure can pin down very precise estimates of the coefficients.  I'll look into "functional data analysis" as you recommend, though; I'm entirely unfamiliar with it.
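  My reasoning was along these lines: with n = 300,000, even a trivially small true effect should produce an essentially-zero P-value.  A toy simulation of the kind I had in mind (invented slope and variables, not my real model):

set.seed(2)
n <- 3e5
x <- rnorm(n)
y <- 0.01 * x + rnorm(n)          # true slope of 0.01 -- practically negligible
summary(lm(y ~ x))$coefficients   # the slope's P-value should still come out near zero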

  As for correlated coefficients: x, x^2, x^3, etc. would obviously be highly correlated for values close to zero.  Is this what you mean as a potential source of problems?  Or, if you mean that the various other terms in my model might be correlated with x, that is not the case: each independent variable is completely uncorrelated with the others (this data comes from simulations, so the independent variables for each data point were in fact chosen by random drawing).
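  If the former is the issue, then -- assuming I follow Prof. Ripley's suggestion about orthogonal polynomials correctly -- the comparison would be something along these lines (again with a made-up x and y standing in for my real data):

set.seed(3)
x <- runif(500, 0, 10)
cor(cbind(x, x^2, x^3))   # raw powers: columns highly correlated with one another
cor(poly(x, 3))           # orthogonal polynomial basis: correlations essentially zero

y <- 1 + 0.5*x - 0.02*x^2 + rnorm(500)
confint(lm(y ~ x + I(x^2) + I(x^3)))   # intervals for the strongly correlated raw coefficients
confint(lm(y ~ poly(x, 3)))            # intervals on the orthogonal basis

  Is it right that the intervals from the raw-power fit are the "very dubious" ones, while those on the orthogonal basis are the ones worth interpreting?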

  It didn't seem easy to post an example, since my dataset is so large, but if either of you would be willing to look at this further, I could upload the dataset to a web server somewhere and post a link to it.  In any case, thanks very much for your help; I'll look into the things you mentioned.

Ben Haller
McGill University

http://biology.mcgill.ca/grad/ben/


