[R] Re: Buying more computer for GLM

g.russell at eos-finance.com
Fri Sep 1 14:34:07 CEST 2006


Prof Brian Ripley wrote
> Probably not, but you have the ability to profile in R and find out.
Thanks.   This is certainly something I could check, and I shall do so.
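
For the record, the sort of thing I plan to try is roughly this (the simulated
data below are just a stand-in for my real learning set, so the sizes and the
binomial family are illustrative only):

## Simulated stand-in for the real learning set (sizes only illustrative)
set.seed(1)
n <- 20000; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(drop(X %*% rnorm(p, sd = 0.2))))
dat <- data.frame(y = y, X)

## Profile a single glm() fit and see where the time goes
Rprof("glm-profile.out")
fit <- glm(y ~ ., data = dat, family = binomial)
Rprof(NULL)
summaryRprof("glm-profile.out")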

> 
> 
> Some more comments:
> 
> 1) The Fortran code that underlies glm is that of lm.fit, which only makes
>    use of level-1 BLAS and so is not going to be helped greatly by an
>    optimized BLAS.

I was afraid it might be something like that.
> 
> 2) No one has as far as I know succeeded in making a multithreaded
>    Rblas.dll for Windows.  And under systems using pthreads, the success
>    with multithreaded BLAS is very mixed, with it resulting in a dramatic
>    slowdown in some problems.

I was afraid of that too.   Oh well.
> 
> 3) As I recall, you were doing model selection via AIC on 20,000
>    observations.  You might want to think hard about that, since AIC is
>    designed for good prediction.  I would do model exploration on a much
>    smaller representative subset, and if I had 20,000 observations and 30
>    parameters and was interested in prediction, not do subset selection at
>    all.

One problem is that some of the parameters in the learning set can be very
highly correlated (I have no control over the observations), and I'm worried
that if I don't prune away parameters which don't improve the log likelihood,
my predictions will be busted by inputs which do not exhibit the same linear
relationships as those of most of the learning set.  Of course in such a case
you'd have to worry about the accuracy of the predictions anyway, but in my
job we just have to make the best predictions we can, even if they aren't
perfect.
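
For concreteness, the kind of subsample exploration I could try looks roughly
like this (the data frame 'dat', the response 'y', the subsample size and the
binomial family are all placeholders rather than my actual setup):

## Explore on a smaller representative subsample, as suggested in point 3
set.seed(2)
subdat <- dat[sample(nrow(dat), 2000), ]

## Flag pairs of highly correlated predictors before any selection
cors <- cor(subdat[, names(subdat) != "y"])
which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)

## AIC-based stepwise selection on the subsample only
small.fit <- glm(y ~ ., data = subdat, family = binomial)
sel <- step(small.fit, trace = 0)
formula(sel)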

> 
> 4) glm() allows you to specify starting parameters, which you could find
>    from a subsample.  Very likely only 1 or 2 iterations would be needed.

This sounds like a good idea, but what in fact I do now is build a model using
simple linear regression (lm), which is very fast, in the hope that it will
pick out the important parameters, which I can then feed to glm.
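
If I do try the suggestion in point 4, I imagine it would look roughly like
this, warm-starting the full glm fit from coefficients estimated on a
subsample (again, the object names and the family are only illustrative):

## Fit on a subsample first, then reuse its coefficients as starting values
set.seed(3)
subdat <- dat[sample(nrow(dat), 2000), ]
sub.fit <- glm(y ~ ., data = subdat, family = binomial)

full.fit <- glm(y ~ ., data = dat, family = binomial,
                start = coef(sub.fit))
full.fit$iter   # usually fewer IWLS iterations than a cold start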

Many thanks again!

George Russell


