[R] Large data sets and memory management in R.

Peter Dalgaard p.dalgaard at biostat.ku.dk
Wed Jan 28 22:18:39 CET 2004


gerald.jean at dgag.ca writes:

> library(package = "statmod", pos = 2,
>         lib.loc = "/home/jeg002/R-1.8.1/lib/R/R_LIBS")
> 
> qc.B3.tweedie <- glm(formula = pp20B3 ~ ageveh + anpol +
>                      categveh + champion + cie + dossiera +
>                      faq13c + faq5a + kmaff + kmprom + nbvt +
>                      rabprof + sexeprin + newage,
>                      family = tweedie(var.power = 1.577,
>                        link.power = 0),
>                      etastart = log(rep(mean(qc.b3.sans.occ[,
>                         'pp20B3']), nrow(qc.b3.sans.occ))),
>                      weights = unsb3t1,
>                      trace = T,
>                      data = qc.b3.sans.occ)
> 
> After one iteration (45+ minutes) R is thrashing through over 10 GB
> of memory.
> 
> Thanks for any insights,

Well, I don't know how much it helps; you are in somewhat uncharted
territory there. I suppose the dataset comes to 0.5-1GB all by itself?
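
As a quick check, object.size() will tell you what the data frame
itself takes up (assuming the qc.b3.sans.occ object from your call):

    object.size(qc.b3.sans.occ)                     # size in bytes
    as.numeric(object.size(qc.b3.sans.occ)) / 2^20  # roughly, in megabytes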

One thing I note is that you have 60 variables but use only 15.
Perhaps it would help to drop the unused columns before the run?
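Something along these lines should do it (untested, and assuming the
column names from your call, including the pp20B3 response and the
unsb3t1 weights):

    used <- c("pp20B3", "unsb3t1", "ageveh", "anpol", "categveh",
              "champion", "cie", "dossiera", "faq13c", "faq5a",
              "kmaff", "kmprom", "nbvt", "rabprof", "sexeprin",
              "newage")
    qc.b3.sans.occ <- qc.b3.sans.occ[, used]

That keeps the unused columns out of whatever copies get made during
the fit.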

How large does the design matrix get? If some of those variables have
a lot of levels, that could explain the phenomenon. Any chance that a
continuous variable got recorded as a factor?
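
You can check both things without running the full fit; this sketch
again assumes the variable names from your call:

    ## number of levels of each factor column (NA for non-factors)
    sapply(qc.b3.sans.occ,
           function(x) if (is.factor(x)) nlevels(x) else NA)

    ## width of the design matrix, built on a small subset of rows
    dim(model.matrix(~ ageveh + anpol + categveh + champion + cie +
                       dossiera + faq13c + faq5a + kmaff + kmprom +
                       nbvt + rabprof + sexeprin + newage,
                     data = qc.b3.sans.occ[1:1000, ]))

If one of the 'continuous' variables turns up as a factor with
hundreds of levels, that is very likely where the memory goes.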

        -p

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907



