[R] Is my data set too large

Peter Dalgaard P.Dalgaard at biostat.ku.dk
Tue Dec 12 17:40:13 CET 2006


Aimin Yan wrote:
> I have a data set like this.
> I want to do glm, but I get this error:
>
> Error in model.matrix.default(mt, mf, contrasts) :
>          cannot allocate vector of length 932889958
>
> I am wondering if my data set is too large or I did something wrong.
>
> Is there some limitation for data size for R?
>
> thanks,
>
> Aimin
>
>
>  > p1982<- read.csv("p_1982_aa.csv")
>  > names(p1982)
> [1] "p"   "aa"  "as"  "ms"  "cur" "sc"
>  > str(p1982)
> 'data.frame':   465979 obs. of  6 variables:
>   $ p  : Factor w/ 1982 levels "154l_aa","1A0P_aa",..: 1 1 1 1 1 1 1 1 1 1 ...
>   $ aa : Factor w/ 19 levels "ALA","ARG","ASN",..: 2 16 4 5 18 3 19 3 2 9 ...
>   $ as : num  152.0  15.9  65.1  57.2  28.9 ...
>   $ ms : num  108.8  28.3  59.2  49.9  31.8 ...
>   $ cur: num  -0.1020  0.2564  0.0312 -0.0550  0.0526 ...
>   $ sc : num   92.10 103.67   7.27  72.98  96.12 ...
>  > attach(p1982)
>  > m<-glm(sc~p+aa+as+cur,data=p1982)
> Error in model.matrix.default(mt, mf, contrasts) :
>          cannot allocate vector of length 932889958
>   

Your "p" is a factor with 1982 levels, so the design matrix for your
model is roughly 500000 x 2000 (465979 rows by 2002 columns, which is
exactly the 932889958 in the error message). That is about 1 billion (US)
entries of 8 bytes each, so you need roughly 7.5 GB just to store the
design matrix, before any working copies made during fitting. So either
you don't want "p" in the model or you have indeed exceeded your
capacity.
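
The arithmetic can be checked with a rough sketch like the following. The
sizes are read off the str() output and the formula quoted above, and the
column count assumes R's default treatment contrasts. The small data frame
"toy" is made up purely for illustration; it only confirms how
model.matrix() counts columns and is not the poster's data.

```r
# Sizes taken from the quoted str() output and sc ~ p + aa + as + cur.
n_obs  <- 465979            # rows in p1982
n_cols <- 1 +               # intercept
          (1982 - 1) +      # dummy columns for factor p (treatment contrasts)
          (19 - 1) +        # dummy columns for factor aa
          2                 # numeric covariates as and cur

n_entries <- n_obs * n_cols        # double arithmetic, so no integer overflow
size_gb   <- n_entries * 8 / 1e9   # 8 bytes per double

cat("columns:", n_cols, "\n")
cat("entries:", format(n_entries, scientific = FALSE), "\n")
cat("size in GB:", round(size_gb, 2), "\n")

# Scaled-down check on hypothetical data: 5 levels of p, 4 levels of aa.
set.seed(1)
toy <- data.frame(
  p   = factor(rep(paste0("prot", 1:5), each = 8)),
  aa  = factor(rep(c("ALA", "ARG", "ASN", "ASP"), times = 10)),
  as  = runif(40),
  cur = rnorm(40),
  sc  = rnorm(40)
)
X <- model.matrix(sc ~ p + aa + as + cur, data = toy)
dim(X)   # 40 rows, 1 + (5 - 1) + (4 - 1) + 2 = 10 columns
```

The same counting rule applied to the full data gives the 2002 columns
used above, so the requested vector length is rows times columns.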


-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907



