[R] win2k memory problem with merge()'ing repeatedly (long email)

Mon May 22 17:37:00 CEST 2006

Repeated merge()-ing does not always increase the space requirements
linearly. Keep in mind that a join between two tables where the same
key value appears M times in one and N times in the other will produce
M*N rows for that particular value. My guess is that the number of rows
in atot explodes because you have some duplicate values in your files
(having the same date appear twice in every data frame would cause atot
to contain 4, then 8, 16, 32, 64... rows for that date).
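
As a rough illustration of the M*N effect described above, here is a
minimal sketch on made-up data. The helper make_df(), the column names
and the dates are all hypothetical and are not taken from the original
files; the only assumption is that one date occurs twice in every data
frame.

```r
# Hypothetical data: the date 2006-05-15 appears twice in every data frame.
make_df <- function(i) {
  d <- data.frame(mdate = as.Date(c("2006-05-14", "2006-05-15", "2006-05-15")))
  d[[paste0("value", i)]] <- seq_len(nrow(d)) * i  # distinct value column per frame
  d
}

frames <- lapply(1:6, make_df)

# Merge one frame at a time and count the rows for the duplicated date.
atot <- frames[[1]]
dup_rows <- integer(0)
for (i in 2:6) {
  atot <- merge(atot, frames[[i]], all = TRUE)  # joins on the shared column "mdate"
  dup_rows <- c(dup_rows, sum(atot$mdate == as.Date("2006-05-15")))
}
dup_rows  # 4 8 16 32 64

# One way to spot the culprits before merging: which frames repeat a date?
has_dups <- sapply(frames, function(d) anyDuplicated(d$mdate) > 0)
which(has_dups)
```

If a check like anyDuplicated() flags some of the real data frames, then
removing or aggregating the repeated dates before merging should keep
the row count from doubling.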

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Sean O'Riordain
> Sent: Monday, May 22, 2006 10:12 AM
> To: r-help
> Subject: [R] win2k memory problem with merge()'ing repeatedly (long email)
>
> Good afternoon,
>
> I have 63 small .csv files which I process daily, and until two
> weeks ago they processed just fine, took only a matter of moments,
> and caused no noticeable memory problems.  Two weeks ago they
> reached 318 lines and my script "broke".  There are some
> missing values in some of the files.  I have tried hard many times
> over the last two weeks to create a "small" repeatable example to give
> you, but I've failed - unless I use my data it works fine... :-(
>
> Am I missing something obvious? (again)
>
> A typical file has lines which look like:
> 01/06/2005,1372
>
> Though there are three files which have two values (files 3,32,33) and
> these have lines which look like...
> 01/06/2005,1766,
> or
> 15/05/2006,289,114
>
> a1 <- read.csv("file1.csv",header=F)
> etc...
> a63 <- read.csv("file63.csv",header=F)
> names(a1) <- c("mdate","file1.column.description")
>
> atot <- merge(a1,a2,all=T)
>
> followed by repeatedly doing...
> atot <- merge(atot, a3,all=T)
> atot <- merge(atot, a4,all=T)
> etc...
>
> I normally start R with --vanilla.
>
> What appears to happen is that atot doubles in size each iteration and
> just falls over due to lack of memory at about i=17... even though the
> total memory required for all of these individual a1...a63 is only
> 1001384 bytes (doing an object.size() on a1..a63).
> At this point I've been trying to pin down this problem for two weeks
> and I just gave up...
>
> The following works fine as I'd expect with minimal memory usage...
>
> for (i in 3:67) {
>     datelist <- as.Date(start.date)+0:(count-1)
>     #remove a couple of elements...
>     datelist <- datelist[-(floor(runif(nacount)*count))]
>     a2 <- as.data.frame(datelist)
>     names(a2) <- "mdate"
>     vname <- paste("value", i, sep="")
>     a2[vname] <- runif(length(datelist))
>     #a2[floor(runif(nacount)*count), vname] <- NA
>
>     # atot <- merge(atot,a2,all=T)
>     i <- 2
>     a.eval.text <- paste("merge(atot, a", i, ", all=T)", sep="")
>     cat("a.eval.text is: -", a.eval.text, "-\n", sep="")
>     atot <- eval(parse(text=a.eval.text))
>
>     cat("i:", i, " ", gc(), "\n")
> }
>
> This works fine... but on my files (as per attached 'lastsave.txt'
> file) it just gobbles memory.
> Am I doing something wrong?  I (wrongly?) expected that repeatedly
> calling merge(atot,aN) would only increase the memory requirement
> linearly (with jumps perhaps as we go through a 2^n boundary)... which
> is what happens when merging simulated data.frames as above... no
> problem at all and it's really fast...
>
> The attached text file shows a (slightly edited) session where the
> memory required by the merge() operation just doubles with each use...
> and I can only allow it to run until i=17!!!
>
> I've even run it with gctorture() set on... with similar, but
> excruciatingly slow results...
>
> Is there any relevant info that I'm missing?  Unfortunately I am not
> able to post the contents of the files to a public list like this...
>
> As per a previous thread, I know that I can use a list to handle these
> dataframes - but I had difficulty with the syntax of a list of
> dataframes...
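
On the list-of-data-frames syntax mentioned just above, a minimal
sketch, assuming every data frame shares a single key column called
"mdate" and has otherwise distinct column names. The data here is made
up; with the real files the list would be built from the read.csv()
calls instead.

```r
# Hypothetical stand-ins for a1, a2, ...: one shared key column plus one value column each.
a_list <- lapply(1:5, function(i) {
  d <- data.frame(mdate = as.Date("2005-06-01") + 0:3)
  d[[paste0("title.", i)]] <- i * 10 + 0:3
  d
})

# Individual data frames are reached with double brackets, e.g. a_list[[3]].
names(a_list[[3]])

# Fold merge() over the whole list instead of building code with eval(parse()).
atot <- Reduce(function(x, y) merge(x, y, all = TRUE), a_list)
dim(atot)  # 4 rows, 6 columns, because no date is repeated within a frame
```

This only replaces the a1..a63 bookkeeping; it does not by itself cure
the row explosion if dates are duplicated within a file.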
>
> I'd like to know why the memory requirements for this merge
> just explode...
>
> cheers, (and thanks in advance!)
> Sean O'Riordain
>
> ==============================
> > version
>                _
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> status         Patched
> major          2
> minor          3.0
> year           2006
> month          05
> day            09
> svn rev        38014
> language       R
> version.string Version 2.3.0 Patched (2006-05-09 r38014)
> >
> Running on Win2k with 1Gb ram.
>
> I also tried it (with the same results) on 2.2.1 and 2.3.0.
>
> ========================================================
>
> R : Copyright 2006, The R Foundation for Statistical Computing
> Version 2.3.0 Patched (2006-05-09 r38014)
> ISBN 3-900051-07-0
>
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>
>   Natural language support but running in an English locale
>
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
>
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
>
> > gc()
>          used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 178186  4.8     407500 10.9   350000  9.4
> Vcells  73112  0.6     786432  6.0   333585  2.6
> > # take the information in the .csv files created from the emails
> > setwd("C:/Documents and Settings/c_oriordain_s/My Documents/pasip/mms/mms_emails")
> >
> > # the input file from Amdocs (as supplied by revenue assurance)
> > amdocs_csv_filename <- "amdocs_volumes_revised4.csv"
> > # where shall we put the output plot file
> > copypath <- "\\\\ient1dfs001\\general\\Process Improvement Projects\\Process Improvement Projects Repository\\Active Projects\\MMS\\01 Measure\\"
> >
> > # set to F (false) instead of T (true) if you're just tricking around and you don't
> > # want to be copying over files to the network drive all the time!
> > do.copy <- F
> >
> > # HOPEFULLY you shouldn't have to trick around with stuff below here!
> > #
>
> # EDIT file names changed to protect the innocent... :-)
>
> > a1 <-read.csv("file1.csv",header=F)
> #EDIT etc... all the way to
> > a63 <-read.csv("file63.csv", header=F)
> >
> > # parse the initial date column (V1) into a Date column "mdate" for all 63 of these temporary objects...
> > for (i in 1:63) {
> +     # e.g. should look like a63$mdate <- as.Date(a63$V1,format="%d/%m/%Y")
> +     anum <- paste("a",i,sep="")
> +     eval(parse(text= paste(anum, "$mdate <- as.Date(" ,anum, "$V1,format=\"%d/%m/%Y\")",sep="") ))
> + }
> >
> >
> > # three files have three columns...
>
> #EDIT here again... to protect the innocent...
>
> > names(a3)[3] <- "2nd.column.name.in.file.3"
> > names(a32)[3] <- "2nd.column.name.in.file.32"
> > names(a33)[3] <- "2nd.column.name.in.file.33"
> >
> > # the rest only have two columns...
> >
> > names(a1)[2] <- "title.1"
> #EDIT
> > names(a63)[2] <- "title.63"
> >
> > for (i in 1:63) {
> +     # now delete the now irrelevant initial date column for all 63 of these temporary objects...
> +     # e.g. should look like a33[1] <- NULL
> +     eval(parse(text=paste("a",i,"[1] <- NULL",sep="")))
> + }
> >
> > a.object.sizes <- vector()
> > for (i in 1:63) {
> +     # record the size in bytes of each of these 63 temporary objects...
> +     # e.g. should look like object.size(a33)
> +     a.name <- paste("a", i, sep="")
> +     # a.object.sizes[i] <- object.size(a.name)
> +     a.object.sizes[i] <- eval(parse(text=paste("object.size(",a.name,")", sep="")))
> + }
> >
> > a.object.sizes
>  [1] 17988 17996 19524 17996 17996 18004 17996 18028 17988 17988 17996 17996 17996 18012 18012 17988 17980 18004 18004
> [20] 18012 19348 19316 19340 17996 18004 18004 18012 18004 19228 19228 18012 19436 19436 19244 19220 17996 17900 17900
> [39] 17884 17884 17884 17884 17884 17884 17876 17988 17900 17892  8808 17988  8792  8800  8800  8792  8800  8784 17980
> [58] 17988 17980  9832  9728  9728  9728
> >
> > # merge these tables into one big dataframe...
> > atot <- merge(a1, a2, all=T)
> > for (i in 3:17) {
> +     # construct the text to be evaluated...
> +     #atot <- merge(atot, a3, all=T)
> +     cat("The size of object a", i, " is ", a.object.sizes[i], "\n", sep="")
> +     cat("The current size of atot is ", object.size(atot), "\n")
> +     a.eval.text <- paste("merge(atot, a", i, ", all=T)", sep="")
> +     cat("a.eval.text is: -", a.eval.text, "-\n", sep="")
> +     atot <- eval(parse(text=a.eval.text))
> +     cat("i is:", i, gc(), "\n\n")
> + }
> The size of object a3 is 19524
> The current size of atot is  19988
> a.eval.text is: -merge(atot, a3, all=T)-
> i is: 3 206289 137020 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
>
> The size of object a4 is 17996
> The current size of atot is  24300
> a.eval.text is: -merge(atot, a4, all=T)-
> i is: 4 206330 137402 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
>
> The size of object a5 is 17996
> The current size of atot is  28564
> a.eval.text is: -merge(atot, a5, all=T)-
> i is: 5 206411 138044 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
>
> The size of object a6 is 18004
> The current size of atot is  36044
> a.eval.text is: -merge(atot, a6, all=T)-
> i is: 6 206572 139246 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
>
> The size of object a7 is 17996
> The current size of atot is  50236
> a.eval.text is: -merge(atot, a7, all=T)-
> i is: 7 206893 141652 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
>
> The size of object a8 is 18028
> The current size of atot is  78516
> a.eval.text is: -merge(atot, a8, all=T)-
> i is: 8 207534 146614 5.6 1.2 407500 786432 10.9 6 362507 786425 9.7 6
>
> The size of object a9 is 17988
> The current size of atot is  136252
> a.eval.text is: -merge(atot, a9, all=T)-
> i is: 9 208815 157016 5.6 1.2 407500 786432 10.9 6 362507 786425 9.7 6
>
> The size of object a10 is 17988
> The current size of atot is  255404
> a.eval.text is: -merge(atot, a10, all=T)-
> i is: 10 211376 178938 5.7 1.4 407500 786432 10.9 6 362507 786425 9.7 6
>
> The size of object a11 is 17996
> The current size of atot is  502540
> a.eval.text is: -merge(atot, a11, all=T)-
> i is: 11 216497 225184 5.8 1.8 467875 889825 12.5 6.8 362507 888747 9.7 6.8
>
> The size of object a12 is 17996
> The current size of atot is  1015940
> a.eval.text is: -merge(atot, a12, all=T)-
> i is: 12 226738 322626 6.1 2.5 531268 1577138 14.2 12.1 362507 1569929 9.7 12
>
> The size of object a13 is 17996
> The current size of atot is  2082284
> a.eval.text is: -merge(atot, a13, all=T)-
> i is: 13 247219 527588 6.7 4.1 597831 2209110 16 16.9 362507 2749247 9.7 21
>
> The size of object a14 is 18012
> The current size of atot is  4295524
> a.eval.text is: -merge(atot, a14, all=T)-
> i is: 14 288180 957830 7.7 7.4 741108 4242831 19.8 32.4 494389 5296330 13.3 40.5
>
> The size of object a15 is 18012
> The current size of atot is  8884444
> a.eval.text is: -merge(atot, a15, all=T)-
> i is: 15 370101 1859128 9.9 14.2 1073225 8314706 28.7 63.5 781279 10388430 20.9 79.3
>
> The size of object a16 is 17988
> The current size of atot is  18388580
> a.eval.text is: -merge(atot, a16, all=T)-
> i is: 16 533942 3743450 14.3 28.6 1590760 17263040 42.5 131.8 1354559 21430459 36.2 163.6
>
> The size of object a17 is 17980
> The current size of atot is  38050756
> a.eval.text is: -merge(atot, a17, all=T)-
> i is: 17 861623 7675772 23.1 58.6 3094291 35309607 82.7 269.4 2501382 44137010 66.8 336.8