[R] win2k memory problem with merge()'ing repeatedly (long email)

Sean O'Riordain seanpor at acm.org
Mon May 22 17:50:09 CEST 2006


Thank you very much indeed Bogdan!

> a2[duplicated(a2$mdate),]
                      value2     mdate
318                        0 2006-05-10
322                        0 2006-05-13
324                        0 2006-05-14
326                        0 2006-05-15
328                        0 2006-05-16
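
The same check can be run over every data frame rather than just a2. The following is only a sketch: the three small data frames are made up for illustration (the real files can't be posted), and `frames` and `dup.counts` are names invented here.

```r
# Made-up stand-ins for a1..a63; only a2 contains a repeated date.
a1 <- data.frame(mdate = as.Date("2006-05-10") + 0:3, title.1 = 1:4)
a2 <- data.frame(mdate = as.Date(c("2006-05-10", "2006-05-10",
                                   "2006-05-11", "2006-05-12")),
                 title.2 = c(5, 0, 6, 7))
a3 <- data.frame(mdate = as.Date("2006-05-10") + 0:3, title.3 = 8:11)

# Keep the data frames in a named list instead of 63 separate objects.
frames <- list(a1 = a1, a2 = a2, a3 = a3)

# Count the repeated dates in each data frame.
dup.counts <- sapply(frames, function(d) sum(duplicated(d$mdate)))
print(dup.counts)

# Show the offending rows for the data frames that have any.
for (nm in names(frames)[dup.counts > 0]) {
    d <- frames[[nm]]
    cat("Duplicated dates in", nm, ":\n")
    print(d[duplicated(d$mdate), ])
}
```

Holding the data frames in a list also avoids the eval(parse(...)) gymnastics, since frames[[i]] can be indexed directly.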

What a relief to know what is causing this problem... now to sort out
the root cause!
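
One possible way to sort it out, sketched with made-up data: drop the repeated dates before merging, then fold merge() over a list of the data frames. Keeping the first row for each date is purely an assumption here; which row is actually the right one depends on the data. The names `frames` and `dedup` are invented for the illustration.

```r
# Made-up stand-ins for the real files; a2 repeats 2006-05-10.
frames <- list(
    a1 = data.frame(mdate = as.Date("2006-05-10") + 0:2, title.1 = 1:3),
    a2 = data.frame(mdate = as.Date(c("2006-05-10", "2006-05-10",
                                      "2006-05-11", "2006-05-12")),
                    title.2 = c(5, 0, 6, 7)),
    a3 = data.frame(mdate = as.Date("2006-05-10") + 0:2, title.3 = 7:9)
)

# Assumption: the first row for each date is the one to keep.
dedup <- lapply(frames, function(d) d[!duplicated(d$mdate), , drop = FALSE])

# Fold merge() over the list: merge(merge(a1, a2), a3) and so on.
atot <- Reduce(function(x, y) merge(x, y, by = "mdate", all = TRUE), dedup)

# One row per date, one column per value.
print(dim(atot))
```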

cheers and thanks again!
Sean


On 22/05/06, bogdan romocea <br44114 at gmail.com> wrote:
> Repeated merge()-ing does not always increase the space requirements
> linearly. Keep in mind that a join between two tables where the same
> value appears M and N times will produce M*N rows for that particular
> value. My guess is that the number of rows in atot explodes because
> you have some duplicate values in your files (having the same
> duplicate date in each data frame would cause atot to contain 4, then
> 8, 16, 32, 64... rows for that date).
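
A small simulation bears this out. It is only a sketch with made-up data; make.frame() is a helper invented for the illustration and is not part of the original script.

```r
# Every simulated data frame holds three dates, with 2006-05-10 listed twice.
make.frame <- function(i) {
    d <- data.frame(mdate = as.Date(c("2006-05-10", "2006-05-10",
                                      "2006-05-11", "2006-05-12")))
    d[[paste("value", i, sep = "")]] <- seq_len(nrow(d))
    d
}

atot <- make.frame(1)
rows <- nrow(atot)
for (i in 2:6) {
    atot <- merge(atot, make.frame(i), by = "mdate", all = TRUE)
    rows <- c(rows, nrow(atot))
}

# The duplicated date doubles its row count with every merge,
# while each of the other two dates stays at a single row.
print(rows)                                       # 4 6 10 18 34 66
print(sum(atot$mdate == as.Date("2006-05-10")))   # 64
```
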
>
>
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch
> > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Sean O'Riordain
> > Sent: Monday, May 22, 2006 10:12 AM
> > To: r-help
> > Subject: [R] win2k memory problem with merge()'ing repeatedly
> > (long email)
> >
> > Good afternoon,
> >
> > I have 63 small .csv files which I process daily, and until two
> > weeks ago they processed just fine, took only a matter of moments,
> > and had no noticeable memory problems.  Two weeks ago they
> > reached 318 lines and my script "broke".  There are some
> > missing values in some of the files.  I have tried hard many times
> > over the last two weeks to create a "small" repeatable example to give
> > you but I've failed - unless I use my data it works fine... :-(
> >
> > Am I missing something obvious? (again)
> >
> > A typical file has lines which look like:
> > 01/06/2005,1372
> >
> > Though there are three files which have two values (files 3,32,33) and
> > these have lines which look like...
> > 01/06/2005,1766,
> > or
> > 15/05/2006,289,114
> >
> > a1 <- read.csv("file1.csv",header=F)
> > etc...
> > a63 <- read.csv("file63.csv",header=F)
> > names(a1) <- c("mdate","file1.column.description")
> >
> > atot <- merge(a1,a2,all=T)
> >
> > followed by repeatedly doing...
> > atot <- merge(atot, a3,all=T)
> > atot <- merge(atot, a4,all=T)
> > etc...
> >
> > I normally start R with --vanilla.
> >
> > What appears to happen is that atot doubles in size each iteration and
> > just falls over due to lack of memory at about i=17... even though the
> > total memory required for all of these individual a1...a63 is only
> > 1001384 bytes (doing an object.size() on a1..a63).
> > At this point I've been trying to pin down this problem for two weeks
> > and I just gave up...
> >
> > The following works fine as I'd expect with minimal memory usage...
> >
> > # example values only; the originals were not posted
> > start.date <- "2005-06-01"
> > count <- 318
> > nacount <- 2
> > atot <- data.frame(mdate = as.Date(start.date) + 0:(count - 1))
> >
> > for (i in 3:67) {
> >     datelist <- as.Date(start.date) + 0:(count - 1)
> >     # remove a couple of elements...
> >     datelist <- datelist[-sample(count, nacount)]
> >     a2 <- data.frame(mdate = datelist)
> >     vname <- paste("value", i, sep = "")
> >     a2[[vname]] <- runif(length(datelist))
> >
> >     # merge the freshly simulated data frame into the running total
> >     atot <- merge(atot, a2, all = TRUE)
> >
> >     cat("i:", i, " ", gc(), "\n")
> > }
> >
> > this works fine... but on my files (as per attached 'lastsave.txt'
> > file) it just gobbles memory.
> > Am I doing something wrong?  I (wrongly?) expected that repeatedly
> > calling merge(atot,aN) would only increase the memory requirement linearly
> > (with jumps perhaps as we go through a 2^n boundary)... which is what
> > happens when merging simulated data.frames as above... no problem at
> > all and it's really fast...
> >
> > The attached text file shows a (slightly edited) session where the
> > memory required by the merge() operation just doubles with each use...
> > and I can only allow it to run until i=17!!!
> >
> > I've even run it with gctorture() set on... with similar, but
> > excruciatingly slow results...
> >
> > Is there any relevant info that I'm missing?  Unfortunately I am not
> > able to post the contents of the files to a public list like this...
> >
> > As per a previous thread, I know that I can use a list to handle these
> > dataframes - but I had difficulty with the syntax of a list of
> > dataframes...
> >
> > I'd like to know why the memory requirements for this merge
> > just explode...
> >
> > cheers, (and thanks in advance!)
> > Sean O'Riordain
> >
> > ==============================
> > > version
> >                _
> > platform       i386-pc-mingw32
> > arch           i386
> > os             mingw32
> > system         i386, mingw32
> > status         Patched
> > major          2
> > minor          3.0
> > year           2006
> > month          05
> > day            09
> > svn rev        38014
> > language       R
> > version.string Version 2.3.0 Patched (2006-05-09 r38014)
> > >
> > Running on Win2k with 1Gb ram.
> >
> > I also tried it (with the same results) on 2.2.1 and 2.3.0.
> >
> > ========================================================
> >
> > R : Copyright 2006, The R Foundation for Statistical Computing
> > Version 2.3.0 Patched (2006-05-09 r38014)
> > ISBN 3-900051-07-0
> >
> > R is free software and comes with ABSOLUTELY NO WARRANTY.
> > You are welcome to redistribute it under certain conditions.
> > Type 'license()' or 'licence()' for distribution details.
> >
> >   Natural language support but running in an English locale
> >
> > R is a collaborative project with many contributors.
> > Type 'contributors()' for more information and
> > 'citation()' on how to cite R or R packages in publications.
> >
> > Type 'demo()' for some demos, 'help()' for on-line help, or
> > 'help.start()' for an HTML browser interface to help.
> > Type 'q()' to quit R.
> >
> > > gc()
> >          used (Mb) gc trigger (Mb) max used (Mb)
> > Ncells 178186  4.8     407500 10.9   350000  9.4
> > Vcells  73112  0.6     786432  6.0   333585  2.6
> > > # take the information in the .csv files created from the emails
> > > setwd("C:/Documents and Settings/c_oriordain_s/My Documents/pasip/mms/mms_emails")
> > >
> > > # the input file from Amdocs (as supplied by revenue assurance)
> > > amdocs_csv_filename <- "amdocs_volumes_revised4.csv"
> > > # where shall we put the output plot file
> > > copypath <- "\\\\ient1dfs001\\general\\Process Improvement Projects\\Process Improvement Projects Repository\\Active Projects\\MMS\\01 Measure\\"
> > >
> > > # set to F (false) instead of T (true) if you're just tricking around and you don't
> > > # want to be copying over files to the network drive all the time!
> > > do.copy <- F
> > >
> > > # HOPEFULLY you shouldn't have to trick around with stuff below here!
> > > #
> >
> > # EDIT file names changed to protect the innocent... :-)
> >
> > > a1 <-read.csv("file1.csv",header=F)
> > #EDIT etc... all the way to
> > > a63 <-read.csv("file63.csv", header=F)
> > >
> > > # add an mdate column (parsed from V1) to all 63 of these temporary objects...
> > > for (i in 1:63) {
> > +     # e.g. should look like a63$mdate <- as.Date(a63$V1,format="%d/%m/%Y")
> > +     anum <- paste("a",i,sep="")
> > +     eval(parse(text= paste(anum, "$mdate <- as.Date(" ,anum, "$V1,format=\"%d/%m/%Y\")",sep="") ))
> > + }
> > >
> > >
> > > # three files have three columns...
> >
> > #EDIT here again... to protect the innocent...
> >
> > > names(a3)[3] <- "2nd.column.name.in.file.3"
> > > names(a32)[3] <- "2nd.column.name.in.file.32"
> > > names(a33)[3] <- "2nd.column.name.in.file.33"
> > >
> > > # the rest only have two columns...
> > >
> > > names(a1)[2] <- "title.1"
> > #EDIT
> > > names(a63)[2] <- "title.63"
> > >
> > > for (i in 1:63) {
> > +     # now delete the now irrelevant initial date column for all 63 of these temporary objects...
> > +     # e.g. should look like a33[1] <- NULL
> > +     eval(parse(text=paste("a",i,"[1] <- NULL",sep="")))
> > + }
> > >
> > > a.object.sizes <- vector()
> > > for (i in 1:63) {
> > +     # record the size of each of these 63 temporary objects...
> > +     # e.g. should look like a.object.sizes[33] <- object.size(a33)
> > +     a.name <- paste("a", i, sep="")
> > +     # a.object.sizes[i] <- object.size(a.name)
> > +     a.object.sizes[i] <- eval(parse(text=paste("object.size(",a.name,")", sep="")))
> > + }
> > >
> > > a.object.sizes
> >  [1] 17988 17996 19524 17996 17996 18004 17996 18028 17988 17988 17996 17996 17996 18012 18012 17988 17980 18004 18004
> > [20] 18012 19348 19316 19340 17996 18004 18004 18012 18004 19228 19228 18012 19436 19436 19244 19220 17996 17900 17900
> > [39] 17884 17884 17884 17884 17884 17884 17876 17988 17900 17892  8808 17988  8792  8800  8800  8792  8800  8784 17980
> > [58] 17988 17980  9832  9728  9728  9728
> > >
> > > # merge these tables into one big dataframe...
> > > atot <- merge(a1, a2, all=T)
> > > for (i in 3:17) {
> > +     # construct the text to be evaluated...
> > +     #atot <- merge(atot, a3, all=T)
> > +     cat("The size of object a", i, " is ",
> > a.object.sizes[i], "\n", sep="")
> > +     cat("The current size of atot is ", object.size(atot), "\n")
> > +     a.eval.text <- paste("merge(atot, a", i, ", all=T)", sep="")
> > +     cat("a.eval.text is: -", a.eval.text, "-\n", sep="")
> > +     atot <- eval(parse(text=a.eval.text))
> > +     cat("i is:", i, gc(), "\n\n")
> > + }
> > The size of object a3 is 19524
> > The current size of atot is  19988
> > a.eval.text is: -merge(atot, a3, all=T)-
> > i is: 3 206289 137020 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a4 is 17996
> > The current size of atot is  24300
> > a.eval.text is: -merge(atot, a4, all=T)-
> > i is: 4 206330 137402 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a5 is 17996
> > The current size of atot is  28564
> > a.eval.text is: -merge(atot, a5, all=T)-
> > i is: 5 206411 138044 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a6 is 18004
> > The current size of atot is  36044
> > a.eval.text is: -merge(atot, a6, all=T)-
> > i is: 6 206572 139246 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a7 is 17996
> > The current size of atot is  50236
> > a.eval.text is: -merge(atot, a7, all=T)-
> > i is: 7 206893 141652 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a8 is 18028
> > The current size of atot is  78516
> > a.eval.text is: -merge(atot, a8, all=T)-
> > i is: 8 207534 146614 5.6 1.2 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a9 is 17988
> > The current size of atot is  136252
> > a.eval.text is: -merge(atot, a9, all=T)-
> > i is: 9 208815 157016 5.6 1.2 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a10 is 17988
> > The current size of atot is  255404
> > a.eval.text is: -merge(atot, a10, all=T)-
> > i is: 10 211376 178938 5.7 1.4 407500 786432 10.9 6 362507
> > 786425 9.7 6
> >
> > The size of object a11 is 17996
> > The current size of atot is  502540
> > a.eval.text is: -merge(atot, a11, all=T)-
> > i is: 11 216497 225184 5.8 1.8 467875 889825 12.5 6.8 362507
> > 888747 9.7 6.8
> >
> > The size of object a12 is 17996
> > The current size of atot is  1015940
> > a.eval.text is: -merge(atot, a12, all=T)-
> > i is: 12 226738 322626 6.1 2.5 531268 1577138 14.2 12.1
> > 362507 1569929 9.7 12
> >
> > The size of object a13 is 17996
> > The current size of atot is  2082284
> > a.eval.text is: -merge(atot, a13, all=T)-
> > i is: 13 247219 527588 6.7 4.1 597831 2209110 16 16.9 362507
> > 2749247 9.7 21
> >
> > The size of object a14 is 18012
> > The current size of atot is  4295524
> > a.eval.text is: -merge(atot, a14, all=T)-
> > i is: 14 288180 957830 7.7 7.4 741108 4242831 19.8 32.4 494389 5296330
> > 13.3 40.5
> >
> > The size of object a15 is 18012
> > The current size of atot is  8884444
> > a.eval.text is: -merge(atot, a15, all=T)-
> > i is: 15 370101 1859128 9.9 14.2 1073225 8314706 28.7 63.5 781279
> > 10388430 20.9 79.3
> >
> > The size of object a16 is 17988
> > The current size of atot is  18388580
> > a.eval.text is: -merge(atot, a16, all=T)-
> > i is: 16 533942 3743450 14.3 28.6 1590760 17263040 42.5 131.8 1354559
> > 21430459 36.2 163.6
> >
> > The size of object a17 is 17980
> > The current size of atot is  38050756
> > a.eval.text is: -merge(atot, a17, all=T)-
> > i is: 17 861623 7675772 23.1 58.6 3094291 35309607 82.7 269.4 2501382
> > 44137010 66.8 336.8
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
> >
>


