[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory

Gabor Grothendieck ggrothendieck at gmail.com
Thu Aug 9 21:49:34 CEST 2007


Try it as a factor:

> big2 <- rep(letters,length=1e6)
> object.size(big2)/1e6
[1] 4.000856
> object.size(as.factor(big2))/1e6
[1] 4.001184

> big3 <- paste(big2,big2,sep='')
> object.size(big3)/1e6
[1] 36.00002
> object.size(as.factor(big3))/1e6
[1] 4.001184
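
The saving comes from the factor storing each distinct string only
once, with an integer code per element.  You can get the same effect
at read time; a minimal sketch (untested, with "big.csv" standing in
for the real file):

big <- read.csv("big.csv", colClasses = "factor")  # read every column as a factor
object.size(big)/1e6                               # levels stored once, integer codes per row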


On 8/9/07, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
> On Thu, 9 Aug 2007, Michael Cassin wrote:
>
> > I really appreciate the advice and this database solution will be useful to
> > me for other problems, but in this case I  need to address the specific
> > problem of scan and read.* using so much memory.
> >
> > Is this expected behaviour? Can the memory usage be explained, and can it be
> > made more efficient?  For what it's worth, I'd be glad to try to help if the
> > code for scan is considered to be worth reviewing.
>
> Mike,
>
> This does not seem to be an issue with scan() per se.
>
> Notice the difference in size of big2, big3, and bigThree here:
>
> > big2 <- rep(letters,length=1e6)
> > object.size(big2)/1e6
> [1] 4.000856
> > big3 <- paste(big2,big2,sep='')
> > object.size(big3)/1e6
> [1] 36.00002
> >
> > cat(big2, file='lotsaletters.txt', sep='\n')
> > bigTwo <- scan('lotsaletters.txt',what='')
> Read 1000000 items
> > object.size(bigTwo)/1e6
> [1] 4.000856
> > cat(big3, file='moreletters.txt', sep='\n')
> > bigThree <- scan('moreletters.txt',what='')
> Read 1000000 items
> > object.size(bigThree)/1e6
> [1] 4.000856
> > all.equal(big3,bigThree)
> [1] TRUE
>
>
> Chuck
>
> p.s.
> > version
>                _
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> status
> major          2
> minor          5.1
> year           2007
> month          06
> day            27
> svn rev        42083
> language       R
> version.string R version 2.5.1 (2007-06-27)
> >
>
> >
> > Regards, Mike
> >
> > On 8/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> >>
> >> Just one other thing.
> >>
> >> The command in my prior post reads the data into an in-memory database.
> >> If you find that is a problem then you can read it into a disk-based
> >> database by adding the dbname argument to the sqldf call
> >> naming the database.  The database need not exist.  It will
> >> be created by sqldf and then deleted when it's through:
> >>
> >> DF <- sqldf("select * from f", dbname = tempfile(),
> >>   file.format = list(header = TRUE, row.names = FALSE))
> >>
> >>
> >> On 8/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> >>> Another thing you could try would be reading it into a database and
> >>> then from there into R.
> >>>
> >>> The devel version of sqldf has this capability.  That is, it will use
> >>> RSQLite to read the file directly into the database without going
> >>> through R at all, and then read it from there into R, so it's a
> >>> completely different process.
> >>> The RSQLite software has no capability of dealing with quotes (they
> >>> will be regarded as ordinary characters), but a single gsub can
> >>> remove them afterwards.  This won't work if there are commas within
> >>> the quotes, but in that case you could read each row as a single
> >>> record and then split it yourself in R.
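> >>>
> >>> For example, one pass over the text columns could strip them (an
> >>> untested sketch; it assumes the quoted columns come back from the
> >>> database as character, with DF being the data frame produced by
> >>> the sqldf call below):
> >>>
> >>> ix <- sapply(DF, is.character)                   # text columns
> >>> DF[ix] <- lapply(DF[ix], gsub, pattern = '"', replacement = "")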
> >>>
> >>> Try this
> >>>
> >>> library(sqldf)
> >>> # next statement grabs the devel version software that does this
> >>> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
> >>>
> >>> gc()
> >>> f <- file("big.csv")
> >>> DF <- sqldf("select * from f", file.format = list(header = TRUE,
> >>> row.names = FALSE))
> >>> gc()
> >>>
> >>> For more info see the man page from the devel version and the home page:
> >>>
> >>> http://sqldf.googlecode.com/svn/trunk/man/sqldf.Rd
> >>> http://code.google.com/p/sqldf/
> >>>
> >>>
> >>> On 8/9/07, Michael Cassin <michael at cassin.name> wrote:
> >>>> Thanks for looking, but my file has quotes.  It's also 400MB, and I
> >>>> don't mind waiting, but don't have 6x the memory to read it in.
> >>>>
> >>>>
> >>>> On 8/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> >>>>> If we add quote = FALSE to the write.csv statement, it's twice as
> >>>>> fast to read in.
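> >>>>>
> >>>>> That is, the sample file written without quoted fields (a sketch of
> >>>>> just that change):
> >>>>>
> >>>>> write.csv(matrix(as.character(1:1e6), ncol = 10, byrow = TRUE),
> >>>>>   "big.csv", row.names = FALSE, quote = FALSE)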
> >>>>>
> >>>>> On 8/9/07, Michael Cassin <michael at cassin.name> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've been having similar experiences and haven't been able to
> >>>>>> substantially improve the efficiency using the guidance in the I/O
> >>>>>> Manual.
> >>>>>>
> >>>>>> Could anyone advise on how to improve the following scan()?  It is
> >>>>>> not based on my real file; please assume that I do need to read in
> >>>>>> characters and can't do any pre-processing of the file, etc.
> >>>>>>
> >>>>>> ## Create Sample File
> >>>>>>
> >>>>>> write.csv(matrix(as.character(1:1e6), ncol = 10, byrow = TRUE),
> >>>>>>   "big.csv", row.names = FALSE)
> >>>>>> q()
> >>>>>>
> >>>>>> **New Session**
> >>>>>> #R
> >>>>>> system("ls -l big.csv")
> >>>>>> system("free -m")
> >>>>>>
> >>>>>> big1 <- matrix(scan("big.csv", sep = ",", what = character(0),
> >>>>>>   skip = 1, n = 1e6), ncol = 10, byrow = TRUE)
> >>>>>> system("free -m")
> >>>>>>
> >>>>>> The file is approximately 9MB, but approximately 50-60MB is used
> >>>>>> to read it in.
> >>>>>>
> >>>>>> object.size(big1) is 56MB, or 56 bytes per string, which seems
> >>>>>> excessive.
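> >>>>>>
> >>>>>> (A quick check of that per-element figure, using the big1 object
> >>>>>> from above:)
> >>>>>>
> >>>>>> object.size(big1)/length(big1)   # about 56 bytes per string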
> >>>>>>
> >>>>>> Regards, Mike
> >>>>>>
> >>>>>> Configuration info:
> >>>>>>> sessionInfo()
> >>>>>> R version 2.5.1 (2007-06-27)
> >>>>>> x86_64-redhat-linux-gnu
> >>>>>> locale:
> >>>>>> C
> >>>>>> attached base packages:
> >>>>>> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"
> >>>> "methods"
> >>>>>> [7] "base"
> >>>>>>
> >>>>>> # uname -a
> >>>>>> Linux ***.com 2.6.9-023stab044.4-smp #1 SMP Thu May 24 17:20:37
> >>>>>> MSD 2007 x86_64 x86_64 x86_64 GNU/Linux
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ====== Quoted Text ====
> >>>>>> From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
> >>>>>>  Date: Tue, 26 Jun 2007 17:53:28 +0100 (BST)
> >>>>>>
> >>>>>>  The R Data Import/Export Manual points out several ways in which
> >>>>>>  you can use read.csv more efficiently.
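> >>>>>>
> >>>>>>  For example, declaring column classes and the row count up front
> >>>>>>  saves read.csv from guessing types and re-growing its buffers (a
> >>>>>>  sketch only; the classes are placeholders for the 22 columns of
> >>>>>>  the s.csv example below):
> >>>>>>
> >>>>>>  s <- read.csv("s.csv", colClasses = rep("character", 22),
> >>>>>>                nrows = 500000)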
> >>>>>>
> >>>>>>  On Tue, 26 Jun 2007, ivo welch wrote:
> >>>>>>
> >>>>>> > dear R experts:
> >>>>>> >
> >>>>>>> I am of course no R expert, but use it regularly.  I thought I
> >>>>>>> would share some experimentation with memory use.  I run a Linux
> >>>>>>> machine with about 4GB of memory, and R 2.5.0.
> >>>>>>>
> >>>>>>> upon startup, gc() reports
> >>>>>>>
> >>>>>>>         used (Mb) gc trigger (Mb) max used (Mb)
> >>>>>>> Ncells 268755 14.4     407500 21.8   350000 18.7
> >>>>>>> Vcells 139137   1.1     786432  6.0   444750  3.4
> >>>>>>>
> >>>>>>> This is my baseline.  Linux 'top' reports 48MB as baseline.  This
> >>>>>>> includes some of my own routines that are always loaded.  Good.
> >>>>>>>
> >>>>>>>
> >>>>>>> Next, I created an s.csv file with 22 variables and 500,000
> >>>>>>> observations, taking up 115MB of uncompressed disk space.  The
> >>>>>>> resulting object.size() after a read.csv() is 84,002,712 bytes
> >>>>>>> (80MB).
> >>>>>>>
> >>>>>>>> s= read.csv("s.csv");
> >>>>>>>> object.size(s);
> >>>>>>>
> >>>>>>> [1] 84002712
> >>>>>>>
> >>>>>>>
> >>>>>>> Here is where things get more interesting.  After the read.csv()
> >>>>>>> is finished, gc() reports
> >>>>>>>
> >>>>>>>           used (Mb) gc trigger  (Mb) max used  (Mb)
> >>>>>>> Ncells   270505 14.5    8349948 446.0 11268682 601.9
> >>>>>>> Vcells 10639515 81.2   34345544 262.1 42834692 326.9
> >>>>>>>
> >>>>>>> I was a bit surprised by this: R had 928MB of memory in use at
> >>>>>>> its peak.  More interestingly, this is also similar to what Linux
> >>>>>>> 'top' reports as memory use of the R process (919MB, probably
> >>>>>>> 1024 vs. 1000 B/MB), even after the read.csv() is finished and
> >>>>>>> gc() has been run.  Nothing seems to have been released back to
> >>>>>>> the OS.
> >>>>>>>
> >>>>>>> Now,
> >>>>>>>
> >>>>>>>> rm(s)
> >>>>>>>> gc()
> >>>>>>>         used (Mb) gc trigger  (Mb) max used  (Mb)
> >>>>>>> Ncells 270541 14.5    6679958 356.8 11268755 601.9
> >>>>>>> Vcells 139481   1.1   27476536 209.7 42807620 326.6
> >>>>>>>
> >>>>>>> Linux 'top' now reports 650MB of memory use (though R itself uses
> >>>>>>> only 15.6MB).  My guess is that it keeps the gc trigger memory of
> >>>>>>> 567MB plus the base 48MB.
> >>>>>>>
> >>>>>>>
> >>>>>>> There are two interesting observations for me here.  First, to
> >>>>>>> read a .csv file, I need to have at least 10-15 times as much
> >>>>>>> memory as the file that I want to read, a lot more than the
> >>>>>>> factor of 3-4 that I had expected.  The moral is that IF R can
> >>>>>>> read a .csv file, one need not worry too much about running into
> >>>>>>> memory constraints later on.  {R Developers: reducing read.csv's
> >>>>>>> memory requirement a little would be nice.  Of course, you have
> >>>>>>> more than enough on your plate already.}
> >>>>>>>
> >>>>>>> Second, memory is not returned fully to the OS.  This is not
> >>>>>>> necessarily a bad thing, but good to know.
> >>>>>>>
> >>>>>>> Hope this helps...
> >>>>>>>
> >>>>>>> Sincerely,
> >>>>>>>
> >>>>>>> /iaw
> >>>>>>>
> >>>>>> --
> >>>>>> Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
> >>>>>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> >>>>>> University of Oxford,             Tel:  +44 1865 272861 (self)
> >>>>>> 1 South Parks Road,                     +44 1865 272866 (PA)
> >>>>>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu               UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>


