[Rd] R 3.0.0 memory use

Tim Hesterberg timhesterberg at gmail.com
Mon Apr 15 06:07:05 CEST 2013


When I change the data set size, the "extra allocations" do
not change in size. This supports Luke and Martin's diagnosis.

There are either 2 or 4 extra allocations of each of these sizes:
 80040
240048
320040
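
Those sizes are consistent with length-10^4 vectors at 8, 24, and 32
bytes per element plus a small header (the 8n, 24n, 32n pattern described
further down). A quick sanity check for the 8-byte case; the exact header
size varies by R version and platform, so the comparison is approximate:

```r
# Sanity check: the 80040-byte allocations match a length-10^4 double
# vector: 8 bytes per element plus a small header (~40 bytes in R 3.0.0,
# 48 on current 64-bit builds).
n <- 10^4
sz <- as.numeric(object.size(numeric(n)))
cat("double vector of length", n, "occupies", sz, "bytes\n")
stopifnot(abs(sz - 8 * n) < 100)
```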

Details (you may skip):

(Fresh session of R 3.0.0)
> y <- 1:10^4 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
320040 :80040 :240048 :320040 :80040 :240048 :80040 :"as.data.frame.numeric" "as.data.frame" 
320040 :80040 :240048 :320040 :80040 :240048 :> 
> # Try increasing size by a factor of 10
> y <- 1:10^5 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
320040 :80040 :240048 :320040 :80040 :240048 :800040 :"as.data.frame.numeric" "as.data.frame" 
320040 :80040 :240048 :320040 :80040 :240048 :> 

The number of allocations shown, by size:

size    3.0.0   3.0.0   2.15.3  2.15.3
        first   second  first   second
240048  4       4       0       0
320040  4       4       0       0
 80040  5       4       1       0
800040  0       1       0       1

So it looks like both R 2.15.3 and R 3.0.0 are making
one copy of the data, plus extra allocations.
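
That one copy can also be confirmed directly with tracemem(), which is
available in the same --enable-memory-profiling builds and prints a
message each time R duplicates the traced vector. A minimal sketch,
guarded so it also runs on builds without memory profiling:

```r
# Sketch: tracemem() marks y so each duplication prints a "tracemem[...]"
# message; in R 3.0.0-era builds at least one copy message is expected
# from as.data.frame(). capabilities("profmem") reports whether the build
# has memory profiling.
y <- 1:10^4 + 0.0
if (capabilities("profmem")) tracemem(y)    # copies of y now print a message
d <- as.data.frame(y)                       # the data copy happens here
if (capabilities("profmem")) untracemem(y)
stopifnot(identical(d$y, y))                # d holds the same values as y
```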

(Fresh session of R 2.15.3)
> y <- 1:10^4 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
80040 :"as.data.frame.numeric" "as.data.frame" 
> # Increase size by factor of 10
> y <- 1:10^5 + 0.0
> Rprofmem("temp.out", threshold = 10^4)
> d <- as.data.frame(y)
> Rprofmem(NULL); system("cat temp.out")
800040 :"as.data.frame.numeric" "as.data.frame" 
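
Martin's gdb trace below attributes the recurring 320040/80040/240048
allocations to the parser's growData() buffers rather than to
as.data.frame() itself. If that diagnosis holds, one way to keep them out
of an Rprofmem() log is to toggle profiling inside a function: a top-level
line is parsed in full before it is evaluated, so its parser allocations
land before profiling starts. A sketch (profile_call is a hypothetical
helper, not part of R):

```r
# Sketch (assumes the parser diagnosis below): Rprofmem() is switched on
# and off inside the function body, so the allocations from parsing the
# top-level call happen before profiling begins and stay out of the log.
profile_call <- function(f, ...) {
  Rprofmem("temp.out", threshold = 10^4)
  res <- f(...)
  Rprofmem(NULL)
  list(result = res, log = readLines("temp.out", warn = FALSE))
}
y <- 1:10^4 + 0.0
if (capabilities("profmem")) {
  out <- profile_call(as.data.frame, y)
  cat(out$log, sep = "\n")   # should now show mainly as.data.frame's own allocations
}
```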




On Sun, 14 Apr 2013 19:15:45 -0700 Martin Morgan <mtmorgan at fhcrc.org> wrote:
>On 04/14/2013 07:11 PM, luke-tierney at uiowa.edu wrote:
>> There were a couple of bug fixes to somewhat obscure compound
>> assignment related bugs that required bumping up internal reference
>> counts. It's possible that one or more of these are responsible. If so
>> it is unavoidable for now, but it's worth finding out for sure. With
>> some stripped down test examples it should be possible to identify
>> when things changed. I won't have time to look for some time, but if
>> someone else wanted to nail this down that would be useful.
>
>I can't quite tell from Tim's script what he's documenting. In R-2.15.3 I have
>
> > Rprofmem(); Rprofmem(NULL); readLines("Rprofmem.out", warn=FALSE)
>character(0)
>
>(or sometimes [1] "new page:new page:\"Rprofmem\" ")
>
>whereas in R-3.0.0
>
> > Rprofmem(); Rprofmem(NULL); readLines("Rprofmem.out", warn=FALSE)
>[1] "320040 :80040 :240048 :320040 :80040 :240048 :"
>
>I think these are the allocations Tim is seeing. They're from the parser (see
>below) rather than as.data.frame. For Tim's example
>
>   y <- 1:10^4 + 0.0
>   Rprofmem(); d <- as.data.frame(y); Rprofmem(NULL); readLines("Rprofmem.out")
>
>[1] "320040 :80040 :240048 :320040 :80040 :240048 :80040
>:\"as.data.frame.numeric\" \"as.data.frame\" "
>[2] "320040 :80040 :240048 :320040 :80040 :240048 :"
>
>only the allocation 80040 is from as.data.frame (from the call stack output).
>
>Under R -d gdb
>
>   (gdb) b R_OutputStackTrace
>   (gdb) r
>   > Rprofmem(); Rprofmem(NULL)
>
>   Breakpoint 1, R_OutputStackTrace (file=0xbd43f0) at
>/home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3434
>   3434	{
>   (gdb) bt
>   #0  R_OutputStackTrace (file=0xbd43f0) at
>/home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3434
>   #1  0x00007ffff792ff83 in R_ReportAllocation (size=320040) at
>/home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3456
>   #2  Rf_allocVector (type=13, length=80000) at
>/home/mtmorgan/src/R-3-0-branch/src/main/memory.c:2478
>   #3  0x00007ffff790bedf in growData () at gram.y:3391
>
>and the memory allocations are from these lines in the parser gram.y
>
>	PROTECT( bigger = allocVector( INTSXP, data_size * DATA_ROWS ) ) ;
>	PROTECT( biggertext = allocVector( STRSXP, data_size ) );
>
>I'm not sure why these show up under R 3.0.0, though.
>
>$ R-2-15-branch/bin/R --version
>R version 2.15.3 Patched (2013-03-13 r62579) -- "Security Blanket"
>Copyright (C) 2013 The R Foundation for Statistical Computing
>ISBN 3-900051-07-0
>Platform: x86_64-unknown-linux-gnu (64-bit)
>
>R-3-0-branch$ bin/R --version
>R version 3.0.0 Patched (2013-04-14 r62579) -- "Masked Marvel"
>Copyright (C) 2013 The R Foundation for Statistical Computing
>Platform: x86_64-unknown-linux-gnu (64-bit)
>
>Martin
>
>
>
>>
>> Best,
>>
>> luke
>>
>> On Sun, 14 Apr 2013, Tim Hesterberg wrote:
>>
>>> I did some benchmarking of data frame code, and
>>> it appears that R 3.0.0 is far worse than earlier versions of R
>>> in terms of how many large objects it allocates space for,
>>> for data frame operations - creation, subscripting, subscript replacement.
>>> For a data frame with n rows, it makes either 2 or 4 extra copies of
>>> all of:
>>>        8n bytes (e.g. double precision)
>>>        24n bytes
>>>        32n bytes
>>> E.g., for as.data.frame(numeric vector), instead of allocations
>>> totalling ~8n bytes, it allocates 33 times that much.
>>>
>>> Here, compare columns 3 and 5
>>> (columns 2 and 4 are with the dataframe package).
>>>
>>> # Summary
>>> #                               R-2.14.2        R-2.15.3        R-3.0.0
>>> #                               w/o     with    w/o     with    w/o
>>> #       as.data.frame(y)        3       1       1       1       5;4;4
>>> #       data.frame(y)           7       3       4       2       6;2;2
>>> #       data.frame(y, z)        7 each  3 each  4       2       8;4;4
>>> #       as.data.frame(l)        8       3       5       2       9;4;4
>>> #       data.frame(l)           13      5       8       3       12;4;4
>>> #       d$z <- z                3,2     1,1     3,1     2,1     7;4;4,1
>>> #       d[["z"]] <- z           4,3     1,1     3,1     2,1     7;4;4,1
>>> #       d[, "z"] <- z           6,4,2   2,2,1   4,2,2   3,2,1   8;4;4,2,2
>>> #       d["z"] <- z             6,5,2   2,2,1   4,2,2   3,2,1   8;4;4,2,2
>>> #       d["z"] <- list(z=z)     6,3,2   2,2,1   4,2,2   3,2,1   8;4;4,2,2
>>> #       d["z"] <- Z #list(z=z)  6,2,2   2,1,1   4,1,2   3,1,1   8;4;4,1,2
>>> #       a <- d["y"]             2       1       2       1       6;4;4
>>> #       a <- d[, "y", drop=F]   2       1       2       1       6;4;4
>>>
>>> # Where two numbers are given, they refer to:
>>> #   (copies of the old data frame),
>>> #   (copies of the new column)
>>> # A third number refers to numbers of
>>> #   (copies made of an integer vector of row names)
>>>
>>> # For R 3.0.0, I'm getting astounding results - many more copies,
>>> # and also some copies of larger objects; in addition to the data
>>> # vectors of size 80K and 160K, also 240K and 320K.
>>> # Where three numbers are given in form a;c;d, they refer to
>>> #   (copies of 80K; 240K; 320K)
>>>
>>> The benchmarks are at
>>> http://www.timhesterberg.net/r-packages/memory.R
>>>
>>> I'm using versions of R I installed from source on a Linux box, using e.g.
>>> ./configure --prefix=(my path) --enable-memory-profiling --with-readline=no
>>> make
>>> make install
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>
>
>--
>Computational Biology / Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N.
>PO Box 19024 Seattle, WA 98109
>
>Location: Arnold Building M1 B861
>Phone: (206) 667-2793


