[R] Avoiding copies in list assignments

Peter Langfelder peter.langfelder at gmail.com
Sat Aug 24 00:42:56 CEST 2013


One more question about avoiding copies when modifying lists. I would
like to call a function (call it 'f') that operates on a large vector
according to a given index. For example:

f = function(data, index) sum(data[index])

The idea is to call f() repeatedly with the same 'data' but different
'index' arguments. For reasons I won't get into, I need to call the
function via do.call, so I create a list that holds the arguments and
update its 'index' component before each call, as in this rather
trivial example:


> n = 2e8;
> set.seed(1);
> x = rnorm(n);
> gc();
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182412    9.8     407500   21.8    350000   18.7
Vcells 200278475 1528.1  221144237 1687.2 200519577 1529.9

## x takes roughly 1.5GB, which makes sense (2e8 doubles at 8 bytes each)

> args = list(data = x);
>
> gc();
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182422    9.8     407500   21.8    350000   18.7
Vcells 400278489 3053.9  441644452 3369.5 400598513 3056.4

## Here x seems to have been copied since memory usage doubled

>
> system.time( {
+   for (i in 1:4)
+   {
+     args$index = i:(10+3*i)
+     do.call(f, args);
+     print(gc())
+   }
+ })
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182900    9.8     407500   21.8    350000   18.7
Vcells 400279034 3053.9  487077007 3716.2 401240778 3061.3
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182994    9.8     407500   21.8    350000   18.7
Vcells 400279163 3053.9  630538264 4810.7 600279205 4579.8
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182994    9.8     407500   21.8    350000   18.7
Vcells 400279171 3053.9  630538264 4810.7 600279358 4579.8
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182994    9.8     407500   21.8    350000   18.7
Vcells 400279171 3053.9  630538264 4810.7 600279376 4579.8
   user  system elapsed
  0.808   0.617   1.447

In the second iteration the interpreter apparently needed one more
(temporary) copy of x, since 'max used' went up by another 1.5GB
(from 3061.3 to 4579.8 Mb).

Note also that the timing (about 1.4 s elapsed for four trivial sums)
suggests that most of the time was spent copying memory.
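
The gc() figures above only show that memory usage went up; they do not show which statement triggered the duplication. A minimal sketch of one way to check this is tracemem(), which prints a message whenever the traced object is duplicated. It assumes an R build with memory profiling enabled (hence the capabilities() guard), and it uses a much smaller vector than the 2e8-element one above. What it prints, if anything, depends on the R version, so no expected output is shown.

```r
f <- function(data, index) sum(data[index])

set.seed(1)
x <- rnorm(1e6)   # small stand-in for the large vector in the post

# tracemem() is only available when R was built with memory profiling
traced <- isTRUE(unname(capabilities("profmem")))
if (traced) invisible(tracemem(x))

# Any duplication of x caused by these statements is reported by tracemem
args <- list(data = x)
args$index <- 1:10
res <- do.call(f, args)

if (traced) untracemem(x)
```

If a "tracemem[...]" line appears right after a statement, that statement duplicated x.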

This code can of course be written by calling f directly. Start a new
session and run:

> f = function(data, index) sum(data[index])
> n = 2e8;
> set.seed(1);
> x = rnorm(n);
> gc();
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182412    9.8     407500   21.8    350000   18.7
Vcells 200278475 1528.1  221144237 1687.2 200519577 1529.9
>
> system.time( {
+ for (i in 1:4)
+ {
+   index = i:(10+3*i)
+   f(x, index)
+   print(gc())
+ }
+ })
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    183320    9.8     407500   21.8    350000   18.7
Vcells 200279810 1528.1  243975520 1861.4 201806004 1539.7
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    183414    9.8     407500   21.8    350000   18.7
Vcells 200279939 1528.1  256254296 1955.1 201806004 1539.7
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    183414    9.8     407500   21.8    350000   18.7
Vcells 200279947 1528.1  269147010 2053.5 201806004 1539.7
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    183414    9.8     407500   21.8    350000   18.7
Vcells 200279947 1528.1  282684360 2156.8 201806004 1539.7
   user  system elapsed
  0.059   0.000   0.060


Here x was not copied, and elapsed time dropped from about 1.45 s to
0.06 s, a factor of roughly 24.

My question is, can the list operations be made more efficient or can
one use the do.call construct or something equivalent without having
all these extra copies and the memory and time overhead they incur?
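
One possibility worth trying, given here as a sketch and not as a measured fix for this R version: store the name of the data in the argument list instead of the data itself. do.call() evaluates symbol arguments in its 'envir' (by default the calling frame), so the list never holds the large vector and modifying the list cannot copy it. The smaller vector and the 'results' and 'direct' variables are illustrative assumptions, and the approach only works if 'x' is visible from where do.call() is invoked.

```r
f <- function(data, index) sum(data[index])

set.seed(1)
x <- rnorm(1e6)   # small stand-in for the 2e8-element vector

# The list holds the symbol `x`, not the vector it refers to
args <- list(data = quote(x))

results <- numeric(4)
for (i in 1:4) {
  args$index <- i:(10 + 3 * i)     # modifies only a tiny list
  results[i] <- do.call(f, args)   # `x` is looked up when the call is evaluated
}

# Same computation with direct calls, for comparison
direct <- sapply(1:4, function(i) f(x, i:(10 + 3 * i)))
```

Here 'results' and 'direct' should agree, since both evaluate f on the same vector and indices.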

Thanks,

Peter

> sessionInfo()
R version 3.0.1 Patched (2013-06-26 r63071)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base


