[Rd] Understanding an R improvement that already occurred.

Henrik Bengtsson hb at stat.berkeley.edu
Wed Jan 30 16:53:47 CET 2008


On Jan 30, 2008 7:20 AM, Jay Emerson <jayemerson at gmail.com> wrote:
> I was surprised to observe the following difference between 2.4.1 and
> 2.6.0 after a long overdue upgrade a few months ago of our
> departmental server.  It wasn't a bug fix, but a subtle improvement.
> Here's the simplest example I could create.  The size is excessive, on
> the order of the Netflix Competition data.
>
> The integer matrix is about 1.12 GB, and if coerced to numeric it is
> 2.24 GB.  The peak memory consumption of the first (old) operation was
> 1.12 + 2.24 + 2.24 = 5.6 GB.  The peak memory consumption of the second
> (new) operation is 1.12 + 2.24 = 3.36 GB.  (See below)
>
> In contrast, if a numeric matrix is used, there are no differences
> between the versions (so the improvement seems related to the integer
> type or the decision when/how to do the coercion).  And of course I
> realize that x <- x + as.integer(1) is an option, but that isn't the
> point of this exercise.
>
> I'm curious, but also spending time on memory-related work.  Someone
> deserves a 'thank you' and a pat on the back for making this sort of
> improvement.  Surely someone can step forward and take a bow, and
> perhaps explain the nature of the improvement?
>
> On a related note, a new package bigmemoRy will be available soon,
> handling massive matrices of double, integer, short, or char in RAM.
> In Unix (sorry, Windows), these matrices can also be used with shared
> memory (with mutexes implemented) for parallel processing.  It's a
> niche market, obviously, ideal for data larger than 1 GB (roughly) but
> still within the boundaries of the RAM.  It may be a useful developer
> tool for big-data problems.
>
> ------------------------
> R version 2.4.1 (linux):
> > x <- matrix(as.integer(0), 1e+08, 3)
> > x <- x + 1
> > gc()
>            used   (Mb) gc trigger (Mb)  max used   (Mb)
> Ncells    233754   12.5     467875   25    350000   18.7
> Vcells 300119431 2289.8  787870506 6011 750119944 5723.0
> ------------------------
> R version 2.6.0 (linux):
> > x <- matrix(as.integer(0), 1e+08, 3)
> > x <- x + 1
> > gc()
>            used   (Mb) gc trigger   (Mb)  max used   (Mb)
> Ncells    137931    7.4     350000   18.7    350000   18.7
> Vcells 300126402 2289.8  472877829 3607.8 450126789 3434.2
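
The sizes quoted above can be checked with a little arithmetic.  The
sketch below assumes 4 bytes per integer, 8 bytes per double, and
binary gigabytes (2^30 bytes); the variable names are illustrative
only.  It also shows, on a tiny matrix, that adding 1L keeps the
integer type while adding 1 coerces to double.

```r
# Assumptions: 4-byte integers, 8-byte doubles, 1 GB taken as 2^30 bytes.
n_elements <- 1e8 * 3                 # 100 million rows, 3 columns
gb <- 2^30

int_gb <- n_elements * 4 / gb         # the integer matrix
dbl_gb <- n_elements * 8 / gb         # the same matrix stored as double

# Peaks as broken down in the message: the integer original plus
# two double-sized objects (2.4.1) or one double-sized object (2.6.0).
old_peak <- int_gb + 2 * dbl_gb
new_peak <- int_gb + dbl_gb

print(round(c(int = int_gb, dbl = dbl_gb, old = old_peak, new = new_peak), 2))

# Type behaviour on a small stand-in for the large matrix.
small <- matrix(0L, 4, 3)
type_plus_double  <- typeof(small + 1)    # "double": result is coerced
type_plus_integer <- typeof(small + 1L)   # "integer": no coercion needed
```

The computed peaks come out near 5.59 and 3.35, in line with the
"max used" figures of 5723.0 Mb and 3434.2 Mb reported by gc().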

That's interesting - I never noticed that change.  On the same topic,
in R 2.7.0 devel, the (re-)assignment in the following example no
longer creates an extra copy:

> x <- matrix(1, nrow=5000, ncol=5000)
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   132056   7.1     350000  18.7   350000  18.7
Vcells 25136968 191.8   28050871 214.1 25137357 191.8

> x[1,1] <- 2
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   132060   7.1     350000  18.7   350000  18.7
Vcells 25136969 191.8   29533414 225.4 25137357 191.8

In R 2.6.1 that 2nd assignment would result in:

> x[1,1] <- 2
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   138119   7.4     350000  18.7   350000  18.7
Vcells 25126464 191.7   52877950 403.5 50126482 382.5

See https://stat.ethz.ch/pipermail/r-devel/2007-September/047008.html
for background.
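
A scaled-down way to repeat this kind of check is sketched below.  It
resets the "max used" statistics with gc(reset = TRUE), performs the
assignment, and then reads the peak vector-heap usage.  The matrix
size and variable names are illustrative, and whether a copy shows up
depends on the R version and on how the code is run.

```r
# A smaller matrix than in the transcripts above (about 30.5 Mb of doubles).
x <- matrix(1, nrow = 2000, ncol = 2000)
x_mb <- length(x) * 8 / 2^20          # payload size of x in Mb

invisible(gc(reset = TRUE))           # reset the "max used" columns

x[1, 1] <- 2                          # the assignment under test

stats <- gc()
# The "(Mb)" column immediately after "max used" holds the peak in Mb;
# looking it up by name avoids relying on a fixed column position.
peak_col <- which(colnames(stats) == "max used") + 1
peak_mb <- unname(stats["Vcells", peak_col])

# A peak close to x_mb suggests no extra copy was made; a peak close to
# 2 * x_mb suggests the matrix was duplicated during the assignment.
print(c(x_mb = x_mb, peak_mb = peak_mb))
```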

Thanks a lot to whoever (Luke?) took the time to update matrix().

/Henrik

>
>
> --
> John W. Emerson (Jay)
> Assistant Professor of Statistics
> Director of Graduate Studies (on leave 07-08)
> Department of Statistics
> Yale University
> http://www.stat.yale.edu/~jay
> Statistical Consultant, REvolution Computing
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
