[Rd] Unnecessary extra copy with matrix(..., dimnames=NULL) (Was: Re: modifying large R objects in place)

Henrik Bengtsson hb at stat.berkeley.edu
Thu Sep 27 20:30:34 CEST 2007


As others already mentioned, in your example you are first creating an
integer matrix and the coercing it to a double matrix by assigning
(double) 1 to element [1,1].  However, even when correcting for this
mistake, there is an extra copy created when using matrix().

Try this in a fresh vanilla R session:

> print(gc())
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 136684  3.7     350000  9.4   350000  9.4
Vcells  81026  0.7     786432  6.0   473127  3.7
> x <- matrix(1, nrow=5000, ncol=5000)
> print(gc())
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   136793   3.7     350000   9.4   350000   9.4
Vcells 25081043 191.4   27989266 213.6 25081056 191.4
> x[1,1] <- 2
> print(gc())
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   136797   3.7     350000   9.4   350000   9.4
Vcells 25081044 191.4   52830254 403.1 50081058 382.1

So, yes, in that x[1,1] <- 2 assignment an extra copy is created.  It
is related to to the fact that there is NAMED matrix object being
created inside matrix(), cf. the last rows in matrix():

    x <- .Internal(matrix(data, nrow, ncol, byrow))
    dimnames(x) <- dimnames
    x

Here is a patch for matrix() that avoids this problem *when dimnames
is NULL* (which is many time the case):

matrix <- function(data=NA, nrow=1, ncol=1, byrow=FALSE, dimnames=NULL) {
  data <- as.vector(data);

  if(missing(nrow)) {
    nrow <- ceiling(length(data)/ncol);
  } else if(missing(ncol)) {
    ncol <- ceiling(length(data)/nrow);
  }

  # Trick to avoid extra copy in the case when 'dimnames' is NULL.
  if (is.null(dimnames)) {
    .Internal(matrix(data, nrow, ncol, byrow));
  } else {
    x <- .Internal(matrix(data, nrow, ncol, byrow));
    dimnames(x) <- dimnames;
    x;
  }
} # matrix()


Try the above again in a fresh R session with this patch applied and you'll get:

> print(gc())
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 136805  3.7     350000  9.4   350000  9.4
Vcells  81122  0.7     786432  6.0   473127  3.7
> x <- matrix(1, nrow=5000, ncol=5000)
> print(gc())
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   136919   3.7     350000   9.4   350000   9.4
Vcells 25081139 191.4   27989372 213.6 25081152 191.4
> x[1,1] <- 2
> print(gc())
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   136923   3.7     350000   9.4   350000   9.4
Vcells 25081140 191.4   29468840 224.9 25081276 191.4

Voila!

I talked to Luke Tierney about this and he though the internal method
should be updated to take the dimnames argument, i.e.
.Internal(matrix(data, nrow, ncol, byrow, dimnames)).  However, until
that is happening, may I suggest this simple patch/workaround to go in
R v2.6.0?

Cheers

Henrik


On 9/27/07, Petr Savicky <savicky at cs.cas.cz> wrote:
> On Wed, Sep 26, 2007 at 10:52:28AM -0700, Byron Ellis wrote:
> > For the most part, doing anything to an R object result in it's
> > duplication. You generally have to do a lot of work to NOT copy an R
> > object.
>
> Thank you for your response. Unfortunately, you are right. For example,
> the allocated memory determined by top command on Linux may change during
> a session as follows:
>   a <- matrix(as.integer(1),nrow=14100,ncol=14100) # 774m
>   a[1,1] <- 0 # 3.0g
>   gc() # 1.5g
>
> In the current applicatin, I modify the matrix only using my own C code
> and only read it on R level. So, the above is not a big problem for me
> (at least not now).
>
> However, there is a related thing, which could be a bug. The following
> code determines the value of NAMED field in SEXP header of an object:
>
>   SEXP getnamed(SEXP a)
>   {
>       SEXP out;
>       PROTECT(out = allocVector(INTSXP, 1));
>       INTEGER(out)[0] = NAMED(a);
>       UNPROTECT(1);
>       return(out);
>   }
>
> Now, consider the following session
>
>   u <- matrix(as.integer(1),nrow=5,ncol=3) + as.integer(0)
>   .Call("getnamed",u) # 1 (OK)
>
>   length(u)
>   .Call("getnamed",u) # 1 (OK)
>
>   dim(u)
>   .Call("getnamed",u) # 1 (OK)
>
>   nrow(u)
>   .Call("getnamed",u) # 2 (why?)
>
>   u <- matrix(as.integer(1),nrow=5,ncol=3) + as.integer(0)
>   .Call("getnamed",u) # 1 (OK)
>   ncol(u)
>   .Call("getnamed",u) # 2 (so, ncol does the same)
>
> Is this a bug?
>
> Petr Savicky.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



More information about the R-devel mailing list