[Rd] [R] variance/mean

Martin Maechler maechler at stat.math.ethz.ch
Wed Apr 1 10:09:33 CEST 2009


>>>>> Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
>>>>>     on Tue, 24 Mar 2009 00:39:58 +0100 writes:

    > (this post suggests a patch to the sources, so i allow myself to divert
    > it to r-devel)

    > Bert Gunter wrote:
    >> x a numeric vector, matrix or data frame. 
    >> y NULL (default) or a vector, matrix or data frame with compatible
    >> dimensions to x. The default is equivalent to y = x (but more efficient). 
    >> 
    >> 
    > bert points to an interesting fragment of ?var:  it suggests that
    > computing var(x) is more efficient than computing var(x,x), for any x
    > valid as input to var.  indeed:

    > set.seed(0)
    > x = matrix(rnorm(10000), 100, 100)

    > library(rbenchmark)
    > benchmark(replications=1000, columns=c('test', 'elapsed'),
    > var(x),
    > var(x, x))
    > #        test elapsed
    > # 1    var(x)   1.091
    > # 2 var(x, x)   2.051

    > that's of course, so to speak, unreasonable:  for what var(x) does is
    > actually computing the covariance of x and x, which should be the same
    > as var(x,x). 

    > the hack is that if y is given, there's an overhead of memory allocation
    > for *both* x and y when y is given, as seen in src/main/cov.c:720+.
    > incidentally, it seems that the problem can be solved with a trivial fix
    > (see the attached patch), so that

    > set.seed(0)
    > x = matrix(rnorm(10000), 100, 100)

    > library(rbenchmark)
    > benchmark(replications=1000, columns=c('test', 'elapsed'),
    > var(x),
    > var(x, x))
    > #        test elapsed
    > # 1    var(x)   1.121
    > # 2 var(x, x)   1.107

    > with the quick checks

    > all.equal(var(x), var(x, x))
    > # TRUE
   
    > all(var(x) == var(x, x))
    > # TRUE

    > and for cor it seems to make cor(x,x) slightly faster than cor(x), while
    > originally it was twice slower:

    > # original
    > benchmark(replications=1000, columns=c('test', 'elapsed'),
    > cor(x),
    > cor(x, x))
    > #        test elapsed
    > # 1    cor(x)   1.196
    > # 2 cor(x, x)   2.253
   
    > # patched
    > benchmark(replications=1000, columns=c('test', 'elapsed'),
    > cor(x),
    > cor(x, x))
    > #        test elapsed
    > # 1    cor(x)   1.207
    > # 2 cor(x, x)   1.204

    > (there is a visible penalty due to an additional pointer test, but it's
    > 10ms on 1000 replications with 10000 data points, which i think is
    > negligible.)

    >> This is as clear as I would know how to state. 

    > i believe bert is right.

    > however, with the above fix, this can now be rewritten as:

    > "
    > x: a numeric vector, matrix or data frame. 
    > y: a vector, matrix or data frame with dimensions compatible to those of x. 
    > By default, y = x. 
    > "

    > which, to my simple mind, is even more clear than what bert would know
    > how to state, and less likely to cause the sort of confusion that
    > originated this thread.

Your patch is basically only affecting the default  
method = "pearson". For (most) other cases, 'y = NULL' would
still remain  *the* way to save computations, unless we'd start
to use an R-level equivalent [which I think does not exist] of
your C  trick   (DATAPTR(x) == DATAPTR(y)).

Also, for S- and R- backcompatibility reasons, we'd need to
continue allowing  y = NULL (as your patch would, too), so
currently I think this whole idea -- as slick as it is, I
learned something!  --  
does not make sense applying here.

    > the attached patch suggests modifications to src/main/cov.c and
    > src/library/stats/man/cor.Rd.

BTW: since you didn't (and shouldn't , because of method != "pearson" !) 
 change the R code, the docs  \usage{.} part should not have been
 changed either ! 
 and as I mentioned: using 'y = NULL' in the function call must
 continue to work, hence should also be documented as
 possibility
 ==>  the docs would not really become more clear, I think 

Martin Maechler, ETH Zurich



    > it has been prepared and checked as follows:

    > svn co https://svn.r-project.org/R/trunk trunk
    > cd trunk
    > # edited the sources
    > svn diff > cov.diff
    > svn revert -R src
    > patch -p0 < cov.diff

    > tools/rsync-recommended
    > ./configure
    > make
    > make check
    > bin/R
    > # subsequent testing within R

    > if you happen to consider this patch for a commit, please be sure to
    > examine and test it carefully first.

    > vQ
    > Content-Type: text/x-diff; name="cov.diff"
    > Content-Disposition: inline; filename="cov.diff"
    > Content-ID: <18899.7024.520234.153929 at lynne.math.ethz.ch>
    > Content-Transfer-Encoding: binary

    > [Deleted text/x-diff]

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list