[Rd] data frame subset patch, take 2

Tue Dec 12 18:08:01 CET 2006

>>>>> "Marcus" == Marcus G Daniels <mgd at santafe.edu>
>>>>>     on Tue, 12 Dec 2006 09:05:15 -0700 writes:

    Marcus> Vladimir Dergachev wrote:
    >> Here is the second iteration of data frame subset patch.
    >> It now passes make check on both 2.4.0 and 2.5.0 (svn as
    >> of a few days ago).  Same speedup as before.
    >> 
    Marcus> Hi,

    Marcus> I was wondering if this patch would make it into the
    Marcus> next release.  I don't see it in SVN, but it's hard
    Marcus> to be sure because the mailing list apparently
    Marcus> strips attachments.  If it isn't in, or going to be
    Marcus> in, is this patch available somewhere else?

I was wondering too.
      http://www.r-project.org/mail.html
explains what kind of attachments are allowed on R-devel.

I'm particularly interested, since during the last several days
I've made (somewhat experimental) changes to R-devel
which make some operations on large data frames that have
"trivial rownames" (those represented as  1:nrow(.))
much more efficient.

Notably, as.matrix() of such data frames no longer produces
huge row names, and e.g.  dim(.) of such data frames has become
lightning fast [compared to what it was].

Some measurements:

N <- 1e6
set.seed(1)
## we round (for later dump().. reasons)
x <- round(rnorm(N),2)
y <- round(rnorm(N),2)
mOrig <- cbind(x = x, y = y)
df <- data.frame(x = x, y = y)
mNew <- as.matrix(df)
(sizes <- sapply(list(mOrig=mOrig, df=df, mNew=mNew), object.size))
## R-2.4.0 (64-bit):
##    mOrig       df     mNew
## 16000520 16000776 72000560

## R-2.4.1 beta (32-bit):
##    mOrig       df     mNew
## 16000296 16000448 52000320

## R-pre-2.5.0 (32-bit):
##    mOrig       df     mNew
## 16000296 16000448 16000296
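
To see where the extra bytes in the older versions plausibly come from
(the character row names carried over to the matrix), here is a minimal
sketch for a current R version.  The small N and the names `m` and `rn`
are illustrative only and not part of the measurements above.

```r
## Illustrative sketch; N is kept small and the names are made up here.
N <- 1e4
df <- data.frame(x = round(rnorm(N), 2), y = round(rnorm(N), 2))
m <- as.matrix(df)

## With the change described above, "trivial" row names are not copied
## to the matrix, so rownames(m) should be NULL.
rn <- rownames(m)
is.null(rn)

## Rough cost of materializing such row names as a character vector,
## versus the integer sequence they stand for:
object.size(as.character(seq_len(N)))
object.size(seq_len(N))
```

The exact byte counts depend on the platform and R version, so none are
quoted here.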

##------------------------------------

N <- 1e6
df <- data.frame(x = 0+ 1:N, y = 1+ 1:N)
system.time(for(i in 1:1000) d <- dim(df))

## R-2.4.1 beta (32-bit) [deb1]:
## [1] 1.920 3.748 7.810 0.000 0.000

## R-pre-2.5.0 (32-bit) [deb1]:
##    user  system elapsed
##   0.012   0.000   0.011

--- --- --- --- --- --- --- --- --- --- 

However, currently

  df[2,] ## still internally produces the  character(1e6)  row names!

something I think we should eliminate as well,
i.e., at least make sure that only  seq_len(1e6) is internally
produced and not the character vector.
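
The distinction between the compact internal form and the expanded
character vector can be inspected as follows.  This is only a sketch,
assuming a current R version in which `.row_names_info()` is available in
base; `small`, `internal`, `n` and `expanded` are made-up names.

```r
## Sketch for a current R version; `small` is a made-up example data frame.
small <- data.frame(x = 0 + 1:5, y = 1 + 1:5)

## type = 0L returns the internal "row.names" attribute without expanding
## it; for automatic row names this is a compact length-2 form.
internal <- .row_names_info(small, 0L)

## type = 2L gives the number of rows implied by that attribute.
n <- .row_names_info(small, 2L)

## By contrast, rownames() materializes the full character vector.
expanded <- rownames(small)
```

For a data frame with 1e6 rows, the first two calls stay cheap, while the
last one has to build all 1e6 strings.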

Note however that some of these changes are backward
incompatible. I do hope that the efficiency gains
for such large data frames are worth some adaptation of
current/old R source code.

Feedback on this topic is very welcome!

Martin