[Rd] data frame subset patch, take 2

Tue Dec 12 18:41:50 CET 2006

Hi,
  I tried take 1, and it failed. I have been traveling (and, with 
Martin's changes, also waiting for things to stabilize), so I have not 
yet tried take 2. I will probably try it later this week, and I will 
send an email if it goes in. 
Anyone wanting to try it and run R through check and check-all is 
welcome to do so and report success or failure.

  best wishes
    Robert

Martin Maechler wrote:
>>>>>> "Marcus" == Marcus G Daniels <mgd at santafe.edu>
>>>>>>     on Tue, 12 Dec 2006 09:05:15 -0700 writes:
> 
>     Marcus> Vladimir Dergachev wrote:
>     >> Here is the second iteration of data frame subset patch.
>     >> It now passes make check on both 2.4.0 and 2.5.0 (svn as
>     >> of a few days ago).  Same speedup as before.
>     >> 
>     Marcus> Hi,
> 
>     Marcus> I was wondering if this patch would make it into the
>     Marcus> next release.  I don't see it in SVN, but it's hard
>     Marcus> to be sure because the mailing list apparently
>     Marcus> strips attachments.  If it isn't in, or going to be
>     Marcus> in, is this patch available somewhere else?
> 
> I was wondering too.
>       http://www.r-project.org/mail.html
> explains what kind of attachments are allowed on R-devel.
> 
> I'm particularly interested, since during the last several days
> I've made (somewhat experimental) changes to R-devel,
> which make some dealings with large data frames that have
> "trivial rownames" (those represented as  1:nrow(.))
> much more efficient.
> 
> Notably, as.matrix() of such data frames now no longer produces
> huge row names, and e.g.  dim(.) of such data frames has become
> lightning fast [compared to what it was].
> 
> Some measurements:
> 
> N <- 1e6
> set.seed(1)
> ## we round (for later dump().. reasons)
> x <- round(rnorm(N),2)
> y <- round(rnorm(N),2)
> mOrig <- cbind(x = x, y = y)
> df <- data.frame(x = x, y = y)
> mNew <- as.matrix(df)
> (sizes <- sapply(list(mOrig=mOrig, df=df, mNew=mNew), object.size))
> ## R-2.4.0 (64-bit):
> ##    mOrig       df     mNew
> ## 16000520 16000776 72000560
> 
> ## R-2.4.1 beta (32-bit):
> ##    mOrig       df     mNew
> ## 16000296 16000448 52000320
> 
> ## R-pre-2.5.0 (32-bit):
> ##    mOrig       df     mNew
> ## 16000296 16000448 16000296
> 
> ##------------------------------------
> 
> N <- 1e6
> df <- data.frame(x = 0+ 1:N, y = 1+ 1:N)
> system.time(for(i in 1:1000) d <- dim(df))
> 
> ## R-2.4.1 beta (32-bit) [deb1]:
> ## [1] 1.920 3.748 7.810 0.000 0.000
> 
> ## R-pre-2.5.0 (32-bit) [deb1]:
> ##    user  system elapsed
> ##   0.012   0.000   0.011
> 
> 
> --- --- --- --- --- --- --- --- --- --- 
> 
> However, currently
> 
>   df[2,] ## still internally produces the  character(1e6)  row names!
> 
> something I think we should eliminate as well,
> i.e., at least make sure that only  seq_len(1e6) is internally
> produced and not the character vector.
> 
> Note, however, that some of these changes are backward
> incompatible. I do hope that the efficiency gained
> for such large data frames is worth some adaptation of
> current/old R source code.
> 
> Feedback on this topic is very welcome!
> 
> Martin
> 
> 
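For anyone who wants to see the distinction Martin describes between
"trivial" and character row names, here is a minimal sketch. It assumes
a reasonably recent R (>= 2.15 for paste0(), and .row_names_info() as it
exists from 2.5.0 on); the names df_auto and df_char are made up for
illustration and are not part of the patch or of Martin's changes.

```r
## Sketch: compare automatic ("trivial") row names with character ones.
## df_auto / df_char are illustrative names only.
N <- 1e5
df_auto <- data.frame(x = 0 + 1:N, y = 1 + 1:N)   # row names are 1:N
df_char <- df_auto
rownames(df_char) <- paste0("r", seq_len(N))      # explicit character row names

## .row_names_info() returns a negative count for automatic row names
## and a positive count otherwise.
auto_info <- .row_names_info(df_auto)
char_info <- .row_names_info(df_char)

## Character row names are stored in full, so the object is larger.
size_auto <- as.numeric(object.size(df_auto))
size_char <- as.numeric(object.size(df_char))

## With automatic row names, as.matrix() carries no row names along.
m_auto <- as.matrix(df_auto)
has_rownames <- !is.null(rownames(m_auto))
```

Comparing size_auto with size_char, and checking has_rownames, should
show roughly the effect reported in the measurements quoted above,
though the exact byte counts depend on the platform and R version.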

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org