[Rd] extracting rows from a data frame by looping over the row names: performance issues

Greg Snow Greg.Snow at intermountainmail.org
Fri Mar 2 20:51:05 CET 2007


Your two examples differ in two ways (row names versus integer indices,
and dat[key, ] versus per-column extraction with sapply), so the two
effects are confounded.

What are your results for:

system.time(for (i in 1:100) {row <-  dat[i, ] })
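
A minimal sketch of how the two differences could be timed separately. It assumes a smaller data frame than the one in the original post (n = 20000 is an arbitrary choice so that the sketch finishes quickly), and the names n, keys and elapsed are only illustrative:

```r
# Build a data frame like the one in the original post, but smaller
# (assumption: 20000 rows instead of 300000, purely to keep this quick).
n <- 20000
mat <- matrix(rep(paste(letters, collapse = ""), 5 * n), ncol = 5)
dat <- as.data.frame(mat)
keys <- row.names(dat)[1:100]

elapsed <- c(
  # 1. data frame indexing by character row name (the original slow form)
  df_by_name = system.time(
    for (key in keys) { row <- dat[key, ] }
  )[["elapsed"]],
  # 2. data frame indexing by integer position (changes only the index type)
  df_by_index = system.time(
    for (i in 1:100) { row <- dat[i, ] }
  )[["elapsed"]],
  # 3. per-column extraction by integer position (the original fast form)
  col_by_index = system.time(
    for (i in 1:100) { row <- sapply(dat, function(col) col[i]) }
  )[["elapsed"]]
)

print(elapsed)
```

Comparing 1 with 2 isolates the cost of looking rows up by name, and comparing 2 with 3 isolates the cost of data frame indexing itself. The actual timings will depend on the R version and the machine.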



-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
 
 

> -----Original Message-----
> From: r-devel-bounces at r-project.org 
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Herve Pages
> Sent: Friday, March 02, 2007 11:40 AM
> To: r-devel at r-project.org
> Subject: [Rd] extracting rows from a data frame by looping 
> over the row names: performance issues
> 
> Hi,
> 
> 
> I have a big data frame:
> 
>   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>   > dat <- as.data.frame(mat)
> 
> and I need to do some computation on each row. Currently I'm 
> doing this:
> 
>   > for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }
> 
> which could probably be considered a very natural (and R'ish) 
> way of doing it (but maybe I'm wrong and the real idiom for 
> doing this is something different).
> 
> The problem with this "idiomatic form" is that it is _very_ 
> slow. The loop itself + the simple extraction of the rows (no 
> computation on the rows) takes 10 hours on a powerful server 
> (quad core Linux with 8G of RAM)!
> 
> Looping over the first 100 rows takes 12 seconds:
> 
>   > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
>      user  system elapsed
>    12.637   0.120  12.756
> 
> But if, instead of the above, I do this:
> 
>   > for (i in seq_len(nrow(dat))) { row <- sapply(dat, function(col) col[i]) }
> 
> then it's 20 times faster!!
> 
>   > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>      user  system elapsed
>     0.576   0.096   0.673
> 
> I hope you will agree that this second form is much less natural.
> 
> So I was wondering why the "idiomatic form" is so slow? 
> Shouldn't the idiomatic form be, not only elegant and easy to 
> read, but also efficient?
> 
> 
> Thanks,
> H.
> 
> 
> > sessionInfo()
> R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
> x86_64-unknown-linux-gnu
> 
> locale:
> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
> [7] "base"
> 


