[Rd] extracting rows from a data frame by looping over the row names: performance issues

Greg Snow Greg.Snow at intermountainmail.org
Mon Mar 5 17:07:56 CET 2007


The difference is in indexing by row number vs. indexing by row name.
It has long been known that names slow matrices down.  Some routines
make a copy of the dimnames of a matrix, remove the dimnames, do the
computations with the matrix, then put the dimnames back on.  This can
speed things up quite a bit in some circumstances.  For your example,
indexing by number means jumping to a specific offset in the matrix,
while indexing by name means searching through all the names and doing
string comparisons to find the match.  A 300-fold difference in speed
is not surprising.
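
Here is a rough sketch of both ideas, in case it helps.  The data are
made up and kept small so it runs quickly, and the names dat, keys,
idx, m and dn are only for illustration.  The first part translates the
row names to integer positions once with match() and then indexes by
number inside the loop; the second part sets the dimnames of a matrix
aside, does the computation, and puts them back.

```r
# Small illustrative data frame (hypothetical; not the data from the thread).
dat <- data.frame(x = 1:1000, y = seq(0.5, 500, by = 0.5))
keys <- row.names(dat)[1:100]

# Translate the names to integer positions once, outside the loop ...
idx <- match(keys, row.names(dat))

# ... then index by position inside the loop instead of by name.
for (i in idx) {
  row <- dat[i, ]
}

# The dimnames trick for a matrix: set the names aside, compute, restore.
m <- matrix(as.numeric(1:20), nrow = 4,
            dimnames = list(letters[1:4], LETTERS[1:5]))
dn <- dimnames(m)
dimnames(m) <- NULL   # work on the bare matrix
m <- m * 2            # stand-in for the real computation
dimnames(m) <- dn     # put the names back on
```

Whether this pays off depends on how many lookups you do; for a handful
of rows the plain dat[key, ] is probably fine.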



-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
 
 

> -----Original Message-----
> From: Herve Pages [mailto:hpages at fhcrc.org] 
> Sent: Friday, March 02, 2007 7:04 PM
> To: Greg Snow
> Cc: r-devel at r-project.org
> Subject: Re: [Rd] extracting rows from a data frame by 
> looping over the row names: performance issues
> 
> Hi Greg,
> 
> Greg Snow wrote:
> > Your 2 examples have 2 differences and they are therefore confounded
> > in their effects.
> > 
> > What are your results for:
> > 
> > system.time(for (i in 1:100) {row <-  dat[i, ] })
> > 
> > 
> > 
> 
> Right. What you suggest is even faster (and more simple):
> 
>   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>   > dat <- as.data.frame(mat)
> 
>   > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
>      user  system elapsed
>    13.241   0.460  13.702
> 
>   > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>      user  system elapsed
>     0.280   0.372   0.650
> 
>   > system.time(for (i in 1:100) {row <-  dat[i, ] })
>      user  system elapsed
>     0.044   0.088   0.130
> 
> So apparently here extracting with dat[i, ] is 300 times 
> faster than extracting with dat[key, ] !
> 
> > system.time(for (i in 1:100) dat["1", ])
>    user  system elapsed
>  12.680   0.396  13.075
> 
> > system.time(for (i in 1:100) dat[1, ])
>    user  system elapsed
>   0.060   0.076   0.137
> 
> Good to know!
> 
> Thanks a lot,
> H.
> 
> 
>


