[R] loop vs. apply(): strange behavior with data frame?

Jim Holtman jholtman at gmail.com
Thu Oct 22 06:05:36 CEST 2009


Try running Rprof on the two examples to see where they differ. You
will probably find that most of the time in the data frame case is
spent indexing it like a matrix ('['). Rprof is very helpful for
seeing where time is spent in your scripts.
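
A minimal sketch of what such a profiling run could look like. dist.for is
copied from the quoted message below; the rnorm() data, the smaller size
(50 rows), and the names prof_file, m, df, d_df and d_m are assumptions made
for illustration and are not part of the original post.

```r
# Sketch only: dist.for is taken from the quoted message; the data is
# generated with rnorm() (an assumption, to avoid the mvtnorm dependency).
dist.for <- function(data) {
  d <- matrix(0, nrow = nrow(data), ncol = nrow(data))
  n <- ncol(data)
  r <- nrow(data)
  for (i in 1:r) {
    for (j in 1:r) {
      d[i, j] <- sum(abs(data[i, ] - data[j, ])) / n
    }
  }
  as.dist(d)
}

set.seed(1)
m  <- matrix(rnorm(50 * 10), nrow = 50)  # kept small so the run stays short
df <- as.data.frame(m)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)  # start sampling the call stack
d_df <- dist.for(df)               # the slow case: row indexing on a data frame
Rprof(NULL)                        # stop profiling

# Functions ranked by total time; check how high "[.data.frame" appears
prof <- summaryRprof(prof_file)
print(head(prof$by.total, 10))

# Same loop on the matrix, to confirm both inputs give the same distances
d_m <- dist.for(m)
```

If the profile does point at '[.data.frame', one option is to convert the
data frame once with as.matrix() before entering the loop.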

On Oct 21, 2009, at 17:17, Roberto Perdisci  
<roberto.perdisci at gmail.com> wrote:

> Hi everybody,
>  I noticed a strange behavior when using loops versus apply() on a  
> data frame.
> The example below explicitly computes a distance matrix for a given
> dataset. When the dataset is a matrix, everything works fine. But when
> the dataset is a data.frame, the dist.for function, written with
> nested loops, takes far longer than dist.apply:
>
> ######## USING FOR #######
>
> dist.for <- function(data) {
>
>  d <- matrix(0,nrow=nrow(data),ncol=nrow(data))
>  n <- ncol(data)
>  r <- nrow(data)
>
>  for(i in 1:r) {
>     for(j in 1:r) {
>        d[i,j] <- sum(abs(data[i,]-data[j,]))/n
>     }
>  }
>
>  return(as.dist(d))
> }
>
> ######## USING APPLY #######
>
> # Distances from one row (data.row) to every row of data.rest
> f <- function(data.row,data.rest) {
>
>  r2 <- as.double(apply(data.rest,1,g,data.row))
>  return(r2)
>
> }
>
> g <- function(row2,row1) {
>  return(sum(abs(row1-row2))/length(row1))
> }
>
> dist.apply <- function(data) {
>  d <- apply(data,1,f,data)
>
>  return(as.dist(d))
> }
>
>
> ######## TESTING #######
>
> library(mvtnorm)
> data <- rmvnorm(100,mean=seq(1,10),sigma=diag(1,nrow=10,ncol=10))
>
> tf <- system.time(df <- dist.for(data))
> ta <- system.time(da <- dist.apply(data))
>
> print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
> print("tf = ")
> print(tf)
> print("ta = ")
> print(ta)
>
> print('----------------------------------')
> print('Same experiment on data.frame...')
> data2 <- as.data.frame(data)
>
> tf <- system.time(df <- dist.for(data2))
> ta <- system.time(da <- dist.apply(data2))
>
> print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
> print("tf = ")
> print(tf)
> print("ta = ")
> print(ta)
>
> ########################
>
> Here is the output I get on my system (R version 2.7.1 on Debian
> lenny):
>
> [1] "diff =  0"
> [1] "tf = "
>   user  system elapsed
>  0.088   0.000   0.087
> [1] "ta = "
>   user  system elapsed
>  0.128   0.000   0.128
> [1] "----------------------------------"
> [1] "Same experiment on data.frame..."
> [1] "diff =  0"
> [1] "tf = "
>   user  system elapsed
> 35.031   0.000  35.029
> [1] "ta = "
>   user  system elapsed
>  0.184   0.000   0.185
>
> Could you explain why that happens?
>
> thank you,
> regards
>
> Roberto
>



