[R] loop vs. apply(): strange behavior with data frame?

Thu Oct 22 20:01:25 CEST 2009

Thanks for the suggestion.
   I found some documentation on why accessing a data.frame using the
matrix notation (e.g., [i,j]) is so expensive, which was the cause of
the problem.
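
For the archive, here is a minimal sketch of the workaround, under the assumption that all columns are numeric. As far as I understand it, data[i,] on a data.frame dispatches to the interpreted `[.data.frame` method and returns a one-row data.frame, whereas apply() converts its argument with as.matrix() once up front. The name dist.for.matrix is hypothetical (it is not from the original post); it is the same nested loop as dist.for, run on a matrix.

```r
# Hypothetical helper (not in the original post): the same nested loop as
# dist.for, but the input is converted to a plain matrix once, so each
# m[i, ] is a cheap numeric vector instead of a one-row data.frame.
# Assumes every column is numeric.
dist.for.matrix <- function(data) {
  m <- as.matrix(data)   # one-time conversion; leaves a matrix unchanged
  r <- nrow(m)
  n <- ncol(m)
  d <- matrix(0, nrow = r, ncol = r)

  for (i in seq_len(r)) {
    for (j in seq_len(r)) {
      # mean absolute difference between rows i and j
      d[i, j] <- sum(abs(m[i, ] - m[j, ])) / n
    }
  }

  as.dist(d)
}

# Small made-up dataset, as a matrix and as a data.frame
set.seed(1)
x  <- matrix(rnorm(50 * 10), nrow = 50)
xd <- as.data.frame(x)

d.mat <- dist.for.matrix(x)
d.df  <- dist.for.matrix(xd)

# Timings depend on the machine and R version, so none are quoted here
print(system.time(dist.for.matrix(xd)))
```

Since the quantity computed is the Manhattan distance divided by the number of columns, the result can also be checked against dist(x, method = "manhattan") / ncol(x).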

regards,

Roberto

On Thu, Oct 22, 2009 at 12:05 AM, Jim Holtman <jholtman at gmail.com> wrote:
> Try running Rprof on the two examples to see what the difference is. You
> will probably see that much of the time on the data frame is spent
> accessing it like a matrix ('['). Rprof is very helpful for seeing where
> time is spent in your scripts.
>
> On Oct 21, 2009, at 17:17, Roberto Perdisci <roberto.perdisci at gmail.com>
> wrote:
>
>> Hi everybody,
>>  I noticed a strange behavior when using loops versus apply() on a data
>> frame.
>> The example below explicitly computes a distance matrix for a given
>> dataset. When the dataset is a matrix, everything works fine. But when
>> the dataset is a data.frame, the dist.for function, written using
>> nested loops, takes much longer than dist.apply.
>>
>> ######## USING FOR #######
>>
>> dist.for <- function(data) {
>>
>>  d <- matrix(0,nrow=nrow(data),ncol=nrow(data))
>>  n <- ncol(data)
>>  r <- nrow(data)
>>
>>  for(i in 1:r) {
>>    for(j in 1:r) {
>>       d[i,j] <- sum(abs(data[i,]-data[j,]))/n
>>    }
>>  }
>>
>>  return(as.dist(d))
>> }
>>
>> ######## USING APPLY #######
>>
>> f <- function(data.row,data.rest) {
>>
>>  # distances from data.row to every row of data.rest
>>  return(as.double(apply(data.rest,1,g,data.row)))
>>
>> }
>>
>> g <- function(row2,row1) {
>>  return(sum(abs(row1-row2))/length(row1))
>> }
>>
>> dist.apply <- function(data) {
>>  d <- apply(data,1,f,data)
>>
>>  return(as.dist(d))
>> }
>>
>>
>> ######## TESTING #######
>>
>> library(mvtnorm)
>> data <- rmvnorm(100,mean=seq(1,10),sigma=diag(1,nrow=10,ncol=10))
>>
>> tf <- system.time(df <- dist.for(data))
>> ta <- system.time(da <- dist.apply(data))
>>
>> print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
>> print("tf = ")
>> print(tf)
>> print("ta = ")
>> print(ta)
>>
>> print('----------------------------------')
>> print('Same experiment on data.frame...')
>> data2 <- as.data.frame(data)
>>
>> tf <- system.time(df <- dist.for(data2))
>> ta <- system.time(da <- dist.apply(data2))
>>
>> print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
>> print("tf = ")
>> print(tf)
>> print("ta = ")
>> print(ta)
>>
>> ########################
>>
>> Here is the output I get on my system (R version 2.7.1 on Debian lenny):
>>
>> [1] "diff =  0"
>> [1] "tf = "
>>  user  system elapsed
>>  0.088   0.000   0.087
>> [1] "ta = "
>>  user  system elapsed
>>  0.128   0.000   0.128
>> [1] "----------------------------------"
>> [1] "Same experiment on data.frame..."
>> [1] "diff =  0"
>> [1] "tf = "
>>  user  system elapsed
>> 35.031   0.000  35.029
>> [1] "ta = "
>>  user  system elapsed
>>  0.184   0.000   0.185
>>
>> Could you explain why that happens?
>>
>> thank you,
>> regards
>>
>> Roberto
>>