[R] Loops and dataframes

Liaw, Andy andy_liaw at merck.com
Fri Feb 25 12:28:44 CET 2005


You are seeing part of the overhead of using a data frame.  How you specify
the subset of the data frame to replace also matters somewhat:

> st <- rep(1,1e4)
> ed <- rep(2,1e4)
> df <- data.frame(start=st, end=ed)
> system.time(for (i in 1:dim(df)[1]) df[i,1] <- df[i,2], gcFirst=TRUE)
[1] 35.96  0.10 36.37    NA    NA
> df <- data.frame(start=st, end=ed)
> system.time(for (i in 1:dim(df)[1]) df[[1]][i] <- df[[2]][i], gcFirst=TRUE)
[1] 22.63  0.17 22.88    NA    NA
> df <- data.frame(start=st, end=ed)
> system.time(for (i in 1:dim(df)[1]) df$start[i] <- df$end[i], gcFirst=TRUE)
[1] 19.29  0.13 19.46    NA    NA
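
All three timings above pay the data frame replacement cost once per iteration. A minimal sketch of one way around that, assuming the loop body only needs the columns as plain vectors: extract them once, loop over the vectors, and assign the result back in a single step. The names start_vec and end_vec are illustrative and not part of the original example, and no timing is claimed here.

```r
st <- rep(1, 1e4)
ed <- rep(2, 1e4)
df <- data.frame(start = st, end = ed)

# Pull the columns out once, so the loop works on plain numeric vectors
# rather than on the data frame itself.
start_vec <- df$start
end_vec <- df$end

for (i in seq_along(start_vec)) {
  start_vec[i] <- end_vec[i]
}

# A single data frame assignment instead of one per row.
df$start <- start_vec
```

For this particular toy loop, the whole thing reduces to the vectorized df$start <- df$end; the extract-then-loop form is only worth it when the per-row work cannot be vectorized.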


If all your data are numeric, you might as well use a matrix instead of a
data frame:

> m <- cbind(start=st, end=ed)
> str(m)
 num [1:10000, 1:2] 2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "start" "end"
> system.time(for (i in 1:nrow(m)) m[i,1] <- m[i,2], gcFirst=TRUE)
[1] 0.06 0.00 0.08   NA   NA
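
If the data already live in a data frame, a rough sketch of the same idea is to convert to a matrix for the loop and convert back afterwards. This assumes every column is numeric; otherwise as.matrix() returns a character matrix, which the check below guards against.

```r
df <- data.frame(start = rep(1, 1e4), end = rep(2, 1e4))

# as.matrix() yields a numeric matrix only when all columns are numeric.
m <- as.matrix(df)
stopifnot(is.numeric(m))

# Element-wise replacement on a matrix avoids the data frame methods.
for (i in seq_len(nrow(m))) {
  m[i, "start"] <- m[i, "end"]
}

# Convert back if later code expects a data frame.
df <- as.data.frame(m)
```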


Andy


> From: Firas Swidan
> 
> Hi,
> I am experiencing a long delay when using dataframes inside loops and was
> wondering if this is a bug or not.
> Example code:
> 
> > st <- rep(1,100000)
> > ed <- rep(2,100000)
> > for(i in 1:length(st)) st[i] <- ed[i] # works fine
> > df <- data.frame(start=st,end=ed)
> > for(i in 1:dim(df)[1]) df[i,1] <- df[i,2] # takes forever
> 
> R: R 2.0.0 (2004-10-04)
> OS: Linux, Fedora Core 2
> kernel: 2.6.10-1.14_FC2
> cpu: AMD Athlon XP 1600.
> mem: 500MB.
> 
> The example above is only to illustrate the problem. I need loops to apply
> some functions on pairs (not necessarily successive) of rows in a
> dataframe.
> 
> Thankful for any advice,
> Firas.
> 



