[R] practical to loop over 2million rows?

David Winsemius dwinsemius at comcast.net
Wed Oct 10 23:16:52 CEST 2012


On Oct 10, 2012, at 1:31 PM, Jay Rice wrote:

> New to R and having issues with loops. I am aware that I should use
> vectorization whenever possible and use the apply functions, however,
> sometimes a loop seems necessary.
> 
> I have a data set of 2 million rows and have tried to run a couple of loops of
> varying complexity to test efficiency. If I do a very simple loop, such as
> adding every item in a column, I get an answer quickly.
> 
> If I use a nested ifelse statement in a loop it takes me 13 minutes to get
> an answer on just 50,000 rows. I am aware of a few methods to speed up
> loops: preallocating memory and computing as much as possible outside of
> the loop (or writing functions and just looping over the function), but
> it seems that even with these speed-ups I might have too much data to run
> loops.  Here is the loop I ran that took 13 minutes. I realize I can
> accomplish the same goal using vectorization (and in fact did so).

You should describe what you want to do, learn to use the vectorized capabilities of R, and leave for-loops for the processes that really need them.


> 
> y<-numeric(length(x))
> 
> for(i in 1:length(x))
> 
> ifelse(!is.na(x[i]), y[i]<-x[i],

Instead:

y[!is.na(x)] <- x[!is.na(x)]  # No loop.
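
A small sketch of that replacement on made-up data (the values of x below are only for illustration, since the original post does not show x). It suggests how a logical index copies every non-missing value in one step and leaves the remaining positions at their preallocated 0:

```r
# Illustrative data only; the original post does not show x.
x <- c(1.5, NA, 3, NA, 5)
y <- numeric(length(x))        # preallocated, all zeros

keep <- !is.na(x)              # logical vector: TRUE where x is not missing
y[keep] <- x[keep]             # one vectorized assignment, no loop

y
# [1] 1.5 0.0 3.0 0.0 5.0
```

Storing the logical index in 'keep' (a name chosen here, not from the post) just avoids computing !is.na(x) twice.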


> 
> ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1]))

When you index past the end of x you get NA as a result. Furthermore, you are assigning to y itself rather than to y[i], so y ends up as only a single element. So I think 'y' will be a single NA at the end of all this.

> strataID <- sample(1:2, 10, repl=TRUE)
> strataID
 [1] 1 1 2 2 1 2 2 2 2 1

> for(i in 1:length(x)) {ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1])}
> y
[1] NA

There is no implicit indexing of the LHS of an assignment operation: if you want to set element i, you have to write y[i]. How long is strataID? And why not do this inside a data frame?
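
To make those two points concrete, here is a minimal sketch on made-up vectors. The last part is only a guess at what the loop was meant to do (take the next value of x when the next stratum matches the current one, otherwise the previous value); that intent is an assumption, not something stated in the post, and the names nextx, prevx, same_next and filled are invented for the example.

```r
# Illustrative data only.
x        <- c(10, 20, 30, 40, 50)
strataID <- c(1, 1, 2, 2, 1)
n <- length(x)

x[n + 1]                 # NA: the index is past the end of the vector

y <- numeric(n)
y <- x[2]                # no index on the LHS: y is replaced entirely
length(y)                # 1

y <- numeric(n)
y[3] <- x[2]             # explicit index on the LHS: only y[3] changes

# Guess at the intent, vectorized with shifted copies of x.
nextx     <- c(x[-1], NA)                            # x[i + 1], NA at the end
prevx     <- c(NA, x[-n])                            # x[i - 1], NA at the start
same_next <- c(strataID[-1] == strataID[-n], FALSE)  # is stratum i + 1 the same as i?
filled    <- ifelse(same_next, nextx, prevx)

filled
# [1] 20 10 40 30 40
```

The shifted vectors state the boundary cases explicitly (the first row has no previous value, the last row has no next one), which the loop version leaves to out-of-range indexing.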

> 
> Presumably, complicated loops would be more intensive than the nested if
> statement above. If I write more efficient loops time will come down but I
> wonder if I will ever be able to write efficient enough code to perform a
> complicated loop over 2 million rows in a reasonable time.
> 
> Is it useless for me to try to do any complicated loops on 2 million rows,
> or if I get much better at programming in R will it be manageable even for
> complicated situations?
> 

You will gain efficiency when you learn vectorization, and when you learn to test your code for correct behavior.
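
One way to do both at once, sketched on simulated data: write the loop and the vectorized version as small functions, check that they agree, and only then compare timings. The function names loop_fill and vec_fill and the data are invented for this example; actual timings will depend on the machine, so none are shown.

```r
set.seed(1)                          # made-up data for illustration
x <- rnorm(1e5)
x[sample(length(x), 1e4)] <- NA      # sprinkle in some missing values

# Loop version: copy non-missing values one element at a time.
loop_fill <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) {
    if (!is.na(x[i])) y[i] <- x[i]
  }
  y
}

# Vectorized version: one logical index, one assignment.
vec_fill <- function(x) {
  y <- numeric(length(x))
  ok <- !is.na(x)
  y[ok] <- x[ok]
  y
}

# Test for correct behavior first, then time.
stopifnot(identical(loop_fill(x), vec_fill(x)))
system.time(loop_fill(x))
system.time(vec_fill(x))
```

Using seq_along(x) rather than 1:length(x) also avoids the loop running over c(1, 0) when x is empty.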

> 
> Jay
> 

David Winsemius, MD
Alameda, CA, USA



