[R] practical to loop over 2 million rows?

Thu Oct 11 04:24:52 CEST 2012

On Oct 10, 2012, at 6:45 PM, jim holtman wrote:

> This is a classic example from my tag line:
> 
> Tell me what you want to do, not how you want to do it.
> 
> For example you provided no information as to what the objects were.
> I hope that 'strataID' is at least of length one greater than 'x' based
> on your loops.  Also on the last iteration you are trying to access an
> element outside of x (x[length(x) + 1]).
> 
> The first part is easy for setting 'y'
> 
> indx <- !is.na(x)
> y[indx] <- x[indx]
> 
That's perhaps faster than the approach I offered because it only evaluates is.na(x) once.
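
A minimal sketch of that compute-the-mask-once idea, on toy data. The vector 'x' below is made up for illustration (Jay did not post his data), and 'y' is preallocated the same way as in his original code:

```r
# Toy data; 'x' is a made-up vector, not Jay's actual column.
x <- c(1.5, NA, 3, NA, 5)
y <- numeric(length(x))   # preallocated, all zeros

indx <- !is.na(x)         # logical mask, computed once
y[indx] <- x[indx]        # copy only the non-missing values

y
```

The positions where 'x' is NA keep their initial value of 0 in 'y', which is what the first branch of the original loop would have left behind as well.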

> For the second part you can do something like:
> 
> indx <- head(strataID, -1) == tail(strataID, -1)  # TRUE where element i equals element i+1

Jay; if you have not figured it out yet, Jim Holtman is one of the premier data-meisters around here. You could probably write an excellent book simply by going to the Archives and pasting together all the elegant solutions he has provided over the years. His moniker 'Data Munger Guru' is well deserved.

> 
> but since you did not provide any data, the rest is left to the reader.

"You" meaning Jay. (At least I hope that is what Jim meant.) 

Jim; Perhaps your tagline should say: "Tell me what you have, and only then, what you want to do with it."
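
For what it is worth, here is one way the remaining step might be vectorized. This is only a sketch under a guessed reading of the original loop: where x[i] is missing, take x[i+1] if row i+1 is in the same stratum, otherwise take x[i-1]. The toy data and the helper names 'same_next', 'nxt', 'prv' and 'miss' are hypothetical, and the last row is treated as having no "next" row in its stratum.

```r
# Toy data, made up for illustration; not Jay's data.
x        <- c(10, 20, NA, NA, 50, NA)
strataID <- c( 1,  1,  1,  2,  2,  2)
n <- length(x)

# Adjacent comparison of strata; padded with FALSE because the
# last row has no following row to compare against.
same_next <- c(head(strataID, -1) == tail(strataID, -1), FALSE)

nxt <- c(x[-1], NA)   # x[i+1], with NA past the end
prv <- c(NA, x[-n])   # x[i-1], with NA before the start

y <- x
miss <- is.na(x)      # rows that need a replacement value
y[miss] <- ifelse(same_next[miss], nxt[miss], prv[miss])

y
```

On this toy input 'y' comes out as 10 20 20 50 50 50. If the neighbouring value is itself missing, the result stays NA, so real data with runs of NAs would need a different rule (and Jay would need to say which one).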

-- 
David.
> 
> On Wed, Oct 10, 2012 at 5:16 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>> 
>> On Oct 10, 2012, at 1:31 PM, Jay Rice wrote:
>> 
>>> New to R and having issues with loops. I am aware that I should use
>>> vectorization whenever possible and use the apply functions, however,
>>> sometimes a loop seems necessary.
>>> 
>>> I have a data set of 2 million rows and have tried running a couple of loops of
>>> varying complexity to test efficiency. If I do a very simple loop such as
>>> add every item in a column I get an answer quickly.
>>> 
>>> If I use a nested ifelse statement in a loop it takes me 13 minutes to get
>>> an answer on just 50,000 rows. I am aware of a few methods to speed up
>>> loops: preallocating memory and computing as much as possible outside the
>>> loop (or writing functions and looping over those), but
>>> it seems that even with these speed ups I might have too much data to run
>>> loops.  Here is the loop I ran that took 13 minutes. I realize I can
>>> accomplish the same goal using vectorization (and in fact did so).
>> 
>> You should describe what you want to do, and you should learn to use the vectorized capabilities of R and leave the for-loops for processes that really need them.
>> 
>> 
>>> 
>>> y<-numeric(length(x))
>>> 
>>> for(i in 1:length(x))
>>> 
>>> ifelse(!is.na(x[i]), y[i]<-x[i],
>> 
>> Instead :
>> 
>> y[!is.na(x)] <- x[!is.na(x)]  # No loop.
>> 
>> 
>>> 
>>> ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1]))
>> 
>> When you index outside the range of the length of x you get NA as a result. Furthermore you are setting y to be only a single element. So I think 'y' will be a single NA at the end of all this.
>> 
>> > strataID <- sample(1:2, 10, repl=TRUE)
>> > strataID
>> [1] 1 1 2 2 1 2 2 2 2 1
>> 
>> > for(i in 1:length(x)) {ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1])}
>> > y
>> [1] NA
>> 
>> There is no implicit indexing of the LHS of an assignment operation. How long is strataID? And why not do this inside a dataframe?
>> 
>>> 
>>> Presumably, complicated loops would be more intensive than the nested if
>>> statement above. If I write more efficient loops time will come down but I
>>> wonder if I will ever be able to write efficient enough code to perform a
>>> complicated loop over 2 million rows in a reasonable time.
>>> 
>>> Is it useless for me to try to do any complicated loops on 2 million rows,
>>> or if I get much better at programming in R will it be manageable even for
>>> complicated situations?
>>> 
>> 
>> You will gain efficiency when you learn vectorization, and when you learn to test your code for correct behavior.
>> 
>>> 
>>> Jay
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
>> David Winsemius, MD
>> Alameda, CA, USA
>> 
> 
> 
> 
> -- 
> Jim Holtman
> Data Munger Guru
> 
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.

David Winsemius, MD
Alameda, CA, USA