# [R] practical to loop over 2million rows?

jim holtman jholtman at gmail.com
Thu Oct 11 03:45:15 CEST 2012

```This is a classic example from my tag line:

Tell me what you want to do, not how you want to do it.

For example, you provided no information about what the objects were.
I hope that 'strataID' is at least one element longer than 'x', based
on your loops.  Also, on the last iteration you are trying to access an
element outside of x (x[length(x) + 1]).

The first part, setting 'y', is easy:

indx <- !is.na(x)
y[indx] <- x[indx]

For the second part you can do something like:

indx <- head(strataID, -1) == tail(strataID, -1)  # compare each element with the next

but since you did not provide any data, the rest is left to the reader.

On Wed, Oct 10, 2012 at 5:16 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Oct 10, 2012, at 1:31 PM, Jay Rice wrote:
>
>> New to R and having issues with loops. I am aware that I should use
>> vectorization whenever possible and use the apply functions, however,
>> sometimes a loop seems necessary.
>>
>> I have a data set of 2 million rows and have tried to run a couple of loops of
>> varying complexity to test efficiency. If I do a very simple loop, such as
>> adding every item in a column, I get an answer quickly.
>>
>> If I use a nested ifelse statement in a loop it takes me 13 minutes to get
>> an answer on just 50,000 rows. I am aware of a few methods to speed up
>> loops: preallocating memory and computing as much as possible outside the
>> loop (or creating functions and just looping over the function), but
>> it seems that even with these speed-ups I might have too much data to run
>> loops.  Here is the loop I ran that took 13 minutes. I realize I can
>> accomplish the same goal using vectorization (and in fact did so).
>
> You should describe what you want to do, learn to use the vectorized capabilities of R, and leave the for-loops for processes that really need them.
>
>
>>
>> y<-numeric(length(x))
>>
>> for(i in 1:length(x))
>>
>> ifelse(!is.na(x[i]), y[i]<-x[i],
>
>
> y[!is.na(x)] <- x[!is.na(x)]  # No loop.
>
>
>>
>> ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1]))
>
> When you index outside the range of the length of x you get NA as a result. Furthermore you are setting y to be only a single element. So I think 'y' will be a single NA at the end of all this.
>
>> strataID <- sample(1:2, 10, repl=TRUE)
>> strataID
>  [1] 1 1 2 2 1 2 2 2 2 1
>
>> for(i in 1:length(x)) {ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1])}
>> y
> [1] NA
>
> There is no implicit indexing of the LHS of an assignment operation. How long is strataID? And why not do this inside a dataframe?
>
>>
>> Presumably, complicated loops would be more intensive than the nested if
>> statement above. If I write more efficient loops time will come down but I
>> wonder if I will ever be able to write efficient enough code to perform a
>> complicated loop over 2 million rows in a reasonable time.
>>
>> Is it useless for me to try to do any complicated loops on 2 million rows,
>> or if I get much better at programming in R will it be manageable even for
>> complicated situations?
>>
>
> You will gain efficiency when you learn vectorization. And when you learn to test your code for correct behavior.
>
>>
>> Jay
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> Alameda, CA, USA

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

```
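
The reply above leaves the second part "to the reader". A minimal sketch of how the two steps might fit together follows. It assumes a rule the original poster never actually stated: a missing `x[i]` takes `x[i + 1]` when row `i + 1` is in the same stratum, and `x[i - 1]` otherwise. The example data and the names `ok`, `same_next`, `x_next`, `x_prev`, and `fill` are hypothetical and are not from the thread.

```r
# Hypothetical example data (not from the original thread).
x        <- c(1, NA, 3, NA, 5, 6, NA, 8)
strataID <- c(1, 1, 1, 2, 2, 3, 3, 3)
n <- length(x)

# Part 1: copy the non-missing values without a loop.
y  <- rep(NA_real_, n)
ok <- !is.na(x)
y[ok] <- x[ok]

# Part 2: compare each stratum label with the next one.
# same_next[i] is TRUE when strataID[i] == strataID[i + 1]; the last
# position has no successor, so it is padded with FALSE.
same_next <- c(head(strataID, -1) == tail(strataID, -1), FALSE)

# Shifted copies of x, padded with NA at the ends instead of indexing
# outside the vector (x[n + 1] and x[0] in the original loop).
x_next <- c(x[-1], NA)
x_prev <- c(NA, x[-n])

# ASSUMED rule: use the next value within the same stratum, otherwise
# fall back to the previous value. A neighbour that is itself NA stays NA.
fill <- ifelse(same_next, x_next, x_prev)
y[!ok] <- fill[!ok]

print(y)
```

Everything here operates on whole vectors, so the cost grows linearly with the number of rows and there is no per-element `ifelse()` call. If the intended rule differs, only the `fill` line needs to change.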