[R] effective way to return only the first argument of "which()"

Mike Spam ichmagspam at googlemail.com
Thu Sep 20 17:39:48 CEST 2012


Thank you very much, especially Milan and Bert!
I will do some speedtests and fit the function to my needs.

I think the best way would be a modified function in C...
But i am not familiar enough with C. Perhaps this would be a simple
but useful extension.
If someone has a solution, i would appreciate a post in this mailing list.

Cheers and thanks to all,
Nico


2012/9/19 Bert Gunter <gunter.berton at gene.com>:
> Well, following up on this observation, which can be put under the
> heading of "Sometimes vectorization can be much slower than explicit
> loops" , I offer the following:
>
>  firsti  <-function(x,k)
> {
>   i <- 1
>   while(x[i]<=k){i <- i+1}
>   i
> }
>
>> system.time(for(i in 1:100)which(x>.99)[1])
>    user  system elapsed
>    19.1     2.4    22.2
>
>> system.time(for(i in 1:100)which.max(x>.99))
>    user  system elapsed
>   30.45    6.75   37.46
>
>> system.time(for(i in 1:100)firsti(x,.99))
>    user  system elapsed
>    0.03    0.00    0.03
>
> ## About a 500 - 1000 fold speedup !
>
>> firsti(x,.99)
> [1] 122
>
> It doesn't seem to scale too badly, either (whatever THAT means!):
> (Of course, the which() versions are essentially identical in timing,
> and so are omitted)
>
>> system.time(for(i in 1:100)firsti(x,.9999))
>    user  system elapsed
>    2.70    0.00    2.72
>
>> firsti(x,.9999)
> [1] 18200
>
> Of course, at some point, the explicit looping is awful -- with k =
> .999999, the index was about 360000, and the timing test took 54
> seconds.
>
> So I guess the point is -- as always -- that the optimal approach
> depends on the nature of the data. Prudence and robustness clearly
> demands the vectorized which() approaches if you have no information.
> But if you do know something about the data, then you can often write
> much faster tailored solutions. Which is hardly revelatory, of course.
>
> Cheers to all,
> Bert
>
> On Wed, Sep 19, 2012 at 8:55 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
>> Le mercredi 19 septembre 2012 à 15:23 +0000, William Dunlap a écrit :
>>> The original method is faster than which.max for longish numeric vectors
>>> (in R-2.15.1), but you should check time and memory usage on your
>>> own machine:
>>>
>>> > x <- runif(18e6)
>>> > system.time(for(i in 1:100)which(x>0.99)[1])
>>>    user  system elapsed
>>>   11.64    1.05   12.70
>>> > system.time(for(i in 1:100)which.max(x>0.99))
>>>    user  system elapsed
>>>   16.38    2.94   19.35
>> If you the probability that such an element appears at the beginning of
>> the vector, a custom hack might well be more efficient. The problem with
>> ">", which() and which.max() is that they will go over all the elements
>> of the vector even if it's not needed at all. So you can start with a
>> small subset of the vector, and increase its size in a few steps until
>> you find the value you're looking for.
>>
>> Proof of concept (the values of n obviously need to be adapted):
>> x <-runif(1e7)
>>
>> find <- function(x, lim) {
>>     len <- length(x)
>>
>>     for(n in 2^(14:0)) {
>>         val <- which(x[seq.int(1L, len/n)] > lim)
>>
>>         if(length(val) > 0) return(val[1])
>>     }
>>
>>     return(NULL)
>> }
>>
>>> system.time(for(i in 1:100)which(x>0.999)[1])
>> utilisateur     système      écoulé
>>       9.740       5.795      15.890
>>> system.time(for(i in 1:100)which.max(x>0.999))
>> utilisateur     système      écoulé
>>      14.288       9.510      24.562
>>> system.time(for(i in 1:100)find(x, .999))
>> utilisateur     système      écoulé
>>       0.017       0.002       0.019
>>> find(x, .999)
>> [1] 1376
>>
>> (Looks almost like cheating... ;-)
>>
>>
>>
>>
>>
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>>
>>>
>>> > -----Original Message-----
>>> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
>>> > Of Jeff Newmiller
>>> > Sent: Wednesday, September 19, 2012 8:06 AM
>>> > To: Mike Spam; r-help at r-project.org
>>> > Subject: Re: [R] effective way to return only the first argument of "which()"
>>> >
>>> > ?which.max
>>> > ---------------------------------------------------------------------------
>>> > Jeff Newmiller                        The     .....       .....  Go Live...
>>> > DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>> >                                       Live:   OO#.. Dead: OO#..  Playing
>>> > Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>> > /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>> > ---------------------------------------------------------------------------
>>> > Sent from my phone. Please excuse my brevity.
>>> >
>>> > Mike Spam <ichmagspam at googlemail.com> wrote:
>>> >
>>> > >Hi,
>>> > >
>>> > >I was looking for a function like "which()" but only returns the first
>>> > >argument.
>>> > >Compare:
>>> > >
>>> > >x <- c(1,2,3,4,5,6)
>>> > >y <- 4
>>> > >which(x>y)
>>> > >
>>> > >returns:
>>> > >5,6
>>> > >
>>> > >which(x>y)[1]
>>> > >returns:
>>> > >5
>>> > >
>>> > >which(x>y)[1] is exactly what i need. I did use this but the dataset
>>> > >is too big (~18 mio. Points).
>>> > >That's why i need a more effective way to get the first element of a
>>> > >vector which is bigger/smaller than a specific number.
>>> > >
>>> > >I found "match()" but this function only works for equal numbers.
>>> > >
>>> > >
>>> > >
>>> > >Thanks,
>>> > >Nico
>>> > >
>>> > >______________________________________________
>>> > >R-help at r-project.org mailing list
>>> > >https://stat.ethz.ch/mailman/listinfo/r-help
>>> > >PLEASE do read the posting guide
>>> > >http://www.R-project.org/posting-guide.html
>>> > >and provide commented, minimal, self-contained, reproducible code.
>>> >
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> Internal Contact Info:
> Phone: 467-7374
> Website:
> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm




More information about the R-help mailing list