[R] Alternate ways of finding number of occurrence of an element in a vector.

David Winsemius dwinsemius at comcast.net
Sun Jun 21 18:39:00 CEST 2009


If one puts the gc() call prior to the expressions themselves, one  
gets consistently ...  different results:

library("rbenchmark")
v<-rep(1:500,1:500); x<-5; benchmark(
      which= c(gc(), length(which(x==v))),
      index= c(gc(), length(v[v==x])),
      sum= c(gc(), sum(v==x)),
      replications=200,  columns=c("test","elapsed"), order="elapsed" )
    test elapsed
3   sum   3.299
2 index   3.536
1 which   4.172

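Since the gc() call takes up more than half the time, the differences  
between the methods may be more dramatic than the raw timings suggest.  
Timing gc() on its own gives a baseline to subtract: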

 > v<-rep(1:500,1:500); x<-5; benchmark(
+      which= c(gc()),  index= c(gc()), sum= c(gc()),
+      replications=200,  columns=c("test","elapsed"), order="elapsed" )
    test elapsed
2 index   2.621
3   sum   2.621
1 which   2.631

 > within( benchmark(
+      which= c(gc(), length(which(x==v))),
+      index= c(gc(), length(v[v==x])),
+      sum= c(gc(), sum(v==x)),
+      replications=200,  columns=c("test","elapsed"),
+      order="elapsed" ), {corrected = elapsed-2.62})
    test elapsed corrected
3   sum   3.304     0.684
2 index   3.543     0.923
1 which   4.180     1.560

So the "answer" may not be so simple.
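
The hand-entered 2.62 above could also be measured in the same run. What  
follows is only a sketch under that assumption: the test name "baseline"  
and the column name "corrected" are my own labels, not anything rbenchmark  
defines, and the timings will of course differ by machine and R version.

```r
library("rbenchmark")

v <- rep(1:500, 1:500); x <- 5

# Time gc() alone as an extra test, alongside the three counting methods,
# so the gc() overhead comes from the same benchmark() call.
res <- benchmark(
      baseline = c(gc()),
      which    = c(gc(), length(which(x == v))),
      index    = c(gc(), length(v[v == x])),
      sum      = c(gc(), sum(v == x)),
      replications = 200, columns = c("test", "elapsed"), order = "elapsed")

# Subtract the gc()-only time from every row ("corrected" is a made-up name).
res$corrected <- res$elapsed - res$elapsed[res$test == "baseline"]
res
```

The subtraction assumes gc() costs about the same in every test, which is  
itself only an approximation.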



Allan Engelhardt wrote:

> Answering my own question: if I explicitly garbage collect before the
> benchmark then 'index' always wins, which probably also answers the
> original question.
>
> v<-rep(1:1000,1:1000); x<-5; gc(); benchmark(replications=200,
> columns=c("test","elapsed"), order="elapsed",  
> which=length(which(x==v)),
> index=length(v[v==x]), sum=sum(v==x))
>

> On 19/06/09 16:51, Allan Engelhardt wrote:
>> When trying out a couple of different approaches to this problem I get
>> rather different answers between runs.  Anybody know why?
>>
>>> library("rbenchmark")
>>> v<-rep(1:1000,1:1000); x<-5; benchmark(replications=200,
>> columns=c("test","elapsed"), order="elapsed",
>> which=length(which(x==v)), index=length(v[v==x]), sum=sum(v==x))
>>   test elapsed
>> 3   sum   2.513
>> 2 index   5.512
>> 1 which   6.712
>>> v<-rep(1:1000,1:1000); x<-5; benchmark(replications=200,
>> columns=c("test","elapsed"), order="elapsed",
>> which=length(which(x==v)), index=length(v[v==x]), sum=sum(v==x))
>>   test elapsed
>> 3   sum   2.502
>> 2 index   3.779
>> 1 which   6.650
>>> v<-rep(1:1000,1:1000); x<-5; benchmark(replications=200,
>> columns=c("test","elapsed"), order="elapsed",
>> which=length(which(x==v)), index=length(v[v==x]), sum=sum(v==x))
>>   test elapsed
>> 2 index   3.796
>> 3   sum   5.808
>> 1 which   6.633
>>
>> This pattern appears to repeat (so on the next two runs "sum" will win
>> followed by "index" followed by "sum" twice followed by "index" ...)
>>

>>
>> On 19/06/09 14:55, Praveen Surendran wrote:
>>> Hi,
>>>
>>> I have a vector "v" and would like to find the number of occurrences
>>> of element "x" in the same.
>>>
>>> Is there a way other than,
>>>
>>> sum(as.integer(v==x)) or length(which(x==v))
>>>
>>> to do this.
>>>
>>> I have a huge file to process and do this.  Both the above described
>>> methods are pretty slow while dealing with a large vector.
>>>
>>> Please have your comments.
>>>
>>> Praveen Surendran.
>>>
>>>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



