[R] help: program efficiency

Sat Nov 27 15:07:41 CET 2010

Hello,

Someone pointed out to me off list about this construct:

nodup_sort <- function(x, fun = nodup3){
     i <- sort.list(x)
     x[i] <- fun(x[i])
     x
}

which deals more efficiently with the reordering.

 > x <- sample( 1:100000, size = 300000, replace = TRUE )
 > system.time( nodup_cpp( x ) )
utilisateur     système      écoulé
       0.127       0.005       0.132
 > system.time( nodup_sort( x, nodup3 ) )
utilisateur     système      écoulé
       0.287       0.009       0.295
 > system.time( nodup_sort( x, nodup3a ) )
utilisateur     système      écoulé
       0.168       0.010       0.179
 > system.time( nodup_sort( x, nodup4 ) )
utilisateur     système      écoulé
       0.157       0.005       0.163
 > system.time( nodup_sort( x, nodup_cpp_assumingsorted ) )
utilisateur     système      écoulé
       0.096       0.001       0.097

So in this example, it seems more efficient to sort first and use the 
algorithm assuming that the data is sorted.

There is probably a way to be smarter in nodup_cpp where the bottleneck 
is likely to be related to map::find.

Another version taking some more internally :

nodup_cpp_hybrid <- cxxfunction( signature( x_ = "numeric", sort_ = 
"integer" ), '

     NumericVector input(x_) ;
     NumericVector x  = clone<NumericVector>( x_)  ;
     IntegerVector sort( sort_ ) ;

     int n = x.size() ;
     double current, previous = input[ sort[0] - 1 ] ;
     double index = 0.0 ;
     int si ;
     for( int i=1; i<n; i++){
         si = sort[i] - 1;
         current = input[ si ] ;
         if( current == previous ){
             index += .01 ;
             x[ si ] = current + index ;
         } else {
             index = 0.0 ;
         }
         previous = current ;
     }
     return x ;
', plugin = "Rcpp" )

no big difference:

 > system.time( res6 <- nodup_cpp_hybrid( x, sort.list(x) ) )
utilisateur     système      écoulé
       0.092       0.000       0.092

Profiling reveals this:

 > Rprof()
 > for(i in 1:100) {  res6 <- ( nodup_cpp_hybrid( x, sort.list(x) ) ) }
 > Rprof(NULL)
 > summaryRprof()
$by.self
               self.time self.pct total.time total.pct
"sort.list"        6.50    90.03       6.50     90.03
".Call"            0.42     5.82       0.42      5.82
"file.exists"      0.30     4.16       0.30      4.16

$by.total
                    total.time total.pct self.time self.pct
"nodup_cpp_hybrid"       7.22    100.00      0.00     0.00
"sort.list"              6.50     90.03      6.50    90.03
".Call"                  0.42      5.82      0.42     5.82
"file.exists"            0.30      4.16      0.30     4.16

$sample.interval
[1] 0.02

$sampling.time
[1] 7.22

The 4.16 % taken by file.exists indicates that someone in the inline 
project has to do some work (on my TODO list).

But otherwise sort.list dominates the time.

Romain

Le 26/11/10 21:22, Romain Francois a écrit :
>
> Le 26/11/10 21:13, Romain Francois a écrit :
>>
>> Hello,
>>
>> Can we really make the assumption that the data is sorted. The original
>> example was not:
>>
>>> I am working on a function to make a duplicated value unique. For
>>> example,
>>> the original vector would be like : a = c(2,1,1,3,3,3,4)
>>
>> If we can make the assumption, here is a C++ based version:
>>
>>
>> nodup_cpp_assumingsorted <- cxxfunction( signature( x_ = "numeric" ), '
>>
>> // since we modify x, we need to make a copy
>> NumericVector x = clone<NumericVector>(x_);
>>
>> int n = x.size() ;
>> double current, previous = x[0] ;
>> int index ;
>> for( int i=1; i<n; i++){
>> current = x[i] ;
>> if( current == previous ){
>> x[i] = current + (++index) / 100.0 ;
>> } else {
>> index = 0 ;
>> }
>> previous = current ;
>> }
>> return x ;
>> ', plugin = "Rcpp" )
>>
>>
>> with these results:
>>
>> > x <- sort( sample( 1:100000, size = 300000, replace = TRUE ) )
>>
>> > system.time( nodup3( x ) )
>> utilisateur système écoulé
>> 0.090 0.004 0.094
>> > system.time( nodup3a( x ) )
>> utilisateur système écoulé
>> 0.036 0.005 0.040
>> > system.time( nodup4( x ) )
>> utilisateur système écoulé
>> 0.025 0.004 0.029
>> > system.time( nodup_cpp_assumingsorted( x) )
>> utilisateur système écoulé
>> 0.003 0.001 0.004
>>
>>
>>
>> Now, if we don't make the assumption that the data is sorted, here is
>> another C++ based version:
>>
>> require( inline )
>> require( Rcpp )
>> nodup_cpp <- cxxfunction( signature( x_ = "numeric" ), '
>>
>> // since we modify x, we need to make a copy
>> NumericVector x = clone<NumericVector>(x_);
>>
>> typedef std::map<double,int> imap ;
>> typedef imap::value_type pair ;
>> imap index ;
>> int n = x.size() ;
>> double current, previous = x[0] ;
>> index.insert( pair( previous, 0 ) );
>>
>> imap::iterator it = index.begin() ;
>> for( int i=1; i<n; i++){
>> current = x[i] ;
>> if( current == previous ){
>> x[i] = current + ( ++(it->second) / 100.0 ) ;
>> } else {
>> it = index.find(current) ;
>> if( it == index.end() ){
>> it = index.insert(
>> current > previous ? it : index.begin(),
>> pair( current, 0 )
>> ) ;
>> } else {
>> x[i] = current + ( ++(it->second) / 100.0 ) ;
>> }
>> previous = current ;
>> }
>> }
>> return x ;
>> ', plugin = "Rcpp" )
>>
>>
>> which gives me this :
>>
>> > x <- sample( 1:100000, size = 300000, replace = TRUE )
>> >
>> > system.time( nodup_cpp( x ) )
>> utilisateur système écoulé
>> 0.111 0.002 0.113
>> > system.time( nodup3( sort( x ) ) )
>> utilisateur système écoulé
>> 0.162 0.011 0.172
>> > system.time( nodup3a( sort( x ) ) )
>> utilisateur système écoulé
>> 0.099 0.009 0.109
>> > system.time( nodup4( sort( x ) ) )
>> utilisateur système écoulé
>> 0.089 0.004 0.094
>>
>> so nodup4 is still faster, but the values are not in the right order:
>>
>> > x <- c( 2, 1, 1, 2 )
>> > nodup4( sort( x ) )
>> [1] 1.00 1.01 2.00 2.01
>> > nodup_cpp( x )
>> [1] 2.00 1.00 1.01 2.01
>>
>> Romain
>
> I think this gives a more fair comparison :
>
>  > system.time( nodup_cpp( x ) )
> utilisateur système écoulé
> 0.113 0.002 0.114
>  > system.time( { oo <- order(order(x)) ; nodup3( sort( x ) )[oo] } )
> utilisateur système écoulé
> 0.336 0.012 0.347
>  > system.time( { oo <- order(order(x)) ; nodup3a( sort( x ) )[oo] } )
> utilisateur système écoulé
> 0.251 0.011 0.262
>  > system.time( { oo <- order(order(x)) ; nodup4( sort( x ) )[oo] } )
> utilisateur système écoulé
> 0.287 0.006 0.294
>
>
> Romain
>
>> Le 26/11/10 20:01, William Dunlap a écrit :
>>>
>>>> -----Original Message-----
>>>> From: William Dunlap
>>>> Sent: Thursday, November 25, 2010 9:31 AM
>>>> To: 'randomcz'; r-help at r-project.org
>>>> Subject: RE: [R] help: program efficiency
>>>>
>>>> If the input vector t is known to be ordered
>>>> (or if you only care about runs of duplicated
>>>> values, not all duplicated values) the following
>>>> is pretty quick
>>>>
>>>> nodup3<- function (t) {
>>>> t + (sequence(rle(t)$lengths) - 1)/100
>>>> }
>>>>
>>>> If you don't know if the the input will be ordered
>>>> then ave() will do it a bit faster than your
>>>> code
>>>>
>>>> nodup2<- function (t) {
>>>> ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
>>>> }
>>>>
>>>> E.g., for a sorted sequence of 300,000 numbers drawn with
>>>> replacement from 1:100,000 I get:
>>>>
>>>>> a2<- sort(sample(1:1e5, size=3e5, replace=TRUE))
>>>>> system.time(v<- nodup(a2))
>>>> user system elapsed
>>>> 2.78 0.05 3.97
>>>>> system.time(v2<- nodup2(a2))
>>>> user system elapsed
>>>> 1.83 0.02 2.66
>>>>> system.time(v3<- nodup3(a2))
>>>> user system elapsed
>>>> 0.18 0.00 0.14
>>>>> identical(v,v2)&& identical(v,v3)
>>>> [1] TRUE
>>>>
>>>> If speed is truly an issue, the built-in sequence may
>>>> be replaced by a faster one that does the same thing:
>>>>
>>>> nodup3a<- function (t) {
>>>> faster.sequence<- function(nvec) {
>>>> seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])),
>>>> nvec)
>>>> }
>>>> t + (faster.sequence(rle(t)$lengths) - 1)/100
>>>> }
>>>>
>>>> That took 0.05 seconds on the a2 dataset and produced
>>>> identical results.
>>>
>>> rle() computes a sort of second difference and
>>> nodup3a computes a cumsum on that second diffence,
>>> to get back to a first difference. The following
>>> avoids that wasted operation (along with rle's
>>> computation of the values component of its output).
>>>
>>> nodup4<- function(t) {
>>> n<- length(t)
>>> p<- c(0L, which(t[-1L] != t[-n]), n)
>>> t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
>>> }
>>>
>>> That reduced nodup3a's time by about 30% on that dataset.
>>>
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>>
>>>>> -----Original Message-----
>>>>> From: r-help-bounces at r-project.org
>>>>> [mailto:r-help-bounces at r-project.org] On Behalf Of randomcz
>>>>> Sent: Thursday, November 25, 2010 6:49 AM
>>>>> To: r-help at r-project.org
>>>>> Subject: [R] help: program efficiency
>>>>>
>>>>>
>>>>> hey guys,
>>>>>
>>>>> I am working on a function to make a duplicated value unique.
>>>>> For example,
>>>>> the original vector would be like : a = c(2,1,1,3,3,3,4)
>>>>> I'll like to transform it into:
>>>>> a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
>>>>> basically, find the duplicates and assign a unique value by
>>>>> adding a small
>>>>> amount and keep it in order.
>>>>> I come up with the following codes, but it runs slow if t is
>>>>> large. Is there
>>>>> a better way to do it?
>>>>> nodup = function(t)
>>>>> {
>>>>> t.index=0
>>>>> t.dup=duplicated(t)
>>>>> for (i in 2:length(t))
>>>>> {
>>>>> if (t.dup[i]==T)
>>>>> t.index=t.index+0.01
>>>>> else t.index=0
>>>>> t[i]=t[i]+t.index
>>>>> }
>>>>> return(t)
>>>>> }
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://r.789695.n4.nabble.com/help-program-efficiency-tp305907
>>>> 9p3059079.html
>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>>
>
>

-- 
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://bit.ly/9VOd3l : ZAT! 2010
|- http://bit.ly/c6DzuX : Impressionnism with R
`- http://bit.ly/czHPM7 : Rcpp Google tech talk on youtube