[R] help: program efficiency

William Dunlap wdunlap at tibco.com
Thu Nov 25 18:31:09 CET 2010


If the input vector t is known to be sorted
(or if you only care about runs of duplicated
values, not all duplicated values), the following
is pretty quick:

nodup3 <- function (t) { 
    t + (sequence(rle(t)$lengths) - 1)/100
}
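
To see why this works, here is a small sketch using the vector from
the original question, sorted first (the sorting and the variable names
are my additions for illustration). rle() gives the run lengths and
sequence() turns them into a counter that restarts at 1 in each run:

```r
a <- sort(c(2, 1, 1, 3, 3, 3, 4))   # 1 1 2 3 3 3 4
runs <- rle(a)$lengths              # 2 1 3 1
pos <- sequence(runs)               # 1 2 1 1 2 3 1
offsets <- (pos - 1) / 100          # 0, 0.01, 0, 0, 0.01, 0.02, 0
result <- a + offsets               # 1.00 1.01 2.00 3.00 3.01 3.02 4.00
```

The first element of each run gets offset 0, so it is left unchanged,
which matches what the loop in the original post does.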

If you don't know whether the input will be ordered,
then ave() will do it a bit faster than your
code:

nodup2 <- function (t) { 
    ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
}
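
A small sketch of how the two approaches differ on unsorted input
(the test vector u is made up for illustration). The rle() version only
sees runs of adjacent equal values, while ave() groups all equal values
wherever they occur:

```r
nodup2 <- function(t) ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
nodup3 <- function(t) t + (sequence(rle(t)$lengths) - 1)/100

u <- c(1, 2, 1, 1)   # the value 1 appears in two separate runs
r2 <- nodup2(u)      # 1.00 2.00 1.01 1.02  (all values distinct)
r3 <- nodup3(u)      # 1.00 2.00 1.00 1.01  (1.00 still appears twice)
```

So on unsorted data nodup3 can leave duplicates behind, which is the
"runs only" caveat mentioned above.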

E.g., for a sorted sequence of 300,000 numbers drawn with
replacement from 1:100,000 I get:

> a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
> system.time(v <- nodup(a2))
   user  system elapsed 
   2.78    0.05    3.97 
> system.time(v2 <- nodup2(a2))
   user  system elapsed 
   1.83    0.02    2.66 
> system.time(v3 <- nodup3(a2))
   user  system elapsed 
   0.18    0.00    0.14 
> identical(v,v2) && identical(v,v3)
[1] TRUE

If speed is truly an issue, the built-in sequence may
be replaced by a faster one that does the same thing:

nodup3a <- function (t) {
    faster.sequence <- function(nvec) {
        seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), 
            nvec)
    }
    t + (faster.sequence(rle(t)$lengths) - 1)/100
}

That took 0.05 seconds on the a2 dataset and produced
identical results.
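
For anyone curious how faster.sequence gets the same answer, here is
a step-by-step sketch on a small run-length vector (the intermediate
names idx, starts and shift are mine, not part of the function). It
numbers all elements 1..N and then subtracts, for each element, the
number of elements in the runs before it:

```r
faster.sequence <- function(nvec) {
    seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
}

nvec <- c(2L, 1L, 3L, 1L)                      # run lengths
idx <- seq_len(sum(nvec))                      # 1 2 3 4 5 6 7
starts <- cumsum(c(0L, nvec[-length(nvec)]))   # 0 2 3 6
shift <- rep(starts, nvec)                     # 0 0 2 3 3 3 6
within <- idx - shift                          # 1 2 1 1 2 3 1
```

This should agree with sequence(nvec) whenever all run lengths are
positive, which is always the case for the lengths returned by rle().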

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of randomcz
> Sent: Thursday, November 25, 2010 6:49 AM
> To: r-help at r-project.org
> Subject: [R] help: program efficiency
> 
> 
> hey guys,
> 
> I am working on a function to make duplicated values unique. 
> For example,
> the original vector would be like: a = c(2,1,1,3,3,3,4)
> I'd like to transform it into:
> a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
> basically, find the duplicates and assign a unique value by 
> adding a small
> amount and keep it in order.
> I came up with the following code, but it runs slowly if t is 
> large. Is there
> a better way to do it?
> nodup = function(t)
> {
>   t.index=0
>   t.dup=duplicated(t)
>   for (i in 2:length(t))
>   {
>     if (t.dup[i]==T)
>       t.index=t.index+0.01
>     else t.index=0
>     t[i]=t[i]+t.index
>   }
>   return(t)
> }
> 
> 


