[R] Create sequential vector for values in another column

William Dunlap wdunlap at tibco.com
Fri Oct 11 19:23:02 CEST 2013


> I think all of the above call lapply(split()) at some point and that can use
> a lot of memory when there are lots of unique values in x.  You can use
> a sort-based algorithm to avoid that problem.

E.g.,

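Sequence <-
function(nvec) {
    # Like base::sequence, but faster for long nvec.  It will mess up
    # if sum(nvec) >= 2^31.
    # Count 1..sum(nvec), then subtract each group's starting offset.
    seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
}
f5 <-
function(x) {
    ux <- unique(x)
    code <- match(x, ux)    # integer group code for each element of x
    retval <- integer(length(x))
    # order(code) lists the positions group by group; fill each group with 1..n.
    retval[order(code)] <- Sequence(tabulate(code))
    retval
}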
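
To see why this works, here is a small walk-through of the steps inside
f5() on a five-element vector.  It is only an illustration: the input x and
the intermediate names counts and ord are made up for this example.  It also
relies on order() leaving tied values in their original order, which is what
its help page describes.

```r
# Same helper as above, repeated so the example runs on its own.
Sequence <- function(nvec) {
    seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
}

x <- c("b", "a", "b", "a", "b")    # made-up input, groups not contiguous
ux <- unique(x)                    # "b" "a"
code <- match(x, ux)               # 1 2 1 2 1
counts <- tabulate(code)           # 3 2  (group sizes)
ord <- order(code)                 # 1 3 5 2 4  (positions, group by group)
retval <- integer(length(x))
retval[ord] <- Sequence(counts)    # Sequence(counts) is 1 2 3 1 2
retval                             # 1 1 2 2 3
```

No list of per-group pieces is ever built, which is where the memory saving
over lapply(split()) comes from.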

> x <- sample(rep(1:10e6, each=2)) # 10 million groups of size 2, unsorted
> system.time(r4 <- f4(x))
   user  system elapsed 
 216.74    0.29  217.14 
> system.time(r5 <- f5(x))
   user  system elapsed 
  17.26    0.01   17.27 
> identical(r4,r5)
[1] TRUE

If you know your groups are contiguous you can modify that to be faster still.
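
One possible way to do that, as a rough sketch only: if equal values always
sit next to each other and there are no NAs, the unique()/match()/order()
steps can be skipped and the run lengths fed straight to Sequence().  The
name seqContiguous is hypothetical, not something from this thread.

```r
# Same helper as above, repeated so the example runs on its own.
Sequence <- function(nvec) {
    seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
}

# Hypothetical helper: assumes contiguous groups and no NAs in x.
seqContiguous <- function(x) {
    n <- length(x)
    if (n == 0L) return(integer(0))
    # A run starts at position 1 and wherever a value differs from its predecessor.
    starts <- which(c(TRUE, x[-1L] != x[-n]))
    runLengths <- diff(c(starts, n + 1L))
    Sequence(runLengths)
}

seqContiguous(c(10, 10, 8, 8, 8))    # 1 2 1 2 3
```

If a group is split up in the data, this restarts the count at each run
(as f2() does), so it is only a replacement for f5() when the contiguity
assumption really holds.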

All of these methods mess up if there are NAs in the data.  It is probably
best to run them on the NA-less part of the data, as in
  > x <- c(10,10,10,NA,10, 20,20, 10, NA)
  > id <- integer(length(x)) + NA
  > id[!is.na(x)] <- f5(x[!is.na(x)])
  > id
  [1]  1  2  3 NA  4  1  2  5 NA
    
Don't memorize this algorithm: store the function under a name
like withinGroupSequenceNo and call the function when needed.
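
For instance, a minimal sketch of such a stored function, combining f5()
with the NA handling shown above.  The wrapper body is just one way to
package it; only the name withinGroupSequenceNo comes from the suggestion
above.

```r
# Helpers from earlier in this message, repeated so the example runs on its own.
Sequence <- function(nvec) {
    seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
}
f5 <- function(x) {
    ux <- unique(x)
    code <- match(x, ux)
    retval <- integer(length(x))
    retval[order(code)] <- Sequence(tabulate(code))
    retval
}

# Sketch of a wrapper: NA inputs get NA, everything else goes through f5().
withinGroupSequenceNo <- function(x) {
    id <- rep(NA_integer_, length(x))
    ok <- !is.na(x)
    if (any(ok)) id[ok] <- f5(x[ok])
    id
}

withinGroupSequenceNo(c(10, 10, 10, NA, 10, 20, 20, 10, NA))
# [1]  1  2  3 NA  4  1  2  5 NA
```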

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of William Dunlap
> Sent: Friday, October 11, 2013 9:51 AM
> To: arun; Steven Ranney; r-help at r-project.org
> Subject: Re: [R] Create sequential vector for values in another column
> 
> At this point 3 functions have been suggested and I'll add a 4th:
>   f1 <- function(x)unlist(lapply(unname(split(rep.int(1L,length(x)), x)), cumsum))
>   f2 <- function(x)unlist(sapply(rle(x)$lengths, function(k) 1:k ))
>   f3 <- function(x)ave(x,x,FUN=seq)
>   f4 <- function(x)ave(seq_along(x), x, FUN=seq_along)
> You can compare their results with ftest (as long as their results have the
> same lengths):
>   ftest <- function(x) {
>      data.frame(x, f1=f1(x), f2=f2(x), f3=f3(x), f4=f4(x))
>   }
> They all return the same result for Steven's sample data, which is numeric
> and in sorted order:
>   x0 <- c(123.45, 123.45, 123.45, 123.45, 234.56,
>                234.56, 234.56, 234.56, 234.56, 234.56, 234.56, 345.67, 345.67,
>                345.67, 456.78, 456.78, 456.78, 456.78, 456.78, 456.78, 456.78,
>               456.78, 456.78)
> However, f1() gives the wrong answer if x is not sorted:
>   > ftest(c(30,30,30, 20,20))
>      x f1 f2 f3 f4
>   1 30  1  1  1  1
>   2 30  2  2  2  2
>   3 30  1  3  3  3
>   4 20  2  1  1  1
>   5 20  3  2  2  2
> 
> f1() and f2() give the wrong answer if the groups are split up in the data:
>   > ftest(c(10,10, 8,8,8, 10,10,10)) # 10's not contiguous
>      x f1 f2 f3 f4
>   1 10  1  1  1  1
>   2 10  2  2  2  2
>   3  8  3  1  1  1
>   4  8  1  2  2  2
>   5  8  2  3  3  3
>   6 10  3  1  3  3
>   7 10  4  2  4  4
>   8 10  5  3  5  5
> (It is not clear what result the OP wants here.)
> 
> f3() gives the wrong answer if x is not numeric
>   > f3(c("a","a","a", "b","b"))
>   [1] "1" "2" "3" "1" "2"
> 
> f3() also gives an ominous warning if there is a singleton in x (because
> seq(11) returns 1:11, not a single 1):
>   > f3(c(1,1,1, 11))
>   [1] 1 2 3 1
>   Warning message:
>   In `split<-.default`(`*tmp*`, g, value = lapply(split(x, g), FUN)) :
>     number of items to replace is not a multiple of replacement length
> 
> f2() fails to give an answer if x is a factor
>   > f2(factor(c("x","y","z")))
>   Error in rle(x) : 'x' must be an atomic vector
> 
> I think f4 gives the correct result for all those cases.
> 
> I think all of the above call lapply(split()) at some point and that can use
> a lot of memory when there are lots of unique values in x.  You can use
> a sort-based algorithm to avoid that problem.
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> 
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> > Of arun
> > Sent: Friday, October 11, 2013 6:43 AM
> > To: Steven Ranney; r-help at r-project.org
> > Subject: Re: [R] Create sequential vector for values in another column
> >
> >
> >
> > Also,
> >
> > it might be faster to use ?data.table()
> > library(data.table)
> >  dt1<- data.table(dat1,key='id.name')
> > dt1[,x:=seq(.N),by='id.name']
> > A.K.
> >
> >
> > On , arun <smartpink111 at yahoo.com> wrote:
> > Hi,
> > Try:
> > dat1<-
> >
> > structure(list(id.name = c(123.45, 123.45, 123.45, 123.45, 234.56,
> > 234.56, 234.56, 234.56, 234.56, 234.56, 234.56, 345.67, 345.67,
> > 345.67, 456.78, 456.78, 456.78, 456.78, 456.78, 456.78, 456.78,
> > 456.78, 456.78)), .Names = "id.name", class = "data.frame", row.names = c(NA,
> > -23L))
> > dat1$x <- with(dat1,ave(id.name,id.name,FUN=seq))
> > A.K.
> >
> >
> >
> > On Friday, October 11, 2013 9:28 AM, Steven Ranney <steven.ranney at gmail.com>
> > wrote:
> > Hello all -
> >
> > I have an example column in a dataFrame
> >
> > id.name
> > 123.45
> > 123.45
> > 123.45
> > 123.45
> > 234.56
> > 234.56
> > 234.56
> > 234.56
> > 234.56
> > 234.56
> > 234.56
> > 345.67
> > 345.67
> > 345.67
> > 456.78
> > 456.78
> > 456.78
> > 456.78
> > 456.78
> > 456.78
> > 456.78
> > 456.78
> > 456.78
> > ...
> > [truncated]
> >
> > And I'd like to create a second vector of sequential values (i.e., 1:N) for
> > each unique id.name value.  In other words, I need
> >
> > id.name  x
> > 123.45   1
> > 123.45   2
> > 123.45   3
> > 123.45   4
> > 234.56   1
> > 234.56   2
> > 234.56   3
> > 234.56   4
> > 234.56   5
> > 234.56   6
> > 234.56   7
> > 345.67   1
> > 345.67   2
> > 345.67   3
> > 456.78   1
> > 456.78   2
> > 456.78   3
> > 456.78   4
> > 456.78   5
> > 456.78   6
> > 456.78   7
> > 456.78   8
> > 456.78   9
> >
> > The number of rows per id.name value varies; for some values, nrow()
> > may be 42 and for others it may be 36, etc.
> >
> > The only way I could think of to do this is with two nested for loops.  I
> > tried it but because this data set is so large (nrow = 112,679 with 2,161
> > unique values of id.name), it took several hours to run.
> >
> > Is there an easier way to create this vector?  I'd appreciate your thoughts.
> >
> > Thanks -
> >
> > SR
> > Steven H. Ranney
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >


