[R] Complex sort problem

Petr Savicky savicky at cs.cas.cz
Thu May 17 19:43:27 CEST 2012


On Thu, May 17, 2012 at 06:45:52AM -0400, Axel Urbiz wrote:
> Dear List,
> 
> Is there a way I can sort a sample based on a sort index constructed from
> the data from which the sample is taken? Basically, I need to take 'many'
> samples from the same source data and sort them. This can be very time
> consuming for long vectors. Is there any way I can sort the data only once
> initially, and use that sort order for the samples?
> 
> I believe that idea is what is implemented in tree-based classifiers, so
> the data is sorted only once initially and that sort order is used for the
> child nodes.
> 
> 
> set.seed(12345)
> x <- sample(0:100, 10)
> x.order <- order(x)
> x.sorted <- x[x.order]
> 
> sample.ind <- sample(1:length(x), 5, replace = TRUE)  #sample 1/2 size with replacement
> x.sample <- x[sample.ind]
> 
> x.sample.sorted <-   #??? (without sorting again)

Hi.

Formally, it is possible to avoid sorting using tabulate() and rep().
However, i am not sure, whether this approach is more efficient.

  set.seed(12345)
  x <- sample(0:100, 10)
  x.order <- order(x)
  x.sorted <- x[x.order]
 
  sample.ind <- sample(1:length(x), 5, replace = TRUE)  #sample 1/2 size with replacement
 
  x.sample <- x.sorted[sample.ind]
  freq <- tabulate(sample.ind, nbins=length(x))
  x.sample.sorted <- rep(x.sorted, times=freq)

  identical(sort(x.sample), x.sample.sorted) # [1] TRUE

Note that x.sample is created from x.sorted in order to make x.sample
and x.sample.sorted consistent. Since sample.ind has random order, the
distributions of x[sample.ind] and x.sorted[sample.ind] are the same.

Computing the frequencies of indices, whose range is known in advance,
can be done in linear time, so theoretically more efficiently than
sorting. However, only a test may determine, what is more efficient in
your situation.

Hope this helps.

Petr Savicky.



More information about the R-help mailing list