[R] a problem of approach

Wed Jun 27 18:27:46 CEST 2012

One place to start is to use Rprof to see where time is being spent.
I used the sample you sent and this is what I got:

  0  16.7 root
  1.   16.2 system.time
  2. .   16.1 testfoo
  3. . .   16.1 setdiff
  4. . . .    8.2 as.vector
  5. . . . .    8.2 findSubsets
  6. . . . . .    6.4 increment
  7. . . . . . .    4.2 as.vector
  8. . . . . . . .    3.6 outer
  9. . . . . . . . .    0.3 rep.int
  7. . . . . . .    1.6 c
  7. . . . . . .    0.2 max
  4. . . .    7.9 unique
  5. . . . .    7.3 match
  5. . . . .    0.3 unique.default
  1.    0.5 sort
  2. .    0.5 standardGeneric
  3. . .    0.3 sample
  3. . .    0.2 sort
  4. . . .    0.2 sort.default
  5. . . . .    0.2 sort.int

Of the 16.7 seconds to execute the code, 16.1 was taken up in
'setdiff'.  Maybe there is some other way you can determine the
difference.  So if you continue to use 'setdiff', it does not look
like there is much that can be done.

On Wed, Jun 27, 2012 at 10:36 AM, Adrian Duşa <dusa.adrian at gmail.com> wrote:
> Dear R-help list,
>
> Part of a program I wrote seem to take a significant amount of time,
> therefore I am looking for an alternative approach.
> In order to explain what is does:
>
> - the input is a sorted vector of integer numbers
> - some higher numbers may be derived (using a mathematical formula)
> from lower numbers, therefore they should be eliminated
> - at the end, the vector should contain only uniquely defined numbers
>
> Pet hypothetical example, input vector:
> - 2 3 4 5 6 7 8 9 10
> - number 2 generates 4, 7, 10
> - 2 3 5 6 8 9 (surviving vector)
> - number 3 generates 5 and 9
> - 2 3 6 8 (surviving vector)
> - number 6 generates 8
> - final surviving vector 2 3 6
>
> Function foo(x, ...) generates the numbers, my current approach being:
> ####
> index <- 0
> while ((index <- index + 1) < length(numbers)) {
>    numbers <- setdiff(numbers, foo(numbers[index]))
> }
> ####
>
> This seem to take quite some time (but I don't know any other way of
> doing it), hence my question(s):
> - would there be another (quicker) implementation in R?
> - alternatively, should I go for a C implementation?
>
> (actually, I did create a C implementation, but it doesn't bring any
> more speed... it is actually a bit slower).
>
> A real-life pet example, using the function findSubsets() from the QCA
> package (our foo function above):
>
> ####
> library(QCA)
> testfoo <- function(x, y) {
>    index <- 0
>    while((index <- index + 1) < length(x)) {
>        x <- setdiff(x, findSubsets(y, x[index], max(x)))
>    }
>    return(x)
> }
>
> nofl <- rep(3, 14)
> set.seed(12345)
> numbers <- sort(sample(seq(prod(nofl)), 1000000))
>
> system.time(result <- testfoo(numbers, nofl))
> ####
>   user  system elapsed
>  8.168   2.049  10.148
>
> Any hint will be highly appreciated, thanks in advance,
> Adrian
>
> --
> Adrian Dusa
> Romanian Social Data Archive
> 1, Schitu Magureanu Bd.
> 050025 Bucharest sector 5
> Romania
> Tel.:+40 21 3126618 \
>        +40 21 3120210 / int.101
> Fax: +40 21 3158391
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.