[R] a problem of approach

Wed Jun 27 19:11:29 CEST 2012

If you look, half of the time is spent in the 'findSubsets' function
and the other half in determining where the differences are in the
sets.  Is there a faster way of doing what findSubsets does, since it
is the biggest time consumer?  The setdiff might be sped up by
using 'match'.
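
A minimal sketch of the 'match' idea, assuming the first vector holds no
duplicates. The helper name fast_setdiff is made up for illustration and is
not part of base R. Note that setdiff() is itself built on match() plus
unique(), so any gain would come only from skipping the unique() pass; it
is worth timing on the real data before relying on it.

```r
# Hypothetical helper: elements of x that do not occur in y.
# Assumes x has no duplicates, so the unique() pass in setdiff() is skipped.
fast_setdiff <- function(x, y) {
  # match() returns 0 for elements of x that are absent from y
  x[match(x, y, nomatch = 0L) == 0L]
}

x <- c(5L, 3L, 9L, 1L, 7L)
y <- c(3L, 7L, 8L)
fast_setdiff(x, y)
```

For duplicate-free input this should return the same elements, in the same
order, as setdiff(x, y).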

On Wed, Jun 27, 2012 at 12:51 PM, Adrian Duşa <dusa.adrian at gmail.com> wrote:
> Hi Jim,
>
> On Wed, Jun 27, 2012 at 7:27 PM, jim holtman <jholtman at gmail.com> wrote:
>> One place to start is to use Rprof to see where time is being spent.
>> I used the sample you sent and this is what I got:
>>
>>
>>  0  16.7 root
>>  1.   16.2 system.time
>>  2. .   16.1 testfoo
>>  3. . .   16.1 setdiff
>>  4. . . .    8.2 as.vector
>>  5. . . . .    8.2 findSubsets
>>  6. . . . . .    6.4 increment
>>  7. . . . . . .    4.2 as.vector
>>  8. . . . . . . .    3.6 outer
>>  9. . . . . . . . .    0.3 rep.int
>>  7. . . . . . .    1.6 c
>>  7. . . . . . .    0.2 max
>>  4. . . .    7.9 unique
>>  5. . . . .    7.3 match
>>  5. . . . .    0.3 unique.default
>>  1.    0.5 sort
>>  2. .    0.5 standardGeneric
>>  3. . .    0.3 sample
>>  3. . .    0.2 sort
>>  4. . . .    0.2 sort.default
>>  5. . . . .    0.2 sort.int
>>
>> Of the 16.7 seconds to execute the code, 16.1 were taken up in
>> 'setdiff'.  Maybe there is some other way you can determine the
>> difference.  If you continue to use 'setdiff', it does not look
>> like there is much that can be done.
>
> One thing to notice is that setdiff() is part of the while() loop.
>
> I could in principle loop over the entire vector and eliminate (all)
> the derived numbers at the end, but I have a hunch it might take even
> longer. The point of setdiff() was to progressively shorten the vector
> in order to minimize the time spent in the loop. On the other hand,
> setdiff() overwrites the vector at each iteration and that of course
> also takes time.
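
One way to avoid rewriting the vector on every pass is to keep it at full
length and flag eliminated entries in a logical mask. A rough sketch,
assuming the values are unique; derive() is a made-up stand-in for
findSubsets(), whose code is not shown in this thread, and prune() is
likewise a hypothetical name:

```r
# Hypothetical stand-in for findSubsets(): the numbers derived from v.
derive <- function(v) c(2L * v, 3L * v)

prune <- function(x, derive) {
  keep <- rep(TRUE, length(x))     # TRUE = not yet eliminated
  for (i in seq_along(x)) {
    if (!keep[i]) next             # already eliminated, skip it
    # positions in x of the derived numbers; 0 where there is no match
    hit <- match(derive(x[i]), x, nomatch = 0L)
    keep[hit] <- FALSE             # flag in place; zero indices are ignored
  }
  x[keep]
}

prune(1:12, derive)
```

Each match() call still scans all of x, so this trades the shrinking vector
for fewer allocations. Whether that is a net win probably depends on how
quickly the vector shrinks, so timing both versions is the only way to tell.
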
>
> I thought a C program might prove to be faster (because of the faster
> looping over each value in the vector), but although it works just
> fine it seems I am unable to use C properly, given the similarly long
> time spent (probably because of toying with the memory too much).
>
> Well, any other quicker alternative would do...
> Thanks,
> Adrian
>
> --
> Adrian Dusa
> Romanian Social Data Archive
> 1, Schitu Magureanu Bd.
> 050025 Bucharest sector 5
> Romania
> Tel.:+40 21 3126618 \
>        +40 21 3120210 / int.101
> Fax: +40 21 3158391

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.