[R] a problem of approach

jim holtman jholtman at gmail.com
Wed Jun 27 19:11:29 CEST 2012


If you look, half of the time is spent in the 'findSubsets' function
and the other half in determining where the differences are in the
sets.  Is there a faster way of doing what findSubsets does, since it
is the biggest time consumer?  The setdiff might be sped up by
using 'match'.
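
As a rough sketch of the 'match' idea (fast_setdiff is a hypothetical
helper name, not something from your code), and assuming both vectors
are atomic and of the same type, with no duplicates in the first one:

```r
# Hypothetical helper: set difference computed directly with match().
# Assumes x and y are atomic vectors of the same type and that x
# contains no duplicates, so no unique() pass is needed afterwards.
fast_setdiff <- function(x, y) {
  # match() gives 0L (the nomatch value) for elements of x absent from y
  x[match(x, y, nomatch = 0L) == 0L]
}

x <- c(5L, 3L, 9L, 1L, 7L)
y <- c(3L, 7L)
res <- fast_setdiff(x, y)
print(res)
```

Since the profile shows 'unique' calling 'match' inside setdiff, the
lookup itself is probably the same in both versions, so any gain would
likely be limited to skipping the extra 'unique' and 'as.vector' steps.
It is worth timing on the real data rather than assuming a speedup.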

On Wed, Jun 27, 2012 at 12:51 PM, Adrian Duşa <dusa.adrian at gmail.com> wrote:
> Hi Jim,
>
> On Wed, Jun 27, 2012 at 7:27 PM, jim holtman <jholtman at gmail.com> wrote:
>> One place to start is to use Rprof to see where time is being spent.
>> I used the sample you sent and this is what I got:
>>
>>
>>  0  16.7 root
>>  1.   16.2 system.time
>>  2. .   16.1 testfoo
>>  3. . .   16.1 setdiff
>>  4. . . .    8.2 as.vector
>>  5. . . . .    8.2 findSubsets
>>  6. . . . . .    6.4 increment
>>  7. . . . . . .    4.2 as.vector
>>  8. . . . . . . .    3.6 outer
>>  9. . . . . . . . .    0.3 rep.int
>>  7. . . . . . .    1.6 c
>>  7. . . . . . .    0.2 max
>>  4. . . .    7.9 unique
>>  5. . . . .    7.3 match
>>  5. . . . .    0.3 unique.default
>>  1.    0.5 sort
>>  2. .    0.5 standardGeneric
>>  3. . .    0.3 sample
>>  3. . .    0.2 sort
>>  4. . . .    0.2 sort.default
>>  5. . . . .    0.2 sort.int
>>
>> Of the 16.7 seconds needed to execute the code, 16.1 were taken up
>> in 'setdiff'.  Maybe there is some other way you can determine the
>> difference; if you continue to use 'setdiff', it does not look
>> like there is much that can be done.
>
> One thing to notice is that setdiff() is part of the while() loop.
>
> I could in principle loop over the entire vector and eliminate (all)
> the derived numbers at the end, but I have a hunch it might take even
> longer. The point of setdiff() was to progressively shorten the vector
> in order to minimize the time spent in the loop. On the other hand,
> setdiff() overwrites the vector at each iteration and that of course
> also takes time.
>
> I thought a C program might prove to be faster (because of the faster
> looping over each value in the vector), but although it works just
> fine, it seems I am unable to use C properly, given the similarly long
> time spent (probably because of toying with the memory too much).
>
> Well, any other quicker alternative would do...
> Thanks,
> Adrian
>
> --
> Adrian Dusa
> Romanian Social Data Archive
> 1, Schitu Magureanu Bd.
> 050025 Bucharest sector 5
> Romania
> Tel.:+40 21 3126618 \
>        +40 21 3120210 / int.101
> Fax: +40 21 3158391



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


