[Rd] setequal: better readability, reduced memory footprint, and minor speedup

Hervé Pagès hpages at fredhutch.org
Fri Jan 9 07:21:12 CET 2015


On 01/08/2015 01:30 PM, peter dalgaard wrote:
> If you look at the definition of %in%, you'll find that it is implemented using match, so if we did as you suggest, I give it about three days before someone suggests to inline the function call...

But you wouldn't bet money on that, right? Because you know you would
lose.

> Readability of source code is not usually our prime concern.

Don't sacrifice readability if you do not have a good reason for it.
What's your reason here? Are you seriously suggesting that inlining
makes a significant difference? As Michael pointed out, the expensive
operation here is the hashing. But sadly some people like inlining and
want to use it everywhere: it's easy and they feel good about it, even
if it hurts readability and maintainability. If you use x %in% y
instead of the inlined version, then the day someone changes the
implementation of x %in% y for something faster, or fixes a bug
in it, your code will automatically benefit. Right now it won't.
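
To make the equivalence concrete, here is a small sketch showing that the
inlined form and the %in% form give the same answer. The vectors x and y
are made-up example data, and the definition quoted in the comment is what
base R uses as far as I know; check your own R version before relying on it.

```r
# In base R, %in% is (to my knowledge) a thin wrapper around match():
#   function(x, table) match(x, table, nomatch = 0L) > 0L
x <- c(3L, 1L, 4L, 1L, 5L)   # made-up example data
y <- c(1L, 5L, 9L)

inlined  <- match(x, y, nomatch = 0L) > 0L   # the inlined version
readable <- x %in% y                         # the readable version

# Both are logical vectors of length(x) with the same values.
same <- identical(inlined, readable)
```

Since both forms go through the same match() call, the hashing cost is
the same either way; only the extra function call differs.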

More simply put: good readability generally leads to better code.

>
> The && idea does have some merit, though.
>
> Apropos, why is there no setcontains()?

Wait... shouldn't everybody use all(match(x, y, nomatch = 0L) > 0L) ?
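
More seriously, a minimal sketch of what such a function could look like.
Note that setcontains() does not exist in base R; the name comes from the
question above, and the argument order (does x contain every element of
y?) is my assumption.

```r
# Hypothetical helper, NOT part of base R.
# Returns TRUE if every element of y is also an element of x.
setcontains <- function (x, y)
{
  x <- as.vector(x)
  y <- as.vector(y)
  all(y %in% x)
}

setcontains(1:5, c(2L, 4L))    # every element of y is found in x
setcontains(1:5, c(2L, 6L))    # 6 is not in x
setcontains(1:5, integer(0))   # the empty set is contained in any set
```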

H.

>
> -pd
>
>> On 06 Jan 2015, at 22:02 , Hervé Pagès <hpages at fredhutch.org> wrote:
>>
>> Hi,
>>
>> Current implementation:
>>
>> setequal <- function (x, y)
>> {
>>   x <- as.vector(x)
>>   y <- as.vector(y)
>>   all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
>> }
>>
>> First what about replacing 'match(x, y, 0L) > 0L' and 'match(y, x, 0L) > 0L'
>> with 'x %in% y' and 'y %in% x', respectively. They're strictly
>> equivalent but the latter form is a lot more readable than the former
>> (isn't this the "raison d'être" of %in%?):
>>
>> setequal <- function (x, y)
>> {
>>   x <- as.vector(x)
>>   y <- as.vector(y)
>>   all(c(x %in% y, y %in% x))
>> }
>>
>> Furthermore, replacing 'all(c(x %in% y, y %in% x))' with
>> 'all(x %in% y) && all(y %in% x)' improves readability even more and,
>> more importantly, reduces memory footprint significantly on big vectors
>> (e.g. by 15% on integer vectors with 15M elements):
>>
>> setequal <- function (x, y)
>> {
>>   x <- as.vector(x)
>>   y <- as.vector(y)
>>   all(x %in% y) && all(y %in% x)
>> }
>>
>> It also seems to speed up things a little bit (not in a significant
>> way though).
>>
>> Cheers,
>> H.
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
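
For anyone who wants to check the three variants quoted above side by
side, here is a rough sketch. The names setequal_current, setequal_in and
setequal_and are made up for this comparison so that nothing masks
base::setequal, and the test vectors are arbitrary.

```r
# Variant 1: current implementation, with match() inlined.
setequal_current <- function (x, y)
{
  x <- as.vector(x)
  y <- as.vector(y)
  all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
}

# Variant 2: same logic, written with %in%.
setequal_in <- function (x, y)
{
  x <- as.vector(x)
  y <- as.vector(y)
  all(c(x %in% y, y %in% x))
}

# Variant 3: no c(), so the two logical vectors are never concatenated
# into a temporary of length(x) + length(y). && also skips the second
# %in% entirely when the first all() is already FALSE.
setequal_and <- function (x, y)
{
  x <- as.vector(x)
  y <- as.vector(y)
  all(x %in% y) && all(y %in% x)
}

a <- c(1L, 2L, 3L, 2L)   # arbitrary test data; duplicates are ignored
b <- c(3L, 1L, 2L)
d <- c(1L, 2L, 4L)

results_equal   <- c(setequal_current(a, b), setequal_in(a, b), setequal_and(a, b))
results_unequal <- c(setequal_current(a, d), setequal_in(a, d), setequal_and(a, d))
```

All three should agree on every input; only the readability and the
size of the temporaries differ.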
