[R] millions of comparisons, speed wanted

Martin Maechler maechler at stat.math.ethz.ch
Fri Dec 16 16:27:42 CET 2005


I have not taken the time to look into this example,
but
	daisy()
from the (recommended, hence part of R) package 'cluster'
is more flexible than dist(), particularly in the case of NAs
and for (a mixture of continuous and) categorical variables.

It uses a version of Gower's formula in order to deal with NAs
and asymmetric binary variables.  The example below looks like
a very good match for this problem.
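
A minimal sketch of what calling daisy() on such rows might look like. The
data and the names `rows`, `df` and `dmat` are made up for illustration, the
columns are assumed to be coded as two-level factors, and metric = "gower" is
requested explicitly; this has not been checked against the original data.

```r
library(cluster)  # recommended package that provides daisy()

# Hypothetical 0/1 rows; the last one has a missing first entry.
rows <- rbind(c("0", "1", "0", "1", "1"),
              c("0", "0", "0", "1", "1"),
              c("1", "1", "1", "0", "0"),
              c(NA,  "0", "0", "1", "1"))

# Code every column as a nominal factor with levels "0" and "1".
df <- as.data.frame(rows, stringsAsFactors = TRUE)
df[] <- lapply(df, factor, levels = c("0", "1"))

# Gower dissimilarity; variables that are NA for a pair are left out
# of that pair's computation.
d <- daisy(df, metric = "gower")
dmat <- as.matrix(d)
print(round(dmat, 2))
```

For nominal variables without NAs the result is the share of mismatching
positions, so rows 1 and 2 above, which differ in one position out of five,
get a dissimilarity of 0.2.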

Regards,
Martin Maechler, ETH Zurich


>>>>> "Adrian" == Adrian DUSA <adi at roda.ro>
>>>>>     on Thu, 15 Dec 2005 22:04:01 +0200 writes:

    Adrian> Dear Andy,
    Adrian> On Thursday 15 December 2005 20:57, Liaw, Andy wrote:
    >> Just some untested idea:
    >> If the data are all 0/1, you could use dist(input, method="manhattan"), and
    >> then check which entry equals 1.  This should be much faster than creating
    >> all pairs of rows and check position-by-position.
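
The idea quoted above could be sketched roughly as follows, assuming a purely
0/1 numeric matrix. The matrix `m` and the name `pairs` are illustrative
only and do not come from the original thread.

```r
# Hypothetical 0/1 input, one case per row.
m <- rbind(c(0, 1, 0, 1, 1),
           c(0, 0, 0, 1, 1),
           c(1, 1, 1, 0, 0))

# For 0/1 data the Manhattan distance counts the differing positions.
d <- as.matrix(dist(m, method = "manhattan"))

# Keep each pair once (upper triangle) where exactly one position differs.
pairs <- which(d == 1 & upper.tri(d), arr.ind = TRUE)
print(pairs)
```

Here only rows 1 and 2 differ in exactly one position, so `pairs` holds the
single index pair (1, 2).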

    Adrian> Thanks for the idea, I played a little with it. At the beginning yes, the data 
    Adrian> are all 0/1, but during the minimizing iterations there are also "x" values; 
    Adrian> for example comparing:
    Adrian> 0 1 0 1 1
    Adrian> 0 0 0 1 1
    Adrian> should return
    Adrian> 0 "x" 0 1 1

    Adrian> whereas
    Adrian> 0 "x" 0 1 1
    Adrian> 0 0 0 1 1
    Adrian> shouldn't even be compared (they have a different number of figures).

    Adrian> Replacing "x" with NA in dist does not give the desired result either, as with
    Adrian> NA 0 0 1 1
    Adrian> 0 0 0 1 1
    Adrian> dist returns 0.
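
The behaviour described above can be reproduced with a short sketch: dist()
excludes columns containing an NA from the computation for that pair of rows,
so the remaining four positions match and the distance is 0. The second call
is only one possible workaround, not something proposed in the thread: it
compares the NA patterns themselves, so that rows with "x" in different places
can be told apart. The names `m`, `d_values` and `d_pattern` are illustrative.

```r
m <- rbind(c(NA, 0, 0, 1, 1),
           c(0,  0, 0, 1, 1))

# The column with the NA is dropped for this pair, so the distance is 0.
d_values <- as.numeric(dist(m, method = "manhattan"))

# Assumed workaround: count positions where only one of the rows is NA.
d_pattern <- as.numeric(dist(is.na(m) * 1, method = "manhattan"))

print(c(values = d_values, pattern = d_pattern))
```

A pair would then only qualify for comparison when `d_pattern` is 0, that is,
when both rows have their "x" entries in the same positions.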

    Adrian> I even wanted to see if I could tweak the dist code, but it calls a C program 
    Adrian> and I gave up.

    Adrian> Nice idea anyhow, maybe I'll find a way to use it further.
    Adrian> Best,
    Adrian> Adrian

    Adrian> -- 
    Adrian> Adrian DUSA
    Adrian> Romanian Social Data Archive
    Adrian> 1, Schitu Magureanu Bd
    Adrian> 050025 Bucharest sector 5
    Adrian> Romania
    Adrian> Tel./Fax: +40 21 3126618 \
    Adrian> +40 21 3120210 / int.101




