[R] stats 'dist' euclidean distance calculation

Bert Gunter bgunter.4567 at gmail.com
Thu Mar 15 15:53:23 CET 2018


.... and I believe this whole thread would fit better on the Bioconductor
list than here.

Cheers,
Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Thu, Mar 15, 2018 at 5:11 AM, S Ellison <S.Ellison at lgcgroup.com> wrote:

> > 3x3 subset used
> >          Locus1   Locus2   Locus3
> > Samp1    GG       <NA>     GG
> > Samp2    AG       CA       GA
> > Samp3    AG       CA       GG
> >
> > The Euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2)).
> > My assumption was that the difference between x_i and y_i would be the
> > number of allelic differences at each base pair site between samples.
>
> Base R does not share your assumption, which (from a general purpose stats
> point of view) would be a completely outlandish interpretation of the data.
> As far as base R is concerned, these are just arbitrary character strings
> represented (by default) as factors. Since factors are, internally,
> integers assigned (by default) in increasing lexical order to the levels
> present, if you apply dist() to factors constructed from allele data, you
> will usually get complete nonsense in genetic terms.
>
> You should probably look at something like dist.gene in the ape package; see
> https://www.rdocumentation.org/packages/ape/versions/5.0/topics/dist.gene
>
> S Ellison
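
The point about factor codes can be illustrated with a minimal base R
sketch. It rebuilds the 3x3 subset quoted above as a data frame called
geno (my name, not the original poster's) and assumes the factors reach
dist() as integer codes via data.matrix(); how the original data were
actually converted is not stated in the thread. The helper
locus_mismatches() is hypothetical: it only counts loci whose genotype
strings differ, and it is not the implementation of dist.gene in ape.

```r
# Rebuild the quoted 3x3 subset; genotypes are stored as factors.
geno <- data.frame(
  Locus1 = c("GG", "AG", "AG"),
  Locus2 = c(NA,   "CA", "CA"),
  Locus3 = c("GG", "GA", "GG"),
  row.names = c("Samp1", "Samp2", "Samp3"),
  stringsAsFactors = TRUE
)

# Assumption: the factors are turned into their internal integer codes.
# Codes follow the sorted order of the levels present in each column,
# so Locus1 has AG = 1, GG = 2 and Locus3 has GA = 1, GG = 2.
codes <- data.matrix(geno)
print(codes)

# Euclidean distance on those arbitrary codes. Columns with NA are
# dropped pair by pair and the sum of squares is rescaled by dist().
# Samp2 and Samp3 differ only at Locus3 (codes 1 vs 2), so that
# distance is 1; none of the values count allelic differences.
d_codes <- dist(codes)
print(d_codes)

# Hypothetical helper: number of loci at which two samples carry
# different genotype strings, skipping loci where either is missing.
# It compares strings literally, so "CA" and "AC" would count as different.
locus_mismatches <- function(g) {
  g <- as.matrix(g)  # character matrix, one row per sample
  n <- nrow(g)
  out <- matrix(0, n, n, dimnames = list(rownames(g), rownames(g)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      out[i, j] <- sum(g[i, ] != g[j, ], na.rm = TRUE)
    }
  }
  as.dist(out)
}

d_mismatch <- locus_mismatches(geno)
print(d_mismatch)
```

Comparing d_codes with d_mismatch shows that the two answer different
questions: the first measures distance between level codes, the second
counts differing loci. For real analyses a dedicated function such as
dist.gene, as suggested above, is likely the better choice.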


