[BioC] How to decide which distance metric to use for microarray data clustering?

Wed Oct 7 18:09:31 CEST 2009

On Wed, Oct 7, 2009 at 11:06 AM, Sean Davis <seandavi at gmail.com> wrote:
> On Wed, Oct 7, 2009 at 11:53 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
>> On Wed, Oct 7, 2009 at 10:04 AM, Sean Davis <seandavi at gmail.com> wrote:
>>>
>>>
>>> On Wed, Oct 7, 2009 at 10:49 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>>>
>>>> Besides the distance metric, there are other things that may also be
>>>> important. For example, multiple probesets can map to the same gene. I
>>>> can do clustering on probeset values or on averaged probeset values of
>>>> genes. What factors should I consider when making this kind of
>>>> decision?
>>>>
>>>
>>> It is generally best not to average probes.  You could choose one to be
>>> representative of each gene, but averaging is not the best way to go.
>>
>> Is there any justification for why it is not good to average probes?
>
> It is pretty simple, actually.  Different probes for the same gene do
> not measure the same thing.  In statistical terms, they are not drawn
> from the same distribution.
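
To make the point above concrete, here is a minimal sketch with invented numbers. It assumes two hypothetical probesets annotated to the same gene whose signals move in opposite directions across four samples (for instance, because they target different transcripts). The names probeset_a and probeset_b and all values are made up for illustration and do not come from any real array.

```python
from statistics import mean, pvariance

# Hypothetical log2 intensities for two probesets annotated to the same
# gene, measured across four samples. All values are invented.
probeset_a = [1.0, 2.0, 3.0, 4.0]  # rises across samples
probeset_b = [4.0, 3.0, 2.0, 1.0]  # falls across samples

# Per-sample average of the two probesets.
averaged = [mean(pair) for pair in zip(probeset_a, probeset_b)]

# Each probeset varies across samples, but the average does not.
print("variance of probeset_a: ", pvariance(probeset_a))
print("variance of probeset_b: ", pvariance(probeset_b))
print("variance of the average:", pvariance(averaged))
```

In this deliberately extreme case the averaged profile is flat, so any clustering based on it would see no signal for the gene even though both probesets carry one. Real probesets rarely cancel this cleanly, but the same mechanism can dilute a signal.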
>
>>>> bioDist says something about two popular metrics, but the description
>>>> is terse. I am wondering whether there are more detailed
>>>> comparisons between metrics.
>>>
>>> Often, the metrics produce highly compatible pictures of the data.  The
>>> actual metric you will use may be directed somewhat by the goals of the
>>> analysis but, at least for hierarchical clustering, I think it is difficult
>>> to argue for one "best" or "recommended" metric.
>>>
>>> In practice, you may want to try a few to see how they behave on your data.
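
One way to "try a few" metrics is to compute them side by side on a handful of profiles. The sketch below is only an illustration on invented data, written in plain Python rather than with bioDist. It compares Euclidean distance with a correlation-based distance (1 minus the Pearson correlation). The gene names and values are hypothetical, chosen so that the two metrics disagree about which profile is closest to gene1.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """Correlation-based distance: 1 minus the Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Invented expression profiles across four samples.
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [11.0, 12.0, 13.0, 14.0],  # same shape as gene1, higher level
    "gene3": [4.0, 3.0, 2.0, 1.0],      # similar level, opposite shape
}

def nearest(name, dist):
    """Return the profile closest to `name` under the distance `dist`."""
    others = [other for other in profiles if other != name]
    return min(others, key=lambda o: dist(profiles[name], profiles[o]))

print("nearest to gene1 (Euclidean):  ", nearest("gene1", euclidean))
print("nearest to gene1 (correlation):", nearest("gene1", pearson_distance))
```

Euclidean distance groups profiles with similar absolute levels, while the correlation-based distance groups profiles with similar shape regardless of level. Which behavior is wanted depends on the goal of the analysis.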
>>
>> If the results from different metrics differ, how do I decide
>> which one to use?
>
> If you have a gold standard or another source of information about how
> samples/genes should be measured, you can justify your choice based on
> whether the subjects that are supposed to be most similar actually are.
> Lacking such information, there are other techniques, such as looking at
> cluster stability under resampling, that might be useful to think about.
> Others might have more concrete suggestions about how to go about
> measuring clustering effectiveness; it is a research topic of its own.
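
The resampling idea mentioned above can be sketched roughly as follows. This is one possible reading, not a method prescribed in the thread: genes are resampled with replacement (a bootstrap), the samples are re-clustered each time, and the fraction of rounds in which each pair of samples lands in the same cluster is recorded. The clustering step is a simple stand-in (split on the farthest pair of samples) in place of a real routine such as hierarchical clustering, and the data, the sample names, and the function names are all invented.

```python
import itertools
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def split_in_two(data):
    """Stand-in clustering: seed two clusters with the farthest pair of
    samples, then assign every sample to the nearer seed."""
    names = list(data)
    seed_a, seed_b = max(
        itertools.combinations(names, 2),
        key=lambda p: euclidean(data[p[0]], data[p[1]]),
    )
    return {
        n: 0 if euclidean(data[n], data[seed_a]) <= euclidean(data[n], data[seed_b]) else 1
        for n in names
    }

def coclustering_frequency(data, n_boot=200, seed=0):
    """Fraction of bootstrap rounds in which each sample pair co-clusters."""
    rng = random.Random(seed)
    names = list(data)
    n_genes = len(next(iter(data.values())))
    counts = {pair: 0 for pair in itertools.combinations(names, 2)}
    for _ in range(n_boot):
        # Resample gene indices with replacement.
        cols = [rng.randrange(n_genes) for _ in range(n_genes)]
        boot = {n: [data[n][c] for c in cols] for n in names}
        labels = split_in_two(boot)
        for a, b in counts:
            if labels[a] == labels[b]:
                counts[(a, b)] += 1
    return {pair: c / n_boot for pair, c in counts.items()}

# Invented data: four samples (two groups) measured on six genes.
samples = {
    "A1": [1, 1, 2, 1, 2, 1],
    "A2": [1, 2, 1, 1, 2, 2],
    "B1": [8, 9, 8, 9, 8, 9],
    "B2": [9, 8, 9, 9, 8, 8],
}

freq = coclustering_frequency(samples)
for pair, f in sorted(freq.items()):
    print(pair, f)
```

Pairs with a frequency near 1 or near 0 are grouped consistently across resamples. Intermediate values suggest the grouping depends on which genes happened to be drawn. The same loop could be run once per candidate distance metric to compare how stable the resulting clusters are.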

Do you have a good reference so that I can trace the current research frontier?

>>>> On Wed, Oct 7, 2009 at 12:35 AM, Tim Triche <tim.triche at gmail.com> wrote:
>>>> > look at the bioDist package for some suggestions.
>>>> >
>>>> > the metric to use depends on your task.
>>>> >
>>>> >
>>>> > On Tue, Oct 6, 2009 at 8:52 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> I am looking for the most appropriate distance metric for
>>>> >> clustering a set of microarray data. I read Chapter 12 of
>>>> >> Bioinformatics and Computational Biology Solutions Using R and
>>>> >> Bioconductor, but I'm still not clear what the general guideline is
>>>> >> for choosing an appropriate distance metric out of the many listed in
>>>> >> that chapter. Could somebody let me know how to choose an appropriate
>>>> >> distance metric?