[BioC] Correct use of a distance measure when clustering gene
expression data
Floor Stam
fjstam at bio.vu.nl
Fri Sep 3 14:42:01 CEST 2004
Hi Mick
I think it depends on the kind of similarity that is important to you.
- If you think it is important that genes that show parallel profiles
are clustered together, use Pearson's correlation coefficient. In this
case two genes that peak at the same moment in time, but at a (very)
different height, will be found in the same cluster.
- If, on the other hand, you think it is important that genes with a
similar extent of regulation are clustered together, use Euclidean
distance. This clusters together genes whose peaks occur at roughly
the same height, but whose profiles are not necessarily parallel.
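As a rough illustration of the two options above, here is a small Python sketch. The three profiles are made-up log2 ratios, not real data, and the function names are just for this example. gene_b has the same shape as gene_a at a quarter of the height; gene_c has a similar magnitude to gene_a but a different shape.

```python
import math

def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)

def euclidean_distance(x, y):
    """Return the Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical log2 ratios over five time points (illustrative numbers only).
gene_a = [0.0, 1.0, 4.0, 1.0, 0.0]    # sharp peak at the third time point
gene_b = [0.0, 0.25, 1.0, 0.25, 0.0]  # same shape as gene_a, a quarter of the height
gene_c = [0.0, 2.0, 3.0, 2.5, 1.5]    # similar magnitude to gene_a, different shape

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name,
          "pearson distance to gene_a:", round(pearson_distance(gene_a, other), 3),
          "euclidean distance to gene_a:", round(euclidean_distance(gene_a, other), 3))
```

With these numbers the correlation-based distance puts gene_b closest to gene_a, while Euclidean distance puts gene_c closest, which is the difference between the two choices in a nutshell.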
So it depends on your question. For time-course data, I'd say Pearson's
correlation coefficient gives more relevant clusters. We don't really
know how much of a gene product is necessary for a biological effect
anyway. Moreover, the amount of active protein in a cell depends on a
lot more than just the number of mRNA molecules, and we have no way of
looking at that with a microarray. So I think the shape of the curves
is more important than the amplitude.
Furthermore, if I were you, I would subtract the log values of the
ref-t0 comparison from all other ref-tx comparisons in your first
dataset. Subtracting log ratios cancels the reference, so the result
is each timepoint relative to timepoint 0. The values in your two
datasets then become comparable, and both reflect gene regulation
compared to timepoint 0. It would make it easier to get your head
around what the numbers on your screen actually mean.
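A minimal Python sketch of that subtraction, for one gene in the common-reference dataset. The intensities are invented and `rebase_to_t0` is a hypothetical helper name; the only assumption is that the first value in each profile is the t0 measurement.

```python
import math

def rebase_to_t0(log_ratios):
    """Subtract the t0 log ratio from every time point of one gene's profile.

    Assumes log_ratios[0] is the t0-versus-reference value.
    """
    t0 = log_ratios[0]
    return [value - t0 for value in log_ratios]

# Hypothetical intensities for one gene (illustrative numbers only).
reference = 200.0                        # common reference channel
samples = [100.0, 200.0, 800.0, 400.0]   # t0, t1, t2, t3

# What the common-reference design gives: log2(sample / reference).
vs_reference = [math.log2(s / reference) for s in samples]

# After subtraction: log2(sample / t0), the reference cancels out.
vs_t0 = rebase_to_t0(vs_reference)
print(vs_t0)
```

The rebased values equal log2(tx/t0) directly, because log2(tx/ref) - log2(t0/ref) = log2(tx/t0).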
This is all from a biologist so consult with a mathematician as well!
Hope this is of use to you.
Floor
_______________________________________________________
Floor Stam
Vrije Universiteit Amsterdam
Faculty of Earth and Life Sciences
Department of Molecular and Cellular Neurobiology
De Boelelaan 1085
1081HV Amsterdam
The Netherlands
Ph: +31-20-4447114
+31-20-5665512
Fax: +31-20-4447112
e-mail: fjstam at bio.vu.nl
_______________________________________________________
On 2 Sep 2004 , at 17:38, michael watson (IAH-C) wrote:
> Hi
>
> I have two different data sets, both time-courses. One uses a common
> reference for the Cy3 channel, the other performs direct comparisons
> between treated/untreated samples at each time-point. In both cases
> the actual data is log2(Cy5/Cy3).
>
> After a bit of thought, I've come to the conclusion that as a distance
> measure for the first dataset I will use "1 - pearson correlation
> coefficient". However, for the second dataset, as we performed direct
> comparisons at each time-point, using the correlation coefficient is
> not appropriate, so have decided to use euclidean distance.
>
> Does anyone have experience of what the best distance measure to use is
> for time-courses where direct comparisons are made at each time-point?
>
> Cheers
> Mick