[BioC] significance of "wrong" clustering of differential genes

Sean Davis sdavis2 at mail.nih.gov
Tue Nov 14 00:29:33 CET 2006


In addition to Naomi's comments, remember that a desired property of a 
statistic is that it be "robust" to outliers (ignoring them when 
appropriate).  I think it is probably fine to have some proportion of the 
samples "misclassified" by your clustering.  However, when this happens, it 
is a good idea to make sure that a sample mislabeling or some such thing has 
not occurred.  I have discovered an adult sample in what were supposed to be 
pediatric samples, a mouse cell line among what were supposed to be all 
canine, and other oddities like that by looking back at the data.  Most of the 
time, though, these samples simply represent biological or technical 
variation that we cannot fully explain.
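
As a rough illustration of that kind of check (and of Naomi's point about the distance measure), here is a minimal sketch in base R. Everything in it is made up for the example: the simulated matrix `expr`, the labels in `label`, the deliberately mislabeled sample `s3`, and the rule "flag a sample if it correlates better with the other group than with its own" are assumptions, not anything taken from Benjamin's data or from limma.

```r
## Simulated data only (not from this thread): 200 "selected" genes, 12 samples.
set.seed(1)
n.genes <- 200
label <- factor(rep(c("set1", "set2"), each = 6))  # labels as recorded
truth <- label
truth[3] <- "set2"                                 # s3 is deliberately mislabeled

## Half of the genes are up and half are down in set2 by 2 log2 units.
effect <- rep(c(2, -2), each = n.genes / 2)
expr <- matrix(rnorm(n.genes * 12), nrow = n.genes,
               dimnames = list(paste0("gene", seq_len(n.genes)),
                               paste0("s", 1:12)))
is.set2 <- truth == "set2"
expr[, is.set2] <- expr[, is.set2] + effect        # effect is added to each column

## Center each gene across samples, then correlate samples with each other.
centered <- expr - rowMeans(expr)
r <- cor(centered)
diag(r) <- NA                                      # ignore self-correlation

## For each sample: mean correlation with its own labeled group vs. the other.
lab <- as.character(label)
same <- outer(lab, lab, "==")
own   <- sapply(seq_along(lab), function(i) mean(r[i, same[i, ]], na.rm = TRUE))
other <- sapply(seq_along(lab), function(i) mean(r[i, !same[i, ]]))
flagged <- setNames(own < other, colnames(expr))
print(data.frame(label = lab, own = round(own, 2), other = round(other, 2),
                 row.names = colnames(expr)))
print(names(which(flagged)))

## Clustering depends on the distance: compare Euclidean with 1 - correlation.
cl.euc <- cutree(hclust(dist(t(centered))), k = 2)
cl.cor <- cutree(hclust(as.dist(1 - cor(centered))), k = 2)
print(table(truth, cl.euc))
print(table(truth, cl.cor))
```

A flagged sample is only a prompt to go back to the lab notebook and the annotation; it does not say whether the cause is a labeling error or genuine biological or technical variation.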

Sean

On Monday 13 November 2006 16:02, Naomi Altman wrote:
> The heatmap did not come through (to me).  However, clustering is
> highly dependent on the choice of distance measure.
>
> --Naomi
>
> At 09:57 AM 11/13/2006, Benjamin Otto wrote:
> >Hi,
> >
> >
> >
> >Please imagine the following situation:
> >
> >For two sample sets (set1, set2) the most differentially expressed genes
> >are identified by limma. The p-value correction would be "holm".
> >Afterwards a heatmap is printed for these genes. The procedure would look
> >like:
> > >  f <- factor(as.character(pheno[,marker]))
> > >
> > > design <- model.matrix(~f)
> > >
> > > fit <- eBayes(lmFit(eSet,design))
> > >
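> > > > ## sort.by="none" keeps tab in the same row order as eSet; with the
> > > > ## default sorting, the logical vector below would not line up with eSet
> > > > tab <- topTable(fit, coef=2, number=nrow(eSet), adjust.method="holm",
> > > >                 sort.by="none")
> > > >
> > > > ## "M" is the log2 fold change column (named "logFC" in later limma releases)
> > > > selected <- tab$adj.P.Val < 0.01 & abs(tab$M) >= 1
> > > >
> > > > ## print a heatmap for eSet[selected,]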
> >
> >What can lead to a misclassification in the clustering, say one sample of
> >set1 being clustered together with set2? After all, according to the
> >workflow I have explicitly been searching for the genes that should
> >discriminate between the two sets! However, the expression values displayed
> >in the heatmap suggest that this sample IS more similar to the "wrong" set
> >than to the true one. (Have a look at the jpg.)
> >
> >Is it possible that this sample is always treated as an outlier in the
> >significance calculations?
> >
> >And if so: is it sensible to take such a misclassification as a
> >kind of significance?
> >
> >Regards
> >
> >
> >
> >Benjamin
> >
> >
> >
> >
> >
> >--
> >Benjamin Otto
> >Universitaetsklinikum Eppendorf Hamburg
> >Institut fuer Klinische Chemie
> >Martinistrasse 52
> >20246 Hamburg
> >
> >
> >
> >_______________________________________________
> >Bioconductor mailing list
> >Bioconductor at stat.math.ethz.ch
> >https://stat.ethz.ch/mailman/listinfo/bioconductor
> >Search the archives:
> >http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> Naomi S. Altman                                814-865-3791 (voice)
> Associate Professor
> Dept. of Statistics                              814-863-7114 (fax)
> Penn State University                         814-865-1348 (Statistics)
> University Park, PA 16802-2111


