[BioC] scholarly reference for "don't draw PCA/heatmap dendrograms on DEGs"

Mon Dec 9 16:56:33 CET 2013

These papers don't show clustered heatmaps, but show the inflation of
classification accuracy and survival discrimination in simulated
no-signal data when using differentially expressed genes only.  So if
you consider your clustering as the classifier, they may be relevant:

Simon RM, Subramanian J, Li M-C, Menezes S. Using cross-validation to
evaluate predictive accuracy of survival risk classifiers based on
high-dimensional data. Brief Bioinform. 2011 May 15;12(3):203–14.

Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of
DNA microarray data for diagnostic and prognostic classification. J
Natl Cancer Inst. 2003 Jan 1;95(1):14–8.
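
The simulation Kevin suggests in the message quoted below is easy to try.
Here is a minimal sketch in plain Python. The function names, the Welch
t-statistic used for ranking, the signed-sum sample score, and the defaults
(5000 genes, 10 samples per group, 50 selected genes) are all my own
assumptions for illustration; none of them come from the papers above.

```python
import random
import statistics


def t_statistic(a, b):
    """Welch t-statistic comparing two lists of values."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def noise_scores(n_genes=5000, n_per_group=10, top_k=50, seed=1):
    """Select 'DEGs' from pure noise, then score every sample on them.

    Returns (scores_group_a, scores_group_b). The group labels are
    arbitrary: the first n_per_group columns are called "A", the rest "B".
    """
    rng = random.Random(seed)
    n_samples = 2 * n_per_group
    # Rows are genes, columns are samples; every value is IID N(0, 1).
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
            for _ in range(n_genes)]

    # Rank genes by |t| between the two made-up groups and keep the top_k.
    ranked = sorted(
        ((t_statistic(row[:n_per_group], row[n_per_group:]), row) for row in data),
        key=lambda pair: abs(pair[0]),
        reverse=True,
    )
    top = ranked[:top_k]

    # Per-sample score: sum of the selected genes, each signed so that
    # "up in A" counts positive. This is a crude stand-in for the first
    # principal component or the top split of a dendrogram.
    scores = [
        sum((1.0 if t > 0 else -1.0) * row[j] for t, row in top)
        for j in range(n_samples)
    ]
    return scores[:n_per_group], scores[n_per_group:]


if __name__ == "__main__":
    group_a, group_b = noise_scores()
    print("lowest score in A: ", round(min(group_a), 2))
    print("highest score in B:", round(max(group_b), 2))
```

With these defaults the two made-up groups should come out cleanly
separated even though the data contain no signal at all, because the genes
were selected using the very labels that the scores then appear to recover.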

On Mon, Dec 9, 2013 at 9:05 AM, Kevin Coombes <kevin.r.coombes at gmail.com> wrote:
> I don't have a good reference either.
>
> But you can easily simulate matrices full of IID standard normal data, pick
> the "most differentially expressed" genes, and show that this noise/nonsense
> perfectly separates any two "groups" that you want to pretend are present in
> the data.
>
>   -- Kevin
>
>
> On 12/9/2013 8:55 AM, Lorena Pantano wrote:
>>
>> Hi,
>>
>> I don't have any reference to give you.
>>
>> But in my experience, you don't necessarily get a good heatmap separated
>> by the two conditions even when you use only DE genes. Probably because
>> the DE results are often not strong enough to separate the two groups, or
>> because there is a systematic outlier in your comparison and you get DE
>> genes that are not true, or for some other reason.
>>
>> I can say that I have done more than 50 DE analyses, and only once did I
>> get a clear heatmap showing two groups. So, I guess there is something there.
>>
>> Your initiative is very interesting.
>>
>> cheers
>>
>> Lo
>>
>>
>> On Mon, Dec 9, 2013 at 2:19 PM, Aaron Mackey <ajmackey at gmail.com> wrote:
>>
>>> A colleague of mine is skeptical of my assertion that drawing sample-level
>>> PCA plots and/or clustered heatmaps based only on differentially expressed
>>> genes (DEGs) is a circular, self-fulfilling prophecy -- they assert that
>>> there's no guarantee samples will cluster by condition (despite the fact
>>> that the condition is exactly what drives selection of DEGs), and so hope
>>> to use the observed clustering as further "evidence" of the condition
>>> effects.  Rather than spend more time trying to explain statistical
>>> concepts, I was hoping to checkmate the argument with a nice Nature Methods
>>> review or somesuch.  Any pointers?
>>>
>>> Thanks in advance,
>>> -Aaron
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor