[BioC] scholarly reference for "don't draw PCA/heatmap dendrograms on DEGs"

Paul Geeleher paulgeeleher at gmail.com
Mon Dec 9 18:55:21 CET 2013

Hey Aaron, you can show this fairly easily with a couple of lines of
code (using randomly generated data). I think Kevin suggested
something like this too:

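library(genefilter) # provides rowttests()

mat <- rnorm(100000) # random "gene expression" data
dim(mat) <- c(10000, 10) # reshape to 10,000 genes (rows) x 10 samples (columns)
myfac <- factor(c(rep("a", 5), rep("b", 5))) # two arbitrary groups of 5 samples
tOut <- rowttests(mat, myfac) # per-gene two-sample t-test
sigInd <- order(tOut$p.value)[1:1000] # indices of the 1000 smallest p-values
# only plot PCA using top 1000 "differentially expressed" genes
pcOut <- prcomp(t(mat[sigInd, ]))$x
plot(pcOut, col=myfac) # PC1 vs PC2, colored by group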
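
For contrast, here is a rough, self-contained sketch that draws the PCA twice on the same kind of random data: once on all genes and once on the 1000 genes with the largest t statistics. Some assumptions to flag: the seed is arbitrary, the per-gene equal-variance t statistic is computed by hand in base R as a stand-in for rowttests() (so no packages are needed), and sepPC1() is a hypothetical helper that only gives a crude number for how far apart the two groups sit on PC1.

```r
set.seed(1)  # arbitrary seed (assumption), only so the run is repeatable

nGenes <- 10000
mat <- matrix(rnorm(nGenes * 10), nrow = nGenes, ncol = 10)  # genes x samples
myfac <- factor(rep(c("a", "b"), each = 5))
isA <- myfac == "a"

# Equal-variance two-sample t statistic per gene, in base R.
# Stand-in for genefilter::rowttests(); with 5 vs 5 samples the pooled
# variance is simply the mean of the two group variances.
mA <- rowMeans(mat[, isA])
mB <- rowMeans(mat[, !isA])
vA <- rowSums((mat[, isA] - mA)^2) / 4
vB <- rowSums((mat[, !isA] - mB)^2) / 4
tStat <- (mA - mB) / sqrt((vA + vB) / 2 * (1 / 5 + 1 / 5))

# Top 1000 genes by absolute t statistic (same ranking as by p-value)
sigInd <- order(abs(tStat), decreasing = TRUE)[1:1000]

pcAll <- prcomp(t(mat))$x             # PCA on all genes
pcSel <- prcomp(t(mat[sigInd, ]))$x   # PCA on the selected genes only

# Hypothetical helper: gap between group means on PC1, divided by the
# spread of the samples around their own group mean.
sepPC1 <- function(pc) {
  x <- pc[, 1]
  resid <- c(x[isA] - mean(x[isA]), x[!isA] - mean(x[!isA]))
  abs(mean(x[isA]) - mean(x[!isA])) / sd(resid)
}
sepAll <- sepPC1(pcAll)
sepSel <- sepPC1(pcSel)
print(c(all_genes = sepAll, selected_genes = sepSel))

# Side-by-side PC1 vs PC2 plots, colored by group
par(mfrow = c(1, 2))
plot(pcAll, col = myfac, main = "All genes")
plot(pcSel, col = myfac, main = "Top 1000 by t-test")
```

Since the data are pure noise, any group separation in the second panel comes from the gene selection step alone.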


On Mon, Dec 9, 2013 at 10:18 AM, Aaron Mackey <ajmackey at gmail.com> wrote:
> On Mon, Dec 9, 2013 at 10:38 AM, Cook, Malcolm <MEC at stowers.org> wrote:
>> Have you done either on ALL (not just DE) genes?  If so, do your
>> replicates cluster?  Further, if so, do the distances between replicate
>> clusters scale in any interesting way with condition (i.e. higher dose or
>> better knockdown or longer exposure -> further away from untreated).   I
>> think this can be taken as "evidence" for condition effects that you and
>> your colleague should expect.  Do you agree with this?
> In my experience, I do occasionally see "global" (all genes) clustering in
> (*scaled* and centered) PCA that corresponds to experimental conditions;
> and in such cases I will also find a vast multitude of DEGs (which also
> raises the spectre of whether the usual between-sample normalization
> assumptions are being violated, and whether there may be unequal variances
> between groups).  Or to consider the situation a different way, when a
> small number of DEGs exhibit a very large magnitude of variance, then an
> *unscaled* global PCA may also show experimental clustering (again, just
> driven by the variance of those DEGs).  FYI, there are methods (such as
> implemented in the superpc package) that use the PCA loadings of PCs
> correlated to experimental design to select DEGs.  It's all quite circular.
> Either way, the presence/absence of sample clustering in PCA does not
> provide any more/less independent evidence of treatment effects not already
> captured by the DEGs themselves, and so I usually argue that such
> "DEG-focused" PCA representations are not particularly informative (or at
> least no more informative than some representation of the DEGs themselves).
>  We use the global PCA for QC discovery/confirmation of sample outliers,
> non-experimental batch effects, etc., but not for evaluation of the
> experimental axes of interest.
> -Aaron

Dr. Paul Geeleher, PhD
Section of Hematology-Oncology
Department of Medicine
The University of Chicago
900 E. 57th St.,
KCBD, Room 7144
Chicago, IL 60637
