[BioC] heatmap with variance stabilizing transformed expression data in DESeq

Wed May 8 20:08:21 CEST 2013

Hi Li,

On Wed, May 8, 2013 at 9:54 AM, Wang, Li <li.wang at ttu.edu> wrote:
> Dear List Members
>
> I am a bit confused with the function of varianceStabilizingTransformation in DESeq.
>
> I used the function to transform my expression data of above four fold change differentially expressed genes, and applied heatmap.2 of gplots packages to generate the heatmap of transformed data. I found that the expressional difference between my two conditions after transformation turned to be smaller. In the manual of DESeq, the example figure about heatmap of transformed data also shows less color difference between two conditions.
>
> However, that is opposite to my purpose.  Could anyone give me some suggestion what kind of transformation of data I should do to show the expressional difference of two conditions?

A few quick points:

(1) I think you should rather load all of your data into a
CountDataSet, run the vst on the entire set, then subset out the genes
that you are interested in for the heatmap.

(2) While the example in the DESeq vignette shows less color
difference, as you say, it also shows that the samples cluster
together in a manner that is more expected (samples from the same
condition cluster together only after the vst), so it seems like it's
doing "the right thing."

(3) Is your fold change calculated manually, or is it extracted from
the results from, say, DESeq?

(4) You might be interested in using DESeq2 here -- it's very easy to
do since you already have a count matrix. It introduces another
transform (the rlogTransform) that you can play with.

(5) I have no idea how you did your analysis, but I'd be willing to
bet that the 4-fold change cutoff you use is picking up genes that are
more artifact than a "real" 4-fold change, likely due to genes that
have low read counts in one (or both) conditions. For example, one
could get a 10x change in expression if condition A has 1 read and
condition B has 10, or if condition A has 100 and condition B has 1000
... you'd be more inclined to believe the latter than the former,
right? Naive approaches to estimating logFCs don't discriminate
between the two (combining logFC with p-value thresholds helps to
mitigate this problem, though).
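
To make the 1-vs-10 and 100-vs-1000 example concrete, here is a
minimal Python sketch. It is not DESeq code, and the helper name
naive_log2fc is made up for illustration. It just shows that a plain
ratio of raw counts gives the same log fold change in both cases:

```python
import math

def naive_log2fc(count_a, count_b):
    """Plain log2 ratio of two raw counts (hypothetical helper, no normalization)."""
    return math.log2(count_b / count_a)

# Condition A vs. condition B, low-count gene and high-count gene
low = naive_log2fc(1, 10)
high = naive_log2fc(100, 1000)

# Both are the same 10x change, even though the first rests on
# only a handful of reads.
print(low, high)
```

Since the two values are identical, a cutoff on this number alone
cannot tell the two genes apart.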

If you try DESeq2 (the vignette is very easy to follow), the default
analysis returns shrunken log-fold changes that try to control for
scenarios like these. I believe other methods (in edgeR and
elsewhere?) do similar things. Either way, I'd say it's
worth doing a bit more exploring with your data while keeping these
things in mind.
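
As a rough intuition for what "shrunken" means, here is a small Python
sketch that adds a pseudocount to both counts before taking the ratio.
To be clear, this is only a crude stand-in and NOT how DESeq2 computes
its shrunken log-fold changes; the helper name moderated_log2fc and
the pseudocount of 5 are arbitrary choices for illustration:

```python
import math

def moderated_log2fc(count_a, count_b, pseudocount=5.0):
    """log2 ratio after adding a pseudocount to both counts.

    Hypothetical helper; the pseudocount value is an arbitrary
    assumption, not something taken from DESeq2.
    """
    return math.log2((count_b + pseudocount) / (count_a + pseudocount))

low = moderated_log2fc(1, 10)       # 1 read vs. 10 reads
high = moderated_log2fc(100, 1000)  # 100 reads vs. 1000 reads

# The low-count gene is pulled strongly toward zero, while the
# high-count gene stays close to its naive value of log2(10).
print(low, high)
```

With something like this, a fold-change cutoff is much less likely to
pick up the 1-vs-10 gene while still keeping the 100-vs-1000 one.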

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech