[BioC] DESeq variance stabilisation and clustering

Wed Mar 23 11:10:20 CET 2011

Hi Timothy

On 03/23/2011 10:47 AM, Timothy Hughes wrote:
> We wish to perform clustering on expression data and therefore are
> interested in the variance-stabilizing transformation of DESeq. I understand
> what the purpose of the transformation is namely to produce values whose
> variances are approximately the same, but why is it necessary to do this
> when computing the distance between two values? Or put another way, in what
> way does hierarchical clustering make assumptions about similar variances?
>
> I believe I have the answer, but it would be nice if someone could confirm
> this.
>
> When doing clustering one is often effectively trying to minimize the
> variance within a cluster even if this is not explicitly defined. If we
> consider that the observations being clustered are random variables with a
> variance then we should explicitly account for this variance and use a
> variance stabilising transformation. This avoids the need for trying to
> account for the variance in the clustering process.
[...]

When talking about clustering, it is important to get clear on what you 
are clustering: samples or genes?

In the DESeq vignette, I am clustering samples, i.e., I want to see 
which samples are similar to each other, hoping to find that replicate 
samples appear more similar than samples from different conditions. For 
this, I need a measure of distance between samples. To compare two 
samples, one usually takes the two vectors with the expression values of 
all genes in the respective samples and calculates the distance between 
these vectors. If one uses Euclidean distance, one calculates, for each 
gene, the difference of expression between the two samples, squares all 
these differences, adds up the squares and takes the square root.
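
As a minimal sketch of the calculation just described (plain Python, 
with made-up counts; the function name sample_distance is only for 
illustration and is not part of DESeq):

```python
import math

def sample_distance(sample_a, sample_b):
    """Euclidean distance between the expression vectors of two samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("both samples must cover the same genes")
    # Per-gene difference, squared
    squared_diffs = [(a - b) ** 2 for a, b in zip(sample_a, sample_b)]
    # Add up the squares and take the square root
    return math.sqrt(sum(squared_diffs))

# Hypothetical expression values for three genes in two samples
sample_1 = [10, 200, 3000]
sample_2 = [14, 203, 3000]
print(sample_distance(sample_1, sample_2))  # 5.0
```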

You want all genes to have roughly equal influence on the distance, and 
for this, all genes should have roughly equal variance. If you use raw 
counts, the top ten or so most strongly expressed genes have so much 
more variance than the rest that all the other genes have hardly any 
influence. 
DESeq's VST rectifies this.
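
To illustrate the effect, here is a small sketch with made-up replicate 
counts. It uses a plain square root as a stand-in transformation, which 
stabilises the variance only under the assumption of Poisson-like noise; 
it is not DESeq's actual VST, which is derived from the variance-mean 
dependence fitted to the data. The helper names are hypothetical.

```python
import math

def contribution_shares(sample_a, sample_b):
    """Fraction of the squared Euclidean distance contributed by each gene."""
    squared_diffs = [(a - b) ** 2 for a, b in zip(sample_a, sample_b)]
    total = sum(squared_diffs)
    return [sq / total for sq in squared_diffs]

def sqrt_transform(counts):
    # Stand-in for a VST: the square root roughly equalises the variance
    # of Poisson-like counts. This is NOT DESeq's own transformation.
    return [math.sqrt(c) for c in counts]

# Made-up replicate counts for a weak, a medium and a strong gene; each
# pair differs by roughly the square root of its mean (Poisson-like noise).
replicate_1 = [14, 95, 9950]
replicate_2 = [18, 105, 10050]

raw_shares = contribution_shares(replicate_1, replicate_2)
vst_shares = contribution_shares(sqrt_transform(replicate_1),
                                 sqrt_transform(replicate_2))

print([round(s, 3) for s in raw_shares])
print([round(s, 3) for s in vst_shares])
```

With these numbers, the strong gene accounts for nearly all of the 
squared distance on the raw scale, while after the transformation each 
of the three genes contributes about one third.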

So, my motivation to add the VST to DESeq was to give the user a 
possibility to calculate distances about

You seem to be talking about clustering genes, not samples, however. I 
had not yet thought about this application, but I think your explanation 
goes in the right direction.

As strong genes have strong variance in all samples, all samples will 
contribute equally to any measure of distance between two genes. So, we 
don't have the issue I just discussed that different components 
influencing the distance have unequal weight. However, the variance of 
the distance measure itself is now vastly different between weak and 
strong genes. Two strong genes which actually behave similarly will not 
cluster together because their large values amplify the noise 
contributions to the distance, while two weak genes will always have a 
small distance because their small expression values also let their 
distance appear small. Again, the VST changes the scales such that 
typical distances (as difference, not ratio) between genes become 
independent of overall expression strength.
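
A toy example of this gene-wise case, again with invented counts and 
with the square root standing in for a variance-stabilising 
transformation (an assumption that fits Poisson-like noise only, and not 
DESeq's own VST). Two strong genes share the same underlying level and 
differ only by noise, while two weak genes show opposite patterns across 
the two conditions. The sketch needs Python 3.8 or later for math.dist.

```python
import math

def sqrt_transform(profile):
    # Stand-in for a VST (suitable for Poisson-like noise only);
    # this is NOT DESeq's own transformation.
    return [math.sqrt(x) for x in profile]

# Made-up counts across four samples (two per condition).
# Strong genes: same underlying level (~10000), differences are noise only.
strong_a = [9900, 10100, 9950, 10050]
strong_b = [10100, 9900, 10050, 9950]
# Weak genes: genuinely opposite behaviour between the two conditions.
weak_c = [4, 4, 16, 16]
weak_d = [16, 16, 4, 4]

# Euclidean distances between gene profiles on the raw scale
raw_strong = math.dist(strong_a, strong_b)
raw_weak = math.dist(weak_c, weak_d)

# The same distances after the stand-in transformation
vst_strong = math.dist(sqrt_transform(strong_a), sqrt_transform(strong_b))
vst_weak = math.dist(sqrt_transform(weak_c), sqrt_transform(weak_d))

print("raw:", round(raw_strong, 2), round(raw_weak, 2))
print("transformed:", round(vst_strong, 2), round(vst_weak, 2))
```

On the raw scale the two similarly behaving strong genes end up much 
farther apart than the two oppositely behaving weak genes; after the 
transformation the order is reversed, which is what one would want for 
clustering genes.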

   Simon