[BioC] How to compute the stabilized variance in DESeq?

Peng Yu pengyu.ut at gmail.com
Tue Apr 10 14:49:40 CEST 2012


On Tue, Apr 10, 2012 at 3:50 AM, Simon Anders <anders at embl.de> wrote:
> Dear Peng
>
>
> On 04/09/2012 11:46 PM, Peng Yu wrote:
>>
>> Hi Simon,
>>
>> I see the resVarA&  resVarB columns, which is the ratio between the
>>
>> stabilized variance and the actual variance. To get the stabilized
>> variance, is the following R code correct?
>>
>> var(values_A)*resVarA
>
>
> No, for several reasons:
>
> - assuming that values_A is a matrix, 'var' gives you a single value
> computed over all values. You want 'rowVars'
>
> - resVarA/B was the ratio of per-gene variance estimate and fitted variance.
> Both are estimates, neither is an 'actual' value in the sense of a
> population value. You might be confusing things with the
> 'varianze-stabilizing transformation', because there is no quantity in DESeq
> that we call the 'stabilized variance'. Hence, I don't know what you are
> actually looking for.

The purpose is to plot the means and variances (of the normalized
count per sample) of a few genes, so that people can easily see the
changes or no changes in these genes. I understand that the variances
of the normalized count per sample is not directly used in the
statistical test (as in (14) of the GB paper, the variances of the sum
of normalized count are used).

I should have used 'fitted variance' if that is the terminology.
Essentially, the fitted mean vs dispersion curve is fitted. Based on
the mean of each gene (of course of the same condition), the fitted
dispersion for that gene is computed. Using this fitted dispersion and
the mean, the "fitted variance" is computed. Right? (Just to double
check my understand, as I read this long time ago.)

Given the following formula, shown in the vignette.

v = s μ + α s^2 μ^2

To make sure, v means the variance of the raw counts (i.e., (3) in the
GB paper). So the fitted variance of the normalized count for a given
condition (when the size factors are close to 1 (which is usual), the
distribution of the normalized count (hence its mean and variance) of
the same gene across samples in the same condition can be consider the
same) should be just μ + α μ^2? (If using maximum sharingMode, the
variance will be the actual sample variance, if it is above the fitted
value (roughly speaking), right?)

> - You are using an outdated version of DESeq. In the current version, the
> resVarA/B columns have been removed. See the new vignette for details and
> reasons for this.

It is hard for me find which paragraph/page you referred, would you
please point it to me?

> See also the description of the new 'fitInfo' function in the vignette,
> which might give you what you want.

The fitted mean dispersion function is now a parametric function. The
vignette and ?estimateDispersions do not say why this is better than
locfit. It seems that given tens of thousands of genes in a typical
genome, locfit is more likely going to be better than parametric. Is
there a rational why parametric becomes the default? Otherwise, to be
backward compatible, I think that locfit should still be the default.

> cds=makeExampleCountDataSet()
> cds=estimateSizeFactors( cds )
> cds=estimateDispersions( cds )
> conditions(cds)
A1 A2 B1 B2 B3
 A  A  B  B  B
Levels: A B
> x$dispFunc
function (q)
coefs[1] + coefs[2]/q
<environment: 0x10878b270>
attr(,"coefficients")
asymptDisp  extraPois
 0.2202342  0.6951945
attr(,"fitType")
[1] "parametric"

There are two conditions A and B, according to (3) in the GB paper,
v_i,rho(j) should be specific to the condition rho. Why there is only
one dispFunc? Should there be two such functions? Or you actually
assumed v_i (without dependence on rho)?

Also, since there is already a "maximum" option of for sharingMode
(which, I think, was not available in the original version), why not
add an additional option to allow shrinkage (as in usual Bayesian
analysis)?

-- 
Regards,
Peng



More information about the Bioconductor mailing list