[BioC] edgeR -- gene expression variability

Thu Jan 5 01:10:23 CET 2012

Dear Miguel,

I'm afraid that I don't understand your questions.  There is no quantity 
in edgeR called "Con", there is no sensible way that I know of to 
normalize counts using the dispersion, nor any need to do so, and I do not 
follow for what quantity you are trying to obtain a confidence interval.

I would prefer that you did a little more background reading before 
sending more questions.  The three papers by Mark Robinson and myself 
about edgeR might help, and there's plenty of public documentation on the 
coefficient of variation:

  http://en.wikipedia.org/wiki/Coefficient_of_variation

The dispersion is the square of a coefficient of variation, and a 
coefficient of variation is always dimensionless, because CV=sd/mean and 
the dimensions of the sd and the mean cancel out.
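
As a minimal illustration in base R, using made-up numbers that do not come from this thread, rescaling a set of measurements changes the sd and the mean by the same factor, so their ratio is unchanged. The helper name cv() and the two vectors are assumptions for the sketch only.

```r
# Hypothetical measurements of the same quantity expressed in two units
x_small <- c(12, 15, 9, 14, 10)
x_large <- x_small * 1000          # same values, rescaled by a factor of 1000

# Coefficient of variation: standard deviation divided by the mean
cv <- function(x) sd(x) / mean(x)

sd(x_large) / sd(x_small)          # the sd scales with the units
cv(x_small)                        # the CV ...
cv(x_large)                        # ... is the same in either unit
all.equal(cv(x_small), cv(x_large))
```

The same reasoning applies to the squared CV, which is why no unit needs to be attached to a dispersion value.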

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Wed, 4 Jan 2012, Miguel Gallach wrote:

> Sorry again Gordon,
>
> In addition to the previous question, what is the unit of dispersion? I
> mean, is the dispersion calculated for the logCon, Con or counts? This
> should be important if I want to calculate confidence intervals, right?
> In addition, why is logCon != log2(Conc)? When I compute log2(Conc)
> myself, the result is not exactly equal to the logCon provided by edgeR.
> Sorry for being so picky, but I really want to understand where the data
> come from.
>
>
> Many thanks again and all the best,
> Miguel
>
> On Wed, Jan 4, 2012 at 9:06 AM, Miguel Gallach <
> miguel.gallach at vetmeduni.ac.at> wrote:
>
>> Dear Gordon,
>>
>> thanks so much for your answer.
>>
>> Here you have the version info:
>>
>> sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] splines   stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] limma_3.10.0 edgeR_2.4.1
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.14.0
>>
>>
>> I understand the problem of having only two replicates, but it is the best I
>> can have. However, let me ask you another question: I found a negative
>> correlation between expression level and sqrt(dispersion). I think this is
>> kind of logical, so I just "normalized" the data by dividing
>> sqrt(dispersion)/expression. However, I did this thinking that
>> sqrt(dispersion) was a kind of s.d. But now, since you tell me that
>> sqrt(dispersion) is equivalent to sd/mean, I am not sure my normalization
>> is appropriate (I mean, I am dividing by mean express. twice.) Is my
>> interpretation correct?
>>
>>
>> Thanks again,
>> Miguel
>>
>>
>>
>>
>> On Wed, Jan 4, 2012 at 1:04 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>
>>> Dear Miguel,
>>>
>>> What you are doing seems correct.  Although of course expecting to get
>>> good estimates of genewise dispersions from just two libraries (one degree
>>> of freedom) is a bit optimistic.  edgeR tries to do the best that can be
>>> done.
>>>
>>> The edgeR manual tells you that the sqrt(dispersion) is the biological
>>> coefficient of variation.  Coefficient of variation means sd/mean rather
>>> than variance.  It is a more appropriate measure of variability than the
>>> standard deviation for quantities that are strictly positive.
>>>
>>> The reason why estimateTagwiseDisp() returns a limited number of distinct
>>> dispersions is that it maximizes the tagwise dispersions on a grid of 200
>>> possible dispersion values.  estimateGLMTagwiseDisp() does something
>>> similar, but adds an extra refinement step in which it interpolates a cubic
>>> spline through the grid values and maximizes the spline.  Hence the
>>> dispersion values from estimateTagwiseDisp() are taken from a (largish) set
>>> of preset values whereas those from estimateGLMTagwiseDisp() are always
>>> different.
>>>
>>> This has no major impact I think on a practical analysis.  Nevertheless
>>> we have modified estimateTagwiseDisp() on Bioc devel to work like
>>> estimateGLMTagwiseDisp(), so in future they will behave in a directly
>>> comparable way.
>>>
>>> Please give sessionInfo() output so that we can see what versions of the
>>> package you are using.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>> Date: Mon, 2 Jan 2012 13:40:59 +0100
>>>> From: Miguel Gallach <miguel.gallach at vetmeduni.ac.at>
>>>> To: bioconductor at r-project.org
>>>> Subject: [BioC] edgeR -- gene expression variability
>>>>
>>>> Hi List,
>>>>
>>>> I am analyzing my RNA-Seq data with edgeR. The following is my
>>>> experimental design:
>>>>
>>>>
>>>> d.GLM
>>>> An object of class "DGEList"
>>>> $samples
>>>>                  group lib.size norm.factors
>>>> R4.Hot     HotAdaptedHot 17409289    0.9881635
>>>> R5.Hot     HotAdaptedHot 17642552    1.0818144
>>>> R9.Hot    ColdAdaptedHot 20010974    0.8621807
>>>> R10.Hot   ColdAdaptedHot 14064143    0.8932791
>>>> R4.Cold   HotAdaptedCold 11968317    1.0061084
>>>> R5.Cold   HotAdaptedCold 11072832    1.0523857
>>>> R9.Cold  ColdAdaptedCold 22386103    1.0520949
>>>> R10.Cold ColdAdaptedCold 17408532    1.0903311
>>>>
>>>>
>>>> As you can see, R4 and R5 are replicates of the same biological group
>>>> (Hot adapted), and the same is true for R9 and R10 (Cold adapted).
>>>>
>>>> I am interested in measuring for each gene its expression variability
>>>> within a biological group (at each temperature) to discern genes that
>>>> might be tightly regulated (or under stabilizing selection). The question in
>>>> particular is: How can I get tagwise dispersion values for the pairs
>>>> (R4.Hot + R5.Hot), (R9.Hot + R10.Hot), (R4.Cold + R5.Cold), (R9.Cold +
>>>> R10.Cold). I assume that the square root of each tagwise dispersion value
>>>> can be interpreted as the expression variance of the corresponding gene
>>>> (i.e., biological variation), as I understood from the edgeR manual. Am I
>>>> correct?
>>>>
>>>> I tried to calculate it like this:
>>>>
>>>> R4.R5.HC = edgeR_expressed_genes[,1:2]
>>>> #I tell edgeR there is only one factor, two replicates
>>>> group = factor(c("HC", "HC"))
>>>> Hot.Hot = DGEList(counts = R4.R5.HC, group = group)
>>>> Hot.Hot = calcNormFactors(Hot.Hot)
>>>> Hot.Hot = estimateCommonDisp(Hot.Hot)
>>>> Hot.Hot = estimateTagwiseDisp(Hot.Hot)
>>>>
>>>> (and similarly for (R9.Hot + R10.Hot), (R4.Cold + R5.Cold), (R9.Cold +
>>>> R10.Cold)).
>>>>
>>>> What I don't understand is why I just got 20 different dispersion values
>>>> for all genes:
>>>>
>>>> dim(table(Hot.Hot$tagwise.dispersion))
>>>> [1] 20
>>>>
>>>> However, when I use the d.GLM dataset (i.e., the 8 samples for the 2x2
>>>> factor design) I get one different dispersion value for each gene:
>>>>
>>>> dim(table(d.GLM1$tagwise.dispersion))
>>>> [1] 9418
>>>>
>>>>
>>>> Why is this?
>>>>
>>>> Can I get gene expression variability in a better way to fulfill my aim?
>>>>
>>>>
>>>> Thank you very much,
>>>> Miguel Gallach
>>>>
>>
>> --
>> Miguel Gallach
>> Institut für Populationsgenetik
>> Veterinärmedizinische Universität Wien
>> Josef Baumann Gasse 1
>> 1210 Wien
>> Austria
>>
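
The quoted reply above explains that estimateTagwiseDisp() picks each dispersion from a fixed grid, while estimateGLMTagwiseDisp() additionally interpolates a spline through the grid values and maximizes that. The toy sketch below illustrates only this general idea in base R. It is not edgeR's code: the objective function, the grid of 20 points, the 1000 simulated "genes" and all variable names are assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical per-gene optima and a hypothetical coarse grid of candidates
true_opt <- runif(1000, 0.01, 0.5)
grid     <- seq(0.01, 0.5, length.out = 20)

# Toy objective: a curve peaked at each gene's own optimum
objective <- function(x, centre) -(x - centre)^2

# Grid-only maximization: the estimate must be one of the grid points
grid_est <- sapply(true_opt, function(ctr) {
  grid[which.max(objective(grid, ctr))]
})

# Grid plus refinement: fit a cubic spline through the grid values,
# then maximize the spline over the range of the grid
spline_est <- sapply(true_opt, function(ctr) {
  f <- splinefun(grid, objective(grid, ctr))
  optimize(f, interval = range(grid), maximum = TRUE)$maximum
})

length(unique(grid_est))     # cannot exceed the number of grid points
length(unique(spline_est))   # many more distinct values
```

With grid-only maximization the number of distinct estimates is capped by the grid size, whatever the number of genes, which matches the kind of behaviour reported in the original question.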
