[BioC] EdgeR and libsize normalization

Wed Aug 1 16:15:28 CEST 2012

Hi François,

Please don't take emails off the list (I've re-copied the list here).  Also, this document may help you:
http://bioconductor.org/help/mailing-list/posting-guide/

> Could you confirm (or not) what I am saying here?

Unfortunately, I haven't been able to decode what you are doing or are trying to do.  I've added a couple comments and suggestions below.

Best regards,
Mark

On 31.07.2012, at 09:07, François RICHARD wrote:

> Hi Mark,
> 
> Thanks a lot for your reply. I will try to be a bit clearer.
> 
> I have noticed that the TMM normalization adds an offset to the model
> and that equalizeLibSizes() is called automatically, but I wanted to see
> the impact of both normalizations on the DE analysis.
> 
> Indeed, when I run the DE analysis on my experiment, some of the genes
> with very different numbers of reads between the two conditions (the
> top 30 by delta on raw counts between conditions A and B) are not
> called as DE genes. I wanted to see whether those genes were smoothed
> out by one of the normalizations and, if so, which one.

Maybe give some examples?

> I have tried with and without TMM, but those genes are still not
> called as DE, and the libsize normalization does not seem to have
> a huge impact on the counts (both library sizes are quite similar
> already).

Give more details.  What are your lib sizes, norm factors?

> Then I tried to characterise those "expected DE genes" a bit more
> and realized that they were highly expressed as well. So maybe they
> are not called as DE just because of the negative binomial model used
> in the analysis?

Depends how large the dispersion is, no?  Maybe show your estimates of dispersion?  In any exercise of differential expression, it's really about the change in the mean, relative to the variability (roughly).
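
To make "change in the mean, relative to the variability" concrete, here is a rough sketch in Python. It is not what edgeR computes internally: the function names, the dispersion of 0.2, and the three replicates per group are assumptions chosen only for illustration, and the ratio is a crude z-like quantity, not a test statistic.

```python
import math


def nb_variance(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


def rough_signal_to_noise(mean_a, mean_b, dispersion, n_per_group):
    """Difference in group means divided by its approximate standard error."""
    var_of_diff = (nb_variance(mean_a, dispersion)
                   + nb_variance(mean_b, dispersion)) / n_per_group
    return abs(mean_a - mean_b) / math.sqrt(var_of_diff)


# Two hypothetical genes with the same raw delta of 3000 reads.
high_expr = rough_signal_to_noise(10000, 13000, dispersion=0.2, n_per_group=3)
low_expr = rough_signal_to_noise(500, 3500, dispersion=0.2, n_per_group=3)

print(f"highly expressed gene: {high_expr:.2f}")
print(f"lowly expressed gene:  {low_expr:.2f}")
```

With these made-up numbers, the same raw delta is weak evidence for the highly expressed gene and much stronger evidence for the lowly expressed one, because the variability grows with the mean.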

> Even if I am using the common.dispersion it does not
> mean that the variability is set to the same value for all of the
> genes, is that right?
> I read (but I cannot remember where) that the variance is set to
> var = mean * (1 + comm.disp * mean) for each gene. Is that right?

If using common dispersion, yes.
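
For reference, the negative binomial variance is usually written as var = mean + dispersion * mean^2, so a common dispersion still gives each gene its own variance through its mean. A minimal Python sketch (the function names and the dispersion value of 0.1 are illustrative assumptions, not edgeR output):

```python
def nb_variance(mean, dispersion):
    """Negative binomial variance: mean * (1 + dispersion * mean)."""
    return mean * (1.0 + dispersion * mean)


def squared_cv(mean, dispersion):
    """Squared coefficient of variation: 1 / mean + dispersion."""
    return nb_variance(mean, dispersion) / mean ** 2


common_dispersion = 0.1  # hypothetical value
for mean in (10, 1000, 100000):
    print(mean,
          nb_variance(mean, common_dispersion),
          squared_cv(mean, common_dispersion))
```

The squared coefficient of variation is 1/mean + dispersion, so for highly expressed genes it approaches the dispersion itself: the Poisson part becomes negligible and the relative variability stops shrinking as counts grow.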

> In such case, it would be normal that highly expressed genes would
> need a really big delta between two conditions to be called as DE
> genes (and would explain what I observed in my analysis).

Bit hard to tell with "delta".  We typically look at log-fold-changes.  But, regardless, this conversation would be enriched if you gave examples and details.
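
As a small illustration of why a raw delta and a log-fold-change can tell different stories, here is a sketch in Python. The counts are invented, the effective library sizes are assumed equal, and the `prior_count` argument is a hypothetical stabiliser for small counts, not edgeR's exact calculation.

```python
import math


def log2_fold_change(count_a, count_b, eff_lib_a, eff_lib_b, prior_count=0.5):
    """log2 ratio (B over A) of counts scaled by effective library size."""
    rate_a = (count_a + prior_count) / eff_lib_a
    rate_b = (count_b + prior_count) / eff_lib_b
    return math.log2(rate_b / rate_a)


eff_lib = 1_000_000  # assumed identical in both conditions

# (name, count in A, count in B): made-up genes
genes = [("high_expr", 10000, 11000), ("low_expr", 10, 80)]

for name, a, b in genes:
    delta = b - a
    lfc = log2_fold_change(a, b, eff_lib, eff_lib)
    print(f"{name}: delta={delta}, log2FC={lfc:.2f}")
```

Ranking by delta puts the highly expressed gene first, while ranking by log-fold-change puts the lowly expressed gene first.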

> 
> Could you confirm (or not) what I am saying here?
> 
> Thanks a lot,
> Kind regards,
> 
> François
> 
> 
> 
> 2012/7/24 Mark Robinson <mark.robinson at imls.uzh.ch>:
>> Hi Francois,
>> 
>> I'm a little confused as to what you are asking.
>> 
>> You asked a similar question last week:
>> https://stat.ethz.ch/pipermail/bioconductor/2012-July/047057.html
>> 
>> 
>>> Correct me if I am wrong but to have the counts after TMM I am doing :
>>> TMM_counts = raw_counts / ( libsize * norm.factor )
>> 
>> These are normalized "values", but they are, of course, no longer counts.  You could multiply by 1e6 and have a (normalized) counts per million interpretation.
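
A minimal sketch of that calculation in Python, assuming one sample with a known library size and TMM normalization factor (the function name and all numbers are made up for illustration):

```python
def normalized_cpm(counts, lib_size, norm_factor):
    """Counts per million using the effective library size.

    effective library size = lib_size * norm_factor
    """
    effective_lib_size = lib_size * norm_factor
    return [c / effective_lib_size * 1e6 for c in counts]


raw_counts = [50, 200, 0]   # hypothetical counts for three genes
lib_size = 2_000_000        # hypothetical total reads in the sample
norm_factor = 0.8           # hypothetical TMM factor

print(normalized_cpm(raw_counts, lib_size, norm_factor))
```

These values are useful for plotting or inspection only; the raw counts are still what goes into the model.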
>> 
>> 
>>> But how to get the counts after TMM and lib.size normalization ?
>> 
>> The edgeR user's guide says:
>> 
>> "The edgeR methodology needs to work with the original digital expression counts, so these should not be transformed in any way by users prior to analysis."
>> 
>> And, in your own words:
>> 
>> "[TMM] gives a normalization factor that will correspond to an offset in
>> the model that will test for differential expressed genes."
>> 
>> So, what do you actually mean by "counts after […] normalization"?  The normalization doesn't actually change the raw counts; it changes the offset in the model.
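
One way to picture "changes the offset" is a log-linear model in which the expected count is exp(log abundance + offset), with the offset taken as the log of the effective library size. The sketch below is a simplified Python illustration under that assumption; the library sizes, normalization factors, and abundance are invented, and it does not reproduce edgeR's fitting code.

```python
import math

lib_sizes = [2_000_000, 3_000_000]  # hypothetical library sizes
norm_factors = [1.1, 0.9]           # hypothetical TMM factors

# The offset for each sample is the log of its effective library size.
offsets = [math.log(n * f) for n, f in zip(lib_sizes, norm_factors)]


def expected_count(log_abundance, offset):
    """Expected count under a log-linear model with an offset."""
    return math.exp(log_abundance + offset)


# A gene with the same relative abundance (1e-4) in both samples.
log_abundance = math.log(1e-4)
fitted = [expected_count(log_abundance, o) for o in offsets]

raw_counts = [220, 270]  # observed counts stay exactly as sequenced
print("raw counts:     ", raw_counts)
print("expected counts:", [round(m, 1) for m in fitted])
```

Here the two samples have different expected counts for a gene that is not DE, purely because the offsets differ; the observed counts themselves are never rescaled.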
>> 
>> 
>>> Calling equalizeLibSizes(d) gives me a common libsize (N value),
>>> but I am not sure how to rescale the normalization factors obtained from the raw counts.
>> 
>> equalizeLibSizes() gets used only for the purpose of estimating the dispersion parameter, and generally does not need to be called directly.  Do you have a reason for calling it directly?
>> 
>> Best,
>> Mark
>> 
>> 
>> 
>> ----------
>> Prof. Dr. Mark Robinson
>> Bioinformatics
>> Institute of Molecular Life Sciences
>> University of Zurich
>> Winterthurerstrasse 190
>> 8057 Zurich
>> Switzerland
>> 
>> v: +41 44 635 4848
>> f: +41 44 635 6898
>> e: mark.robinson at imls.uzh.ch
>> o: Y11-J-16
>> w: http://tiny.cc/mrobin
>> 
>> ----------
>> http://www.fgcz.ch/Bioconductor2012
>> http://www.eccb12.org/t5
>> 
>> 
>> 
>> On 24.07.2012, at 10:58, François RICHARD wrote:
>> 
>>> Dear all,
>>> I am using edgeR on RNA-seq data for differential expression analysis.
>>> 
>>> I would like to see the impact of the two normalizations (TMM +
>>> libsize) on the counts.
>>> 
>>> Correct me if I am wrong but to have the counts after TMM I am doing :
>>> TMM_counts = raw_counts / ( libsize * norm.factor )
>>> 
>>> But how to get the counts after TMM and lib.size normalization ?
>>> Calling equalizeLibSizes(d) gives me a common libsize (N value),
>>> but I am not sure how to rescale the normalization factors obtained from the raw counts.
>>> 
>>> Can someone help me?
>>> 
>>> Thanks a lot
>>> 
>>> François
>>> 