[BioC] normalizing time course RNA-Seq data

Mark Robinson mark.robinson at imls.uzh.ch
Sun Jan 1 22:53:49 CET 2012


Hi Anand,

Some comments injected below ...

On 28.12.2011, at 10:50, AKSR wrote:

> Hi all,
> 
> I have some RNA-Seq data:
> 4 reps per sample, 4 different genotypes & 9 time points
> = 144 data points
> 
> I want to essentially know the best method to normalize across 
> ALL time points  and for each INDIVIDUAL  genotype.
> Is the state of the art normalization method today, TMM?

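I'm not sure whether TMM is "best", but it can certainly improve things.  The idea behind TMM is that naively scaling by the total number of mapped reads can bias differential expression calls, because different experimental conditions can express different "repertoires" of genes: if a few genes are very highly expressed in one condition, they take up a larger share of the reads and everything else looks down-regulated by comparison.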
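
To make that composition effect concrete, here is a minimal Python sketch with made-up counts. The counts, the 30% trim and the helper name trimmed_mean_log_ratio are illustrative assumptions, and this is not edgeR's implementation (which, among other things, also trims on abundance and weights the log-ratios). Two samples are sequenced to the same depth, but sample B has one extra, very highly expressed gene:

```python
import math

# Made-up counts for illustration: five genes, two samples with the
# same total depth. Gene 5 is seen only in sample B and takes up half
# of B's reads, so the four shared genes get fewer reads in B.
counts_a = [200, 400, 600, 800, 0]
counts_b = [100, 200, 300, 400, 1000]

total_a = sum(counts_a)
total_b = sum(counts_b)

def per_million(counts, total):
    """Scale raw counts to counts per million of the given library size."""
    return [c / total * 1e6 for c in counts]

cpm_a = per_million(counts_a, total_a)
cpm_b = per_million(counts_b, total_b)

# Log2 ratios (M values) of B versus A, for genes seen in both samples.
m_values = [math.log2(b / a) for a, b in zip(cpm_a, cpm_b) if a > 0 and b > 0]

def trimmed_mean_log_ratio(m, trim=0.3):
    """Mean of the M values after dropping a fraction `trim` from each tail.
    Hypothetical helper: unweighted, trims on M only."""
    m = sorted(m)
    k = int(len(m) * trim)
    kept = m[k:len(m) - k]
    return sum(kept) / len(kept)

# With totals-based scaling every shared gene looks two-fold down in B.
# The trimmed mean of the M values picks up that common offset.
offset = trimmed_mean_log_ratio(m_values)
scale_b = 2 ** offset  # normalization factor for B relative to A

# Using total_b * scale_b as B's effective library size puts the
# shared genes back on the same scale as in A.
cpm_b_adjusted = per_million(counts_b, total_b * scale_b)

print(m_values)
print(scale_b)
print(cpm_a[:4], cpm_b_adjusted[:4])
```

After the adjustment the four shared genes have matching values in A and B, which is the behaviour one hopes to see from a normalization factor in this situation.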

> If yes, is TMM step-by-step procedure available any where?
> (I do some Perl scripting, but I am pretty new to R)

TMM is available in edgeR's calcNormFactors() function.

> I realize that edgeR might be using TMM for pair-wise 
> comparison,  but I need to perform normalization across 
> time points for each genotype. 
> Irrespective of normalization strategy, will I have to choose
> the base level sample aka reference for normalization?
> Or can normalization be done independent of an 
> overtly defined reference state? 
> - I know this is a naive question, sorry...
> (If required, I would use time point zero as my reference state)

With TMM, you can specify which reference sample to use, or leave it unspecified, which is the default.  The docs for calcNormFactors() say:

----
If ‘refColumn’ is unspecified, the library whose upper quartile is
     closest to the mean upper quartile is used.
----
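
The rule quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample names and counts are made up, pick_reference and upper_quartile are hypothetical helpers, and computing the quartile on counts divided by library size is an assumption about the details rather than something the quoted docs spell out.

```python
import statistics

def upper_quartile(values):
    """75th percentile, interpolating linearly between order statistics."""
    v = sorted(values)
    pos = 0.75 * (len(v) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (pos - lo) * (v[hi] - v[lo])

def pick_reference(samples):
    """Return the name of the sample whose upper quartile is closest
    to the mean upper quartile across all samples.

    Assumption: the quartile is taken on counts scaled by library size.
    """
    uq = {}
    for name, counts in samples.items():
        total = sum(counts)
        uq[name] = upper_quartile([c / total for c in counts])
    mean_uq = statistics.mean(list(uq.values()))
    return min(uq, key=lambda name: abs(uq[name] - mean_uq))

# Made-up counts for three libraries (e.g. three time points).
samples = {
    "t0": [10, 20, 30, 40, 100],
    "t1": [12, 18, 33, 41, 96],
    "t2": [5, 10, 15, 120, 50],
}
reference = pick_reference(samples)
print(reference)
```

With these toy numbers the middle library, "t1", is picked. If you would rather anchor everything to time point zero, that is what setting the reference column explicitly is for.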

While TMM is pairwise in nature, it may work just fine this way across your genotypes and time points.  I think it's worth trying it and looking at "smear" plots -- plotSmear() in edgeR -- between some of your time points (of the same genotype, say), just to see whether the normalization factors bring the bulk of the M values (the log-ratios between the two samples) in line with zero.

Other normalization strategies that are not explicitly pairwise are implemented too -- see ?calcNormFactors.  For example, method="RLE", as proposed by the DESeq authors:

----
     ‘method="RLE"’ is the scaling factor method proposed by Anders and
     Huber (2010). We call it "relative log expression", as median
     library is calculated from the geometric mean of all columns and
     the median ratio of each sample to the median library is taken as
     the scale factor.
----
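
As a rough illustration of the description quoted above, the median-ratio calculation can be sketched in Python as follows. The function name rle_scale_factors and the counts are made up for the example, dropping genes with a zero count is an assumption made so that the geometric mean is defined, and the sketch stops at the raw median ratios rather than trying to reproduce any further rescaling edgeR applies.

```python
import math
import statistics

def rle_scale_factors(counts):
    """counts: list of genes, each a list of per-sample counts.
    Returns one scale factor per sample (hypothetical helper)."""
    n_samples = len(counts[0])
    # Assumption: keep only genes with a positive count in every sample,
    # so that the geometric mean is defined and nonzero.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # The per-gene geometric mean across samples plays the role of the
    # "median library" in the quoted description (computed on the log scale).
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log-ratio of sample j to the geometric mean, gene by gene.
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        # The scale factor is the median ratio for that sample.
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors

# Made-up counts: sample 2 is sequenced twice as deeply as sample 1,
# and sample 3 four times as deeply.
counts = [
    [10, 20, 40],
    [100, 200, 400],
    [5, 10, 20],
    [0, 3, 7],      # dropped: zero count in one sample
    [50, 100, 200],
]
factors = rle_scale_factors(counts)
print(factors)
```

Because every sample is compared against the same per-gene geometric mean, no single library has to be nominated as the reference, which is why this approach is not explicitly pairwise.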

People are also actively approaching this problem from other directions (e.g. correcting for GC content). For example:

http://www.bioconductor.org/packages/release/bioc/html/cqn.html
http://www.biomedcentral.com/1471-2105/12/480/abstract

Hope that helps,
Mark


> Thanks in advance for guiding me
> AKSR


