[BioC] RNASeq: normalization issues

Wei Shi shi at wehi.EDU.AU
Mon May 2 02:33:28 CEST 2011


Hi Yiwen:

	As Davis said, the "length+quantile" method I mentioned in the previous correspondences is not the "quantile normalization" option in calcNormFactors function in edgeR. That's the reason why you didn't see gene length adjustment with that function.

	Adjusting read counts using gene length (total exon length) will put all genes on the same baseline within the sample (longer transcripts produce more reads), and quantile between-sample normalization will make all samples have the same read count distribution (and library size will become the same as well). This is what I mean by "length+quantile" normalization. The quantile normalization here is the same quantile normalization applied to microarray data, however it is applied to sequencing data in a different way (used as offsets in the general linear model). 

	Now I elaborate how to do this normalization. Suppose you have a read count matrix of "x" with rows being genes and columns being samples . Also suppose you have a numeric vector "gene.length" which includes total exon length for each gene and gene order in "gene.length" is the same with that in "x". The following line of code yields the number of reads per 1000 bases for each gene:

x1 <- x*1000/gene.length

Now perform quantile normalization for gene length adjusted data:

library(limma)
x2 <- normalizeBetweenArrays(x1,method="quantile")

Suppose x has two columns named "wt" and "ko". Create a design matrix:

snames <- factor(c("wt","ko"))
design <- model.matrix(~snames)

Now get the offsets for each gene in each sample. The offsets are the intensity differences between raw data and normalized data.

library(edgeR)
y <- DGEList(counts=x,group=colnames(x)) 
lowcounts <- rowSums(x)<5 
offset <- log(x[!lowcounts,]+0.1)-log(x2[!lowcounts,]+0.1) 
yf <- y[!lowcounts,] 

Fit general linear models to read count data with offsets included:

y.glm <- estimateCRDisp(y=yf,design=design,offset=offset,trend=TRUE, tagwise=TRUE) 
fit <- glmFit(y=y.glm,design=design,dispersion=y.glm$CR.tagwise.dispersion,offset=offset)

Perform likelihood ratio tests to find differentially expressed genes: 

DE <- glmLRT(y.glm,fit) 
dt <- decideTestsDGE(DE) 
summary(dt)

Hope this will work for you!

Cheers,
wei



On May 2, 2011, at 8:52 AM, Davis McCarthy wrote:

> Hi Yiwen
> 
> The "quantile normalization" option in calcNormFactors in edgeR does something very different from the quantile normalization (microarray-style) that Wei has been discussing.
> 
> The quantile normalization in calcNormFactors computes an offset for sequencing library depth after Bullard et al (2010) [1]. This is an approach in the same vein as TMM normalization [2] or scaled median [3].  
> 
> I believe that the approach that Wei is suggesting is more similar to the quantile normalization approach that has been taken with microarray data, adjusting the data so that the response follows the same distribution across (in this context) sequenced libraries. This will typically result in non-integer data from adjusting counts, but count-based methods could still be used if this quantile normalization were treated as an offset for each observation in (e.g.) a generalized linear model. 
> 
> Cheers
> Davis
> 
> 
> [1] http://www.biomedcentral.com/1471-2105/11/94 
> [2] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2864565/?tool=pubmed
> [3] http://genomebiology.com/2010/11/10/R106#B13
> 
> 
>> Hi Wei,
>> 
>> Could you elaborate on how to appropriately do gene-length-adjusted
>> quantile normalization in edgeR? The "quantile normalization" option in
>> calcNormFactors function does not seem to take into account the gene
>> length.
>> 
>> Thanks.
>> Yiwen
>>> Hi João:
>>> 
>>> 	Maybe you can try different normalization methods for your data to see
>>> which one looks better. How to best normalize RNA-seq data is still of
>>> much debate at this stage.
>>> 
>>> 	You can try scaling methods like TMM, RPKM, or 75th percentile, which
>>> as
>>> you said normalize data within samples. Or you can try quantile
>>> between-sample normalization (read counts should be adjusted by gene
>>> length first), which performs normalization across samples. You can try
>>> all these in edgeR package.
>>> 
>>> 	From my experience, I actually found the quantile method performed
>>> better
>>> for my RNA-seq data. I used general linear model and likelihood ratio
>>> test in edgeR in my analysis.
>>> 
>>> 	Hope this helps.
>>> 
>>> Cheers,
>>> Wei
>>> 
>>> On Apr 28, 2011, at 7:36 PM, João Moura wrote:
>>> 
>>>> Dear all,
>>>> 
>>>> 
>>>> Until now I was doing RNAseq DE analysis and to do that I understand
>>>> that
>>>> normalization issues only matter inside samples, because one can assume
>>>> the
>>>> length/content biases will cancel out when comparing same genes in
>>>> different
>>>> samples.
>>>> Although, I'm now trying to compare correlation of different genes and
>>>> so,
>>>> this biases should be taken into account - for this is there any better
>>>> method than RPKM?
>>>> 
>>>> My main doubt is if I should also take into acount the biases inside
>>>> samples
>>>> and to do that is there any better approach then TMM by Robinson and
>>>> Oshlack
>>>> [2010]?
>>>> 
>>>> Thank you all,
>>>> --
>>>> João Moura
>>>> 
>>>> 	[[alternative HTML version deleted]]
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> 
>>> 
>>> ______________________________________________________________________
>>> The information in this email is confidential and intend...{{dropped:6}}
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> 
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> 
> --------------------------------------------------
> Davis J McCarthy
> Research Technician
> Bioinformatics Division
> Walter and Eliza Hall Institute of Medical Research
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> dmccarthy at wehi.edu.au
> http://www.wehi.edu.au


______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}



More information about the Bioconductor mailing list