[BioC] RNASeq: normalization issues

ywchen at jimmy.harvard.edu ywchen at jimmy.harvard.edu
Mon May 2 03:34:43 CEST 2011


Thanks.
> Hi Yiwen:
>
> 	It is a single factor experiment with six libraries. There were four cell
> types in this experiment, one of which had three replicates and others
> did not have replicates.
>
> Cheers,
> Wei
>
> On May 2, 2011, at 11:21 AM, ywchen at jimmy.harvard.edu wrote:
>
>> Hi Wei and Davis,
>>
>> Thank you so much for such detailed explanations! Now it is very clear.
>> In your case you found the benefit of using quantile
>> normalization+GLM+LRT,
>> is it single factor with many libraries or multiple factor data?
>>
>> Yiwen
>>
>>
>>> Hi Yiwen:
>>>
>>> 	As Davis said, the "length+quantile" method I mentioned in the
>>> previous
>>> correspondences is not the "quantile normalization" option in
>>> calcNormFactors function in edgeR. That's the reason why you didn't see
>>> gene length adjustment with that function.
>>>
>>> 	Adjusting read counts using gene length (total exon length) will put
>>> all
>>> genes on the same baseline within the sample (longer transcripts
>>> produce
>>> more reads), and quantile between-sample normalization will make all
>>> samples have the same read count distribution (and library size will
>>> become the same as well). This is what I mean by "length+quantile"
>>> normalization. The quantile normalization here is the same quantile
>>> normalization applied to microarray data, however it is applied to
>>> sequencing data in a different way (used as offsets in the general
>>> linear
>>> model).
>>>
>>> 	Now I elaborate how to do this normalization. Suppose you have a read
>>> count matrix of "x" with rows being genes and columns being samples .
>>> Also suppose you have a numeric vector "gene.length" which includes
>>> total
>>> exon length for each gene and gene order in "gene.length" is the same
>>> with that in "x". The following line of code yields the number of reads
>>> per 1000 bases for each gene:
>>>
>>> x1 <- x*1000/gene.length
>>>
>>> Now perform quantile normalization for gene length adjusted data:
>>>
>>> library(limma)
>>> x2 <- normalizeBetweenArrays(x1,method="quantile")
>>>
>>> Suppose x has two columns named "wt" and "ko". Create a design matrix:
>>>
>>> snames <- factor(c("wt","ko"))
>>> design <- model.matrix(~snames)
>>>
>>> Now get the offsets for each gene in each sample. The offsets are the
>>> intensity differences between raw data and normalized data.
>>>
>>> library(edgeR)
>>> y <- DGEList(counts=x,group=colnames(x))
>>> lowcounts <- rowSums(x)<5
>>> offset <- log(x[!lowcounts,]+0.1)-log(x2[!lowcounts,]+0.1)
>>> yf <- y[!lowcounts,]
>>>
>>> Fit general linear models to read count data with offsets included:
>>>
>>> y.glm <- estimateCRDisp(y=yf,design=design,offset=offset,trend=TRUE,
>>> tagwise=TRUE)
>>> fit <-
>>> glmFit(y=y.glm,design=design,dispersion=y.glm$CR.tagwise.dispersion,offset=offset)
>>>
>>> Perform likelihood ratio tests to find differentially expressed genes:
>>>
>>> DE <- glmLRT(y.glm,fit)
>>> dt <- decideTestsDGE(DE)
>>> summary(dt)
>>>
>>> Hope this will work for you!
>>>
>>> Cheers,
>>> wei
>>>
>>>
>>>
>>> On May 2, 2011, at 8:52 AM, Davis McCarthy wrote:
>>>
>>>> Hi Yiwen
>>>>
>>>> The "quantile normalization" option in calcNormFactors in edgeR does
>>>> something very different from the quantile normalization
>>>> (microarray-style) that Wei has been discussing.
>>>>
>>>> The quantile normalization in calcNormFactors computes an offset for
>>>> sequencing library depth after Bullard et al (2010) [1]. This is an
>>>> approach in the same vein as TMM normalization [2] or scaled median
>>>> [3].
>>>>
>>>> I believe that the approach that Wei is suggesting is more similar to
>>>> the quantile normalization approach that has been taken with
>>>> microarray
>>>> data, adjusting the data so that the response follows the same
>>>> distribution across (in this context) sequenced libraries. This will
>>>> typically result in non-integer data from adjusting counts, but
>>>> count-based methods could still be used if this quantile normalization
>>>> were treated as an offset for each observation in (e.g.) a generalized
>>>> linear model.
>>>>
>>>> Cheers
>>>> Davis
>>>>
>>>>
>>>> [1] http://www.biomedcentral.com/1471-2105/11/94
>>>> [2] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2864565/?tool=pubmed
>>>> [3] http://genomebiology.com/2010/11/10/R106#B13
>>>>
>>>>
>>>>> Hi Wei,
>>>>>
>>>>> Could you elaborate on how to appropriately do gene-length-adjusted
>>>>> quantile normalization in edgeR? The "quantile normalization" option
>>>>> in
>>>>> calcNormFactors function does not seem to take into account the gene
>>>>> length.
>>>>>
>>>>> Thanks.
>>>>> Yiwen
>>>>>> Hi João:
>>>>>>
>>>>>> 	Maybe you can try different normalization methods for your data to
>>>>>> see
>>>>>> which one looks better. How to best normalize RNA-seq data is still
>>>>>> of
>>>>>> much debate at this stage.
>>>>>>
>>>>>> 	You can try scaling methods like TMM, RPKM, or 75th percentile,
>>>>>> which
>>>>>> as
>>>>>> you said normalize data within samples. Or you can try quantile
>>>>>> between-sample normalization (read counts should be adjusted by gene
>>>>>> length first), which performs normalization across samples. You can
>>>>>> try
>>>>>> all these in edgeR package.
>>>>>>
>>>>>> 	From my experience, I actually found the quantile method performed
>>>>>> better
>>>>>> for my RNA-seq data. I used general linear model and likelihood
>>>>>> ratio
>>>>>> test in edgeR in my analysis.
>>>>>>
>>>>>> 	Hope this helps.
>>>>>>
>>>>>> Cheers,
>>>>>> Wei
>>>>>>
>>>>>> On Apr 28, 2011, at 7:36 PM, João Moura wrote:
>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>>
>>>>>>> Until now I was doing RNAseq DE analysis and to do that I
>>>>>>> understand
>>>>>>> that
>>>>>>> normalization issues only matter inside samples, because one can
>>>>>>> assume
>>>>>>> the
>>>>>>> length/content biases will cancel out when comparing same genes in
>>>>>>> different
>>>>>>> samples.
>>>>>>> Although, I'm now trying to compare correlation of different genes
>>>>>>> and
>>>>>>> so,
>>>>>>> this biases should be taken into account - for this is there any
>>>>>>> better
>>>>>>> method than RPKM?
>>>>>>>
>>>>>>> My main doubt is if I should also take into acount the biases
>>>>>>> inside
>>>>>>> samples
>>>>>>> and to do that is there any better approach then TMM by Robinson
>>>>>>> and
>>>>>>> Oshlack
>>>>>>> [2010]?
>>>>>>>
>>>>>>> Thank you all,
>>>>>>> --
>>>>>>> João Moura
>>>>>>>
>>>>>>> 	[[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>>
>>>>>> ______________________________________________________________________
>>>>>> The information in this email is confidential and
>>>>>> intend...{{dropped:6}}
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>>
>>>> --------------------------------------------------
>>>> Davis J McCarthy
>>>> Research Technician
>>>> Bioinformatics Division
>>>> Walter and Eliza Hall Institute of Medical Research
>>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>> dmccarthy at wehi.edu.au
>>>> http://www.wehi.edu.au
>>>
>>>
>>> ______________________________________________________________________
>>> The information in this email is confidential and intended solely for
>>> the
>>> addressee.
>>> You must not disclose, forward, print or use it without the permission
>>> of
>>> the sender.
>>> ______________________________________________________________________
>>>
>>
>>
>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:8}}



More information about the Bioconductor mailing list