[BioC] RNASeq: normalization issues

Mon May 2 12:36:05 CEST 2011

Hi,

Just to get back to the original question, I tend to agree with Wolfgang.
If you are looking for correlations between genes, the correlations will
be length biased, with pairs of longer genes getting higher correlation
values. This is a different problem from testing for differential
expression. I believe we can correct the correlation estimate itself
rather than correcting the expression values with something like RPKM.
I believe that using RPKM will not remove length bias in correlations
between genes, just as it does not remove length bias in DE testing. We
are currently working on a way to correct correlations.
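
As a rough illustration of the point above, here is a minimal R sketch. It
is a toy simulation only, not the correction itself. It assumes Poisson
counts, equal library sizes, and made-up gene lengths and read rates, and
every name in it is hypothetical. Both genes in a pair follow the same
underlying activity, so any difference in the observed correlation comes
from count depth alone:

```r
set.seed(1)
n.samples <- 2000

# Shared underlying activity per sample (mean 1); both genes in a pair follow it
activity <- rgamma(n.samples, shape = 4, rate = 4)

# Simulate Poisson counts for two co-regulated genes of the same length
simulate.pair <- function(gene.length, reads.per.kb = 2) {
  mu <- reads.per.kb * gene.length / 1000 * activity
  rbind(a = rpois(n.samples, mu), b = rpois(n.samples, mu))
}

short.pair <- simulate.pair(500)    # low counts, noise dominates
long.pair <- simulate.pair(10000)   # high counts, signal dominates

cor.short <- cor(short.pair["a", ], short.pair["b", ])
cor.long <- cor(long.pair["a", ], long.pair["b", ])

# Per-kb scaling (RPKM up to a constant when library sizes are equal)
short.per.kb <- short.pair * 1000 / 500
cor.short.per.kb <- cor(short.per.kb["a", ], short.per.kb["b", ])

print(c(short = cor.short, long = cor.long, short.per.kb = cor.short.per.kb))
```

Pearson correlation does not change when each gene is rescaled by a
positive constant, so under these assumptions the per-kb values give the
same correlation as the raw counts, while the longer pair shows the
higher correlation.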

Cheers,
Alicia

> Date: Sun, 1 May 2011 21:34:43 -0400 (EDT)
> From: ywchen at jimmy.harvard.edu
> To: "Wei Shi" <shi at wehi.EDU.AU>
> Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
> Subject: Re: [BioC] RNASeq: normalization issues
> Message-ID:
> 	<51046.155.52.45.41.1304300083.squirrel at roaming.dfci.harvard.edu>
> Content-Type: text/plain;charset=iso-8859-1
>
> Thanks.
>> Hi Yiwen:
>>
>> It is a single factor experiment with six libraries. There were four
>> cell types in this experiment, one of which had three replicates; the
>> others did not have replicates.
>>
>> Cheers,
>> Wei
>>
>> On May 2, 2011, at 11:21 AM, ywchen at jimmy.harvard.edu wrote:
>>
>>> Hi Wei and Davis,
>>>
>>> Thank you so much for such detailed explanations! Now it is very
>>> clear. In the case where you found a benefit from using quantile
>>> normalization+GLM+LRT, was it single factor data with many libraries
>>> or multiple factor data?
>>>
>>> Yiwen
>>>
>>>
>>>> Hi Yiwen:
>>>>
>>>> As Davis said, the "length+quantile" method I mentioned in my
>>>> previous messages is not the "quantile normalization" option of the
>>>> calcNormFactors function in edgeR. That is why you didn't see any
>>>> gene length adjustment with that function.
>>>>
>>>> Adjusting read counts by gene length (total exon length) puts all
>>>> genes on the same baseline within a sample (longer transcripts
>>>> produce more reads), and quantile between-sample normalization
>>>> makes all samples have the same read count distribution (so the
>>>> library sizes become the same as well). This is what I mean by
>>>> "length+quantile" normalization. The quantile normalization here is
>>>> the same quantile normalization that is applied to microarray data,
>>>> but it enters the analysis of sequencing data in a different way
>>>> (as offsets in the generalized linear model).
>>>>
>>>> Now let me elaborate on how to do this normalization. Suppose you
>>>> have a read count matrix "x" with rows being genes and columns
>>>> being samples. Also suppose you have a numeric vector "gene.length"
>>>> which holds the total exon length of each gene, in the same gene
>>>> order as the rows of "x". The following line of code yields the
>>>> number of reads per 1000 bases for each gene:
>>>>
>>>> x1 <- x*1000/gene.length
>>>>
>>>> Now perform quantile normalization for gene length adjusted data:
>>>>
>>>> library(limma)
>>>> x2 <- normalizeBetweenArrays(x1,method="quantile")
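
To make the quantile step above concrete, here is a minimal base R sketch
of what quantile normalization does to a matrix. It is not limma's
implementation: the helper name and the toy matrix are hypothetical, ties
are broken by order of appearance (limma may treat them differently), and
missing values are not handled.

```r
# Hypothetical helper: force every column (sample) to share one distribution
quantileNormalizeSketch <- function(m) {
  ranks <- apply(m, 2, rank, ties.method = "first")  # rank of each gene within its sample
  sorted <- apply(m, 2, sort)                        # each column sorted ascending
  reference <- rowMeans(sorted)                      # mean of the k-th smallest values
  out <- apply(ranks, 2, function(r) reference[r])   # map ranks back to reference values
  dimnames(out) <- dimnames(m)
  out
}

# Toy length-adjusted values: 4 genes (rows) by 3 samples (columns)
toy <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))

toy.norm <- quantileNormalizeSketch(toy)
print(toy.norm)
```

After this step every column holds the same set of values, only in a
different gene order, which is why the column sums (library sizes) end up
identical as well.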
>>>>
>>>> Suppose x has two columns named "wt" and "ko". Create a design
>>>> matrix:
>>>>
>>>> snames <- factor(c("wt","ko"))
>>>> design <- model.matrix(~snames)
>>>>
>>>> Now get the offsets for each gene in each sample. The offsets are
>>>> the log-scale differences between the raw counts and the
>>>> normalized values.
>>>>
>>>> library(edgeR)
>>>> y <- DGEList(counts=x,group=colnames(x))
>>>> lowcounts <- rowSums(x)<5
>>>> offset <- log(x[!lowcounts,]+0.1)-log(x2[!lowcounts,]+0.1)
>>>> yf <- y[!lowcounts,]
>>>>
>>>> Fit generalized linear models to the read count data, with the
>>>> offsets included:
>>>>
>>>> y.glm <- estimateCRDisp(y=yf, design=design, offset=offset,
>>>>                         trend=TRUE, tagwise=TRUE)
>>>> fit <- glmFit(y=y.glm, design=design,
>>>>               dispersion=y.glm$CR.tagwise.dispersion, offset=offset)
>>>>
>>>> Perform likelihood ratio tests to find differentially expressed
>>>> genes:
>>>>
>>>> DE <- glmLRT(y.glm,fit)
>>>> dt <- decideTestsDGE(DE)
>>>> summary(dt)
>>>>
>>>> Hope this will work for you!
>>>>
>>>> Cheers,
>>>> wei
>>>>
>>>>
>>>>
>>>> On May 2, 2011, at 8:52 AM, Davis McCarthy wrote:
>>>>
>>>>> Hi Yiwen
>>>>>
>>>>> The "quantile normalization" option in calcNormFactors in edgeR
>>>>> does something very different from the quantile normalization
>>>>> (microarray-style) that Wei has been discussing.
>>>>>
>>>>> The quantile normalization in calcNormFactors computes an offset
>>>>> for sequencing library depth after Bullard et al (2010) [1]. This
>>>>> is an approach in the same vein as TMM normalization [2] or scaled
>>>>> median [3].
>>>>>
>>>>> I believe that the approach that Wei is suggesting is more similar
>>>>> to the quantile normalization approach that has been taken with
>>>>> microarray data, adjusting the data so that the response follows
>>>>> the same distribution across (in this context) sequenced
>>>>> libraries. This will typically result in non-integer data from
>>>>> adjusting counts, but count-based methods could still be used if
>>>>> this quantile normalization were treated as an offset for each
>>>>> observation in (e.g.) a generalized linear model.
>>>>>
>>>>> Cheers
>>>>> Davis
>>>>>
>>>>>
>>>>> [1] http://www.biomedcentral.com/1471-2105/11/94
>>>>> [2] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2864565/?tool=pubmed
>>>>> [3] http://genomebiology.com/2010/11/10/R106#B13
>>>>>
>>>>>
>>>>>> Hi Wei,
>>>>>>
>>>>>> Could you elaborate on how to appropriately do
>>>>>> gene-length-adjusted quantile normalization in edgeR? The
>>>>>> "quantile normalization" option in the calcNormFactors function
>>>>>> does not seem to take the gene length into account.
>>>>>>
>>>>>> Thanks.
>>>>>> Yiwen
>>>>>>> Hi João:
>>>>>>>
>>>>>>> Maybe you can try different normalization methods on your data
>>>>>>> to see which one looks better. How best to normalize RNA-seq
>>>>>>> data is still much debated at this stage.
>>>>>>>
>>>>>>> You can try scaling methods like TMM, RPKM, or 75th percentile,
>>>>>>> which as you said normalize data within samples. Or you can try
>>>>>>> quantile between-sample normalization (read counts should be
>>>>>>> adjusted by gene length first), which performs normalization
>>>>>>> across samples. You can try all of these in the edgeR package.
>>>>>>>
>>>>>>> In my experience, the quantile method actually performed better
>>>>>>> for my RNA-seq data. I used generalized linear models and
>>>>>>> likelihood ratio tests in edgeR in my analysis.
>>>>>>>
>>>>>>> 	Hope this helps.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Wei
>>>>>>>
>>>>>>> On Apr 28, 2011, at 7:36 PM, João Moura wrote:
>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>>
>>>>>>>> Until now I was doing RNAseq DE analysis, and for that I
>>>>>>>> understand that normalization issues only matter inside
>>>>>>>> samples, because one can assume the length/content biases will
>>>>>>>> cancel out when comparing the same genes in different samples.
>>>>>>>> However, I'm now trying to compare the correlation of different
>>>>>>>> genes, so these biases should be taken into account. For this,
>>>>>>>> is there any better method than RPKM?
>>>>>>>>
>>>>>>>> My main doubt is whether I should also take into account the
>>>>>>>> biases inside samples, and to do that is there any better
>>>>>>>> approach than TMM by Robinson and Oshlack [2010]?
>>>>>>>>
>>>>>>>> Thank you all,
>>>>>>>> --
>>>>>>>> João Moura
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --------------------------------------------------
>>>>> Davis J McCarthy
>>>>> Research Technician
>>>>> Bioinformatics Division
>>>>> Walter and Eliza Hall Institute of Medical Research
>>>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>>> dmccarthy at wehi.edu.au
>>>>> http://www.wehi.edu.au
>>>>
>>>>
>>>> ______________________________________________________________________
>>>> The information in this email is confidential and intended solely
>>>> for the addressee.
>>>> You must not disclose, forward, print or use it without the
>>>> permission of the sender.
>>>> ______________________________________________________________________
>>>>
>>>
>>>
>>
>>
>
>
>
> ------------------------------
>
>
>
> End of Bioconductor Digest, Vol 99, Issue 2
> *******************************************
>
