[BioC] RNASeq: normalization issues

Alicia Oshlack oshlack at wehi.EDU.AU
Fri May 6 12:18:59 CEST 2011


Hi Joao,

I must say we are in the early stages of dealing with this issue so I'm
reluctant to give advice on an issue that we haven't fully explored yet.
So far we really only observe this issue in very large datasets. When I
have some firmer results I'll let you know. Sorry I can't be more help.

Cheers,
Alicia


> Thank you all for your opinions.
>
> Alicia, can you give me some tips on how you are thinking of doing that?
>
> Best regards,
>
> On Mon, May 2, 2011 at 12:36 PM, Alicia Oshlack <oshlack at wehi.edu.au>
> wrote:
>
>> Hi,
>>
>> Just to get back to the original question, I tend to agree with Wolfgang.
>> If you are looking for correlations between genes, the correlations will
>> be length biased, with longer gene pairs getting higher correlation
>> values. This is a different test from differential expression. I believe
>> we can correct the correlation estimation itself rather than correcting
>> the expression values with something like RPKM. I believe that RPKM will
>> not remove length bias in correlations between genes, just as it does
>> not remove length bias in DE testing. We are currently working on a way
>> to correct correlations.
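
One piece of the point about RPKM can be checked directly: a Pearson correlation does not change when each gene is divided by its own constant length. The sketch below uses made-up counts and hypothetical exon lengths (gene_a, gene_b, len_a, len_b are illustrative names, not from the thread). It only covers the length term of RPKM and is not the correction being developed.

```r
# Toy counts for two genes across six samples (made-up numbers).
gene_a <- c(120, 340, 560, 210, 480, 300)   # a long gene
gene_b <- c( 12,  30,  61,  19,  50,  28)   # a short gene
len_a <- 5000   # hypothetical total exon length in bases
len_b <- 500

# Pearson correlation on the raw counts.
r_raw <- cor(gene_a, gene_b)

# Length-scaled values (the per-gene part of RPKM): each gene is divided
# by its own positive constant, which leaves the correlation unchanged.
r_len <- cor(gene_a * 1000 / len_a, gene_b * 1000 / len_b)

all.equal(r_raw, r_len)
```

The library-size part of RPKM does vary between samples, so it can change a correlation, but it is the same for long and short genes.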
>>
>> Cheers,
>> Alicia
>>
>>
>> > Date: Sun, 1 May 2011 21:34:43 -0400 (EDT)
>> > From: ywchen at jimmy.harvard.edu
>> > To: "Wei Shi" <shi at wehi.EDU.AU>
>> > Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
>> > Subject: Re: [BioC] RNASeq: normalization issues
>> > Message-ID:
>> >       <51046.155.52.45.41.1304300083.squirrel at roaming.dfci.harvard.edu>
>> > Content-Type: text/plain;charset=iso-8859-1
>> >
>> > Thanks.
>> >> Hi Yiwen:
>> >>
>> >>      It is a single factor experiment with six libraries. There were
>> >> four cell types in this experiment, one of which had three replicates
>> >> and the others did not have replicates.
>> >>
>> >> Cheers,
>> >> Wei
>> >>
>> >> On May 2, 2011, at 11:21 AM, ywchen at jimmy.harvard.edu wrote:
>> >>
>> >>> Hi Wei and Davis,
>> >>>
>> >>> Thank you so much for such detailed explanations! Now it is very
>> >>> clear.
>> >>> In the case where you found a benefit from using quantile
>> >>> normalization+GLM+LRT, was it a single-factor experiment with many
>> >>> libraries or multi-factor data?
>> >>>
>> >>> Yiwen
>> >>>
>> >>>
>> >>>> Hi Yiwen:
>> >>>>
>> >>>>    As Davis said, the "length+quantile" method I mentioned in my
>> >>>> previous messages is not the "quantile normalization" option of the
>> >>>> calcNormFactors function in edgeR. That is why you didn't see any
>> >>>> gene length adjustment with that function.
>> >>>>
>> >>>>    Adjusting read counts using gene length (total exon length) puts
>> >>>> all genes on the same baseline within a sample (longer transcripts
>> >>>> produce more reads), and quantile between-sample normalization makes
>> >>>> all samples have the same read count distribution (so the library
>> >>>> sizes become the same as well). This is what I mean by
>> >>>> "length+quantile" normalization. The quantile normalization here is
>> >>>> the same quantile normalization that is applied to microarray data,
>> >>>> but it is applied to sequencing data in a different way (as offsets
>> >>>> in the generalized linear model).
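
To make the "same distribution, same library size" statement concrete, here is a minimal base-R sketch of microarray-style quantile normalization on length-adjusted toy counts. The helper quantile_normalize and all numbers are illustrative assumptions. It is a stand-in for the idea, not the limma implementation.

```r
# Minimal quantile normalization: every column is forced onto the same
# set of values (the mean of the k-th smallest values across columns).
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")
  target <- rowMeans(apply(m, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(m)
  out
}

# Toy count matrix: 4 genes x 3 samples (made-up numbers).
x <- matrix(c(10, 200, 35,  80,
              25, 410, 60, 150,
               5,  90, 20,  50),
            nrow = 4,
            dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))
gene.length <- c(1000, 4000, 500, 2000)   # hypothetical exon lengths

x1 <- x * 1000 / gene.length   # reads per 1000 bases (recycled over rows)
x2 <- quantile_normalize(x1)

apply(x2, 2, sort)   # identical columns: same distribution in every sample
colSums(x2)          # identical totals, i.e. equal "library sizes"
```

Each gene keeps its within-sample rank, but the values it can take are now shared by all samples.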
>> >>>>
>> >>>>    Here is how to do this normalization. Suppose you have a read
>> >>>> count matrix "x" with genes in rows and samples in columns. Also
>> >>>> suppose you have a numeric vector "gene.length" that holds the total
>> >>>> exon length of each gene, in the same gene order as the rows of "x".
>> >>>> The following line of code yields the number of reads per 1000 bases
>> >>>> for each gene:
>> >>>>
>> >>>> x1 <- x*1000/gene.length
>> >>>>
>> >>>> Now perform quantile normalization for gene length adjusted data:
>> >>>>
>> >>>> library(limma)
>> >>>> x2 <- normalizeBetweenArrays(x1,method="quantile")
>> >>>>
>> >>>> Suppose x has two columns named "wt" and "ko". Create a design
>> >>>> matrix:
>> >>>>
>> >>>> snames <- factor(c("wt","ko"))
>> >>>> design <- model.matrix(~snames)
>> >>>>
>> >>>> Now get the offsets for each gene in each sample. The offsets are the
>> >>>> log-scale differences between the raw data and the normalized data.
>> >>>>
>> >>>> library(edgeR)
>> >>>> y <- DGEList(counts=x,group=colnames(x))
>> >>>> lowcounts <- rowSums(x)<5
>> >>>> offset <- log(x[!lowcounts,]+0.1)-log(x2[!lowcounts,]+0.1)
>> >>>> yf <- y[!lowcounts,]
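
A small check of what these offsets represent, using made-up numbers: on the count scale, exp(offset) is the gene- and sample-specific factor that maps the normalized values back to the raw counts. The x2 below is a hand-written stand-in for the output of normalizeBetweenArrays, not a real result.

```r
# Toy raw counts and a stand-in for the normalized matrix x2.
x  <- matrix(c(10, 200, 0,  80,
               25, 410, 1, 150),
             nrow = 4,
             dimnames = list(paste0("gene", 1:4), c("wt", "ko")))
x2 <- matrix(c(12, 60, 0.5, 35,
               12, 60, 0.5, 35),
             nrow = 4, dimnames = dimnames(x))

lowcounts <- rowSums(x) < 5   # gene3 has 1 read in total and is dropped
offset <- log(x[!lowcounts, ] + 0.1) - log(x2[!lowcounts, ] + 0.1)

# Undo the offset: normalized values times exp(offset) give the raw counts.
back <- exp(offset) * (x2[!lowcounts, ] + 0.1) - 0.1
all.equal(back, x[!lowcounts, ])
```

The 0.1 added before taking logs only guards against log(0). The offset matrix has one row per retained gene and one column per sample.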
>> >>>>
>> >>>> Fit generalized linear models to the read count data with the
>> >>>> offsets included:
>> >>>>
>> >>>> y.glm <- estimateCRDisp(y=yf,design=design,offset=offset,trend=TRUE,
>> >>>>                         tagwise=TRUE)
>> >>>> fit <- glmFit(y=y.glm,design=design,
>> >>>>               dispersion=y.glm$CR.tagwise.dispersion,offset=offset)
>> >>>>
>> >>>> Perform likelihood ratio tests to find differentially expressed
>> >>>> genes:
>> >>>>
>> >>>> DE <- glmLRT(y.glm,fit)
>> >>>> dt <- decideTestsDGE(DE)
>> >>>> summary(dt)
>> >>>>
>> >>>> Hope this will work for you!
>> >>>>
>> >>>> Cheers,
>> >>>> wei
>> >>>>
>> >>>>
>> >>>>
>> >>>> On May 2, 2011, at 8:52 AM, Davis McCarthy wrote:
>> >>>>
>> >>>>> Hi Yiwen
>> >>>>>
>> >>>>> The "quantile normalization" option in calcNormFactors in edgeR does
>> >>>>> something very different from the quantile normalization
>> >>>>> (microarray-style) that Wei has been discussing.
>> >>>>>
>> >>>>> The quantile normalization in calcNormFactors computes an offset for
>> >>>>> sequencing library depth after Bullard et al (2010) [1]. This is an
>> >>>>> approach in the same vein as TMM normalization [2] or scaled median
>> >>>>> [3].
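
For contrast with microarray-style quantile normalization, here is a rough base-R sketch of a per-library scaling factor based on the upper quartile, in the spirit of Bullard et al. The function upper_quartile_factors is hypothetical. Dropping all-zero genes and rescaling the factors so that their product is 1 are assumptions of this sketch, not details taken from the thread or from the edgeR source.

```r
# One scaling factor per library from the 75th percentile of
# library-size-scaled counts.
upper_quartile_factors <- function(counts, p = 0.75) {
  lib.size  <- colSums(counts)
  expressed <- rowSums(counts) > 0               # ignore all-zero genes
  scaled <- sweep(counts[expressed, , drop = FALSE], 2, lib.size, "/")
  f <- apply(scaled, 2, quantile, probs = p)
  f / exp(mean(log(f)))                          # product of factors is 1
}

# Toy counts (made-up): s2 is s1 sequenced twice as deep, s3 has one
# dominant gene.
counts <- cbind(s1 = c(10,  200, 35,  80, 0),
                s2 = c(20,  400, 70, 160, 0),
                s3 = c(10, 2000, 35,  80, 0))
f <- upper_quartile_factors(counts)
f
```

The result is a single number per library, so the shape of each sample's count distribution is left alone. Only its scale is adjusted.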
>> >>>>>
>> >>>>> I believe that the approach that Wei is suggesting is more similar
>> >>>>> to the quantile normalization approach that has been taken with
>> >>>>> microarray data, adjusting the data so that the response follows the
>> >>>>> same distribution across (in this context) sequenced libraries. This
>> >>>>> will typically result in non-integer data from adjusting counts, but
>> >>>>> count-based methods could still be used if this quantile
>> >>>>> normalization were treated as an offset for each observation in
>> >>>>> (e.g.) a generalized linear model.
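
How an offset keeps the counts as integers while still adjusting for a normalization can be sketched with a plain Poisson GLM in base R. This is a simplified stand-in: the offset here is log library size rather than a per-observation quantile offset, edgeR fits negative binomial models, and all numbers are made up.

```r
# One gene, four libraries; the ko libraries are sequenced twice as deep.
counts   <- c(10, 10, 20, 20)
lib.size <- c(1e6, 1e6, 2e6, 2e6)
group    <- factor(c("wt", "wt", "ko", "ko"), levels = c("wt", "ko"))

# Without an offset the group effect absorbs the depth difference.
fit_raw <- glm(counts ~ group, family = poisson)

# With an offset, log(mean) = offset + design %*% coefficients, so the
# coefficient describes the rate per unit of depth.
fit_off <- glm(counts ~ group, family = poisson, offset = log(lib.size))

b_raw <- coef(fit_raw)[["groupko"]]   # close to log(2)
b_off <- coef(fit_off)[["groupko"]]   # close to 0
c(b_raw, b_off)
```

The response stays a raw count in both fits. Only the offset changes what the coefficient measures.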
>> >>>>>
>> >>>>> Cheers
>> >>>>> Davis
>> >>>>>
>> >>>>>
>> >>>>> [1] http://www.biomedcentral.com/1471-2105/11/94
>> >>>>> [2] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2864565/?tool=pubmed
>> >>>>> [3] http://genomebiology.com/2010/11/10/R106#B13
>> >>>>>
>> >>>>>
>> >>>>>> Hi Wei,
>> >>>>>>
>> >>>>>> Could you elaborate on how to appropriately do gene-length-adjusted
>> >>>>>> quantile normalization in edgeR? The "quantile normalization" option
>> >>>>>> in calcNormFactors function does not seem to take into account the
>> >>>>>> gene length.
>> >>>>>>
>> >>>>>> Thanks.
>> >>>>>> Yiwen
>> >>>>>>> Hi João:
>> >>>>>>>
>> >>>>>>>         Maybe you can try different normalization methods for your
>> >>>>>>> data to see which one looks better. How best to normalize RNA-seq
>> >>>>>>> data is still much debated at this stage.
>> >>>>>>>
>> >>>>>>>         You can try scaling methods like TMM, RPKM, or 75th
>> >>>>>>> percentile, which as you said normalize data within samples. Or you
>> >>>>>>> can try quantile between-sample normalization (read counts should
>> >>>>>>> be adjusted by gene length first), which performs normalization
>> >>>>>>> across samples. You can try all of these in the edgeR package.
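
For reference, RPKM (reads per kilobase of exon per million mapped reads) can be computed by hand in base R as below. The matrix x, the vector gene.length and the use of column sums as library sizes are illustrative assumptions with made-up numbers.

```r
# Toy count matrix: 4 genes x 2 samples (made-up numbers).
x <- matrix(c(10, 200, 35,  80,
              25, 410, 60, 150),
            nrow = 4,
            dimnames = list(paste0("gene", 1:4), c("s1", "s2")))
gene.length <- c(1000, 4000, 500, 2000)   # hypothetical exon lengths (bases)
lib.size <- colSums(x)                    # total reads per sample

# RPKM = count * 1e9 / (gene length in bases * library size).
# t(x) / lib.size divides each sample by its own library size; the
# final division by gene.length is recycled over rows (genes).
rpkm <- t(t(x) / lib.size) * 1e9 / gene.length
rpkm
```

Within one sample, RPKM divides every gene by the same library size and each gene by its own fixed length.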
>> >>>>>>>
>> >>>>>>>         From my experience, I actually found the quantile method
>> >>>>>>> performed better for my RNA-seq data. I used a generalized linear
>> >>>>>>> model and a likelihood ratio test in edgeR in my analysis.
>> >>>>>>>
>> >>>>>>>         Hope this helps.
>> >>>>>>>
>> >>>>>>> Cheers,
>> >>>>>>> Wei
>> >>>>>>>
>> >>>>>>> On Apr 28, 2011, at 7:36 PM, João Moura wrote:
>> >>>>>>>
>> >>>>>>>> Dear all,
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Until now I was doing RNAseq DE analysis, and for that I understand
>> >>>>>>>> that normalization issues only matter inside samples, because one
>> >>>>>>>> can assume the length/content biases will cancel out when comparing
>> >>>>>>>> the same genes in different samples.
>> >>>>>>>> However, I'm now trying to compare the correlation of different
>> >>>>>>>> genes, so these biases should be taken into account. Is there any
>> >>>>>>>> better method for this than RPKM?
>> >>>>>>>>
>> >>>>>>>> My main doubt is whether I should also take into account the biases
>> >>>>>>>> inside samples and, if so, whether there is any better approach than
>> >>>>>>>> TMM by Robinson and Oshlack [2010].
>> >>>>>>>>
>> >>>>>>>> Thank you all,
>> >>>>>>>> --
>> >>>>>>>> João Moura
>> >>>>>>>>
>> >>>>>>>> _______________________________________________
>> >>>>>>>> Bioconductor mailing list
>> >>>>>>>> Bioconductor at r-project.org
>> >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>>>>>>> Search the archives:
>> >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>>>>
>> >>>>>
>> >>>>> --------------------------------------------------
>> >>>>> Davis J McCarthy
>> >>>>> Research Technician
>> >>>>> Bioinformatics Division
>> >>>>> Walter and Eliza Hall Institute of Medical Research
>> >>>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> >>>>> dmccarthy at wehi.edu.au
>> >>>>> http://www.wehi.edu.au
>> >>>>
>> >>>>
>> >>>> ______________________________________________________________________
>> >>>> The information in this email is confidential and intended solely for
>> >>>> the addressee.
>> >>>> You must not disclose, forward, print or use it without the permission
>> >>>> of the sender.
>> >>>> ______________________________________________________________________
>> >>>>
>> >>>
>> >>>
>> >>
>> >>
>>
>>
>>
>
>
>
> --
> João Moura
>


