[BioC] Tissue heterogeneity and TMM normalization
fengni99 at gmail.com
Tue Sep 9 19:38:39 CEST 2014
Thank you Davide.
I'll definitely give it a try and let you know if I bump into any problems.
In addition, as a follow up to Wolfgang Huber's suggestion, I've attached a
graph showing the tissue comparisons after normalizing based on CEG derived
normalization factors. I will try these other normalization methods people
have suggested until I feel confident about the skew.
On Tue, Sep 9, 2014 at 12:48 PM, davide risso <risso.davide at gmail.com> wrote:
> Hi Jenny,
> you may also want to have a look at our new RUVSeq package. In
> particular, you can use the RUVg function to estimate factors of
> "unwanted variation" (UV) using the CEG genes as "negative controls."
> This is not equivalent to estimating the TMM normalization factors on a
> subset of genes (which doesn't work too well in our experience),
> because our UV factors are included in the model with some parameters
> (coefficients) that are then re-estimated for all the genes. Have a
> look at the vignette of the RUVSeq package for details and let me know if
> you have questions.
> On Tue, Sep 9, 2014 at 7:30 AM, Ni Feng <fengni99 at gmail.com> wrote:
> > Thank you Wolfgang!
> > We are using fold change >4 and FDR corrected P value of <0.001 as
> > thresholds for calling differential expression, do you think this is
> > stringent enough given our skew?
> > It was hard for me to gauge just how bad the skew is and that was another
> > thing I wanted to get an opinion on.
> > Yesterday I took out lowly expressed transcripts (<0.1 FPKM in any
> > which gave me a small dispersion value akin to what Trinity uses as
> > (0.1), but using the normalization factors from this dataset did not
> > improve the skew. Given what Ryan Thompson said earlier I guess this makes
> > sense.
> > I had only used CEGs to calculate the dispersion, but will try to get the
> > normalization factors from them and see how well it works. Thanks for the
> > suggestion!
> > If this doesn't work, I'll try the quantile normalization.
> > Thanks again for your help!
> > Jenny
> > ---------- Forwarded message ----------
> > From: Wolfgang Huber <whuber at embl.de>
> > Date: Tue, Sep 9, 2014 at 3:58 AM
> > Subject: Re: [BioC] Tissue heterogeneity and TMM normalization
> > To: Ni Feng <fengni99 at gmail.com>
> > Cc: bioconductor at r-project.org
> > Hi Ni
> > The assumption that ‘most genes are not differentially expressed’ is a
> > sufficient condition that one can use to prove that the estimated
> > normalisation factor is close to the true one, under some model. It is not
> > a necessary assumption; TMM or similar normalisations can still be useful
> > beyond it (e.g. if many genes are d.e. but up and down are about balanced; etc.).
> > Did you try computing the normalisation parameters from the CEG genes
> > and then applying to all data?
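
A minimal Python sketch of the suggestion above: estimate per-sample scaling factors from control genes only, then apply them to every gene. To stay self-contained it swaps in a median-of-ratios estimator in place of TMM. The function names and toy counts are hypothetical and do not come from the thread.

```python
import math
from statistics import median


def control_gene_size_factors(counts, control_genes):
    """Estimate one size factor per sample from control genes only.

    counts: dict mapping gene name -> list of raw counts (one per sample).
    control_genes: names of genes assumed not to be differentially expressed.
    Uses a median-of-ratios estimator (an assumption of this sketch, not TMM).
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene in control_genes:
        row = counts[gene]
        if min(row) <= 0:
            continue  # the geometric mean is undefined when a count is zero
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no control gene has positive counts in every sample")
    return [math.exp(median(r)) for r in log_ratios]


def apply_size_factors(counts, size_factors):
    """Divide every gene (control or not) by the per-sample size factors."""
    return {
        gene: [c / f for c, f in zip(row, size_factors)]
        for gene, row in counts.items()
    }


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = {
    "ceg1": [10, 20],
    "ceg2": [100, 200],
    "ceg3": [50, 100],
    "geneX": [5, 100],  # genuinely different between the two samples
}
factors = control_gene_size_factors(toy_counts, ["ceg1", "ceg2", "ceg3"])
normalized = apply_size_factors(toy_counts, factors)
```

In this toy case the control genes come out equal across the two samples after scaling, while geneX keeps a tenfold difference. Whether control-derived factors behave this well on real data depends on the controls truly being stable.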
> > An interesting idea was put forward by J. Li, D. M. Witten, I. M. Johnstone,
> > and R. Tibshirani: Normalization, testing, and false discovery rate
> > estimation for RNA-sequencing data. Biostatistics, 13:523 (2012) —
> > www.biostat.washington.edu/~dwitten/Papers/LiWittenJohnstoneTibs.pdf
> > They determine the normalisation factor so as to minimize the amount of
> > differential expression.
> > (This is one instance of this idea I am aware of; it has been put forward
> > for microarrays before. Apologies to anyone else who proposed this.)
> > Also, if I understood your plots correctly, the biases are relatively small
> > in amplitude. So you could leave them there, but apply a banded hypothesis
> > test (i.e. H0: |beta| < theta) rather than H0: beta=0, where beta is the log
> > fold change and theta a positive number. This is, e.g., described in the
> > DESeq2 vignette.
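
A rough Python sketch of what such a banded test computes, assuming a per-gene log2 fold change estimate with a standard error and a normal approximation. It illustrates the idea only and is not DESeq2's implementation. The function name and the numbers are hypothetical.

```python
import math


def banded_wald_pvalue(beta_hat, se, theta):
    """P-value for H0: |beta| <= theta against H1: |beta| > theta.

    beta_hat: estimated log2 fold change; se: its standard error;
    theta: non-negative threshold on the log2 scale.
    Assumes beta_hat is approximately normal (an assumption of this sketch).
    """
    if se <= 0 or theta < 0:
        raise ValueError("se must be positive and theta non-negative")
    # Distance of the estimate beyond the band, in standard errors;
    # estimates inside the band get z = 0 and hence a p-value of 1.
    z = max((abs(beta_hat) - theta) / se, 0.0)
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) for standard normal Z
    return min(1.0, 2.0 * upper_tail)


# theta = 1 means "only call genes whose fold change exceeds 2".
p_inside = banded_wald_pvalue(beta_hat=0.5, se=0.2, theta=1.0)   # estimate inside the band
p_outside = banded_wald_pvalue(beta_hat=3.0, se=0.5, theta=1.0)  # estimate well outside it
```

With theta = 0 this reduces to the usual two-sided Wald test. The fold change cutoff of 4 mentioned earlier in the thread would correspond to theta = 2 on the log2 scale. In DESeq2 itself, the `results()` function takes `lfcThreshold` and `altHypothesis` arguments for this kind of test.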
> > Best wishes
> > Wolfgang
> > On 08 Sep 2014, at 18:15, Ni Feng <fengni99 at gmail.com> wrote:
> >> Dear all,
> >> I have a general question about whether TMM normalization is appropriate
> >> for my data. I apologize for this long-winded email. I am not a trained
> >> bioinformatician and therefore have been struggling with some data
> >> analysis.
> >> A colleague and I did an RNA-seq experiment with 6 samples (each had RNA
> >> pooled from 6 individuals) and no biological replicates. The 6 samples
> >> included 2 tissue types collected at 3 different time points. I know
> >> this is not an ideal experimental set-up; we did this experiment 3 years
> >> ago.
> >> We used the Trinity package to do most of the transcriptome assembly and
> >> downstream analyses, such as leveraging edgeR for differential expression.
> >> Naively I went on with all downstream analyses without verifying whether my
> >> data violated underlying assumptions of TMM normalization.
> >> For example, we found ~30% of our transcripts showed differential
> >> expression between any 2 pairwise comparisons. Does this violate the TMM
> >> assumption that most genes are NOT differentially expressed?
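
To make the assumption in question concrete, here is a simplified Python sketch of a trimmed mean of M-values. It is unweighted and trims only on the log ratios, whereas edgeR's TMM also trims on abundance and uses precision weights. The function name and toy counts are hypothetical.

```python
import math


def trimmed_mean_of_m(sample, reference, trim=0.3):
    """Simplified TMM-style scaling factor of `sample` relative to `reference`.

    sample, reference: lists of raw counts for the same genes.
    trim: fraction of M-values dropped from EACH tail (must be < 0.5).
    """
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    n_s, n_r = sum(sample), sum(reference)
    # M-value: log2 ratio of library-size-scaled counts, for genes seen in both.
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    if not m_values:
        raise ValueError("no gene has positive counts in both samples")
    k = int(len(m_values) * trim)
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


reference = [100] * 10

# 3 of 10 genes are strongly up: they fall in the trimmed upper tail.
few_up = [100] * 7 + [1600] * 3
factor_few = trimmed_mean_of_m(few_up, reference)

# 6 of 10 genes are up: some survive trimming and distort the factor.
many_up = [100] * 4 + [1600] * 6
factor_many = trimmed_mean_of_m(many_up, reference)
```

In the first toy case the effective library size (`factor_few * sum(few_up)`) matches the reference, because every changed gene is trimmed away. In the second it does not. This only shows the mechanism. It does not say whether a real data set with about 30% differential expression, possibly in both directions, is affected.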
> >> Furthermore, we noticed that there is still a tissue bias after
> >> normalization. Attached is a scatterplot of TMM normalized values for each
> >> tissue (summed across 3 sample groups for each tissue). Plotted in black on
> >> top of all transcripts are CEG (Core Eukaryotic Genes) expression values,
> >> which we believe should be good candidates for "housekeeping" genes. Both
> >> CEGs and all genes show that at higher expression levels, there is a skew
> >> towards one tissue ("VMN"), whereas in the middle values, there is a skew
> >> towards the other tissue ("H").
> >> I have also attached a density plot of the M values, and an MA plot to
> >> visualize the skew. These plots were generated from 1 pair of tissue
> >> comparisons ("SMH" vs "SMV").
> >> These plots reflect the fact that one tissue is more heterogeneous than the
> >> other. Although TMM normalization is designed to deal with this problem,
> >> our data seem to need further normalization. Our within-tissue comparisons
> >> are great and do not show this kind of skew. My questions are:
> >> 1) Do our data violate TMM normalization assumptions?
> >> 2) Do you have another normalization method to suggest for our data?
> >> 3) Should we just forget about tissue comparisons?
> >> I have also played around with the suggestions about estimating a
> >> dispersion value based on the edgeR user guide. Can discuss this
> >> Thank you for your time and patience, and any advice is much appreciated.
> >> --
> >> Ni (Jenny) Ye Feng
> >> Ph.D. Candidate
> >> Bass Laboratory
> >> Cornell University
> >> Dept of Neurobiology and Behavior
> >> Ithaca, NY 14853
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> Davide Risso, PhD
> Post Doctoral Scholar
> Department of Statistics
> University of California, Berkeley
> 344 Li Ka Shing Center, #3370
> Berkeley, CA 94720-3370
> E-mail: davide.risso at berkeley.edu
Ni (Jenny) Ye Feng
Dept of Neurobiology and Behavior
Ithaca, NY 14853
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 93911 bytes
Desc: not available