[BioC] Tissue heterogeneity and TMM normalization

Tue Sep 9 19:38:39 CEST 2014

Thank you, Davide.
I'll definitely give it a try and let you know if I run into any
questions.

In addition, as a follow-up to Wolfgang Huber's suggestion, I've attached a
graph showing the tissue comparisons after normalizing with CEG-derived
normalization factors. I will keep trying the other normalization methods
people have suggested until I feel confident about the skew.
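
For readers following along, here is a minimal sketch of what "CEG-derived
normalization factors" could look like. It is plain Python, not edgeR, and it
is a simplified TMM: the log ratios (M values) are computed on the control
genes only, while the library sizes still come from all genes. The function
names `tmm_factor` and `control_tmm_factors`, the trimming defaults, and the
toy counts are assumptions made for illustration, and the factors are not
rescaled to multiply to one.

```python
import math


def tmm_factor(obs, ref, lib_obs, lib_ref, m_trim=0.3, a_trim=0.05):
    """Simplified TMM scaling factor for one sample against a reference.

    obs, ref: counts for the same genes in the sample and the reference.
    lib_obs, lib_ref: total library sizes (all genes, not just those passed in).
    """
    rows = []
    for o, r in zip(obs, ref):
        if o == 0 or r == 0:
            continue  # log ratios are undefined for zero counts
        p_obs, p_ref = o / lib_obs, r / lib_ref
        m = math.log2(p_obs / p_ref)                     # log fold change (M)
        a = 0.5 * (math.log2(p_obs) + math.log2(p_ref))  # mean log abundance (A)
        # Approximate variance of m (delta method); its inverse is the weight.
        var = (lib_obs - o) / (lib_obs * o) + (lib_ref - r) / (lib_ref * r)
        rows.append((m, a, 1.0 / var))

    n = len(rows)
    by_m = sorted(range(n), key=lambda i: rows[i][0])
    by_a = sorted(range(n), key=lambda i: rows[i][1])
    cut_m, cut_a = int(n * m_trim), int(n * a_trim)
    # Keep genes that survive trimming of both the M and the A tails.
    keep = set(by_m[cut_m:n - cut_m]) & set(by_a[cut_a:n - cut_a])
    if not keep:
        raise ValueError("no genes left after trimming")

    total_w = sum(rows[i][2] for i in keep)
    mean_m = sum(rows[i][2] * rows[i][0] for i in keep) / total_w
    return 2.0 ** mean_m


def control_tmm_factors(counts, control_rows, ref_col=0):
    """TMM factors per sample, estimated from control genes only.

    counts: list of genes, each a list of per-sample counts.
    control_rows: indices of the control genes (e.g. CEGs) in `counts`.
    """
    n_samples = len(counts[0])
    lib = [sum(gene[j] for gene in counts) for j in range(n_samples)]
    ctrl = [counts[i] for i in control_rows]
    ref = [gene[ref_col] for gene in ctrl]
    return [
        tmm_factor([gene[j] for gene in ctrl], ref, lib[j], lib[ref_col])
        for j in range(n_samples)
    ]


# Toy data: 10 control genes with equal counts in both samples, and 10 other
# genes that are three times higher in the second sample.
counts = [[100, 100] for _ in range(10)] + [[100, 300] for _ in range(10)]
factors = control_tmm_factors(counts, control_rows=range(10))
effective_lib = [sum(g[j] for g in counts) * f for j, f in enumerate(factors)]
print(factors, effective_lib)
```

In this toy case the second sample gets a factor of about 0.5, so the two
effective library sizes (library size times factor) come out equal, which is
what one would want if the control genes really are unchanged.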

Best,
Jenny

On Tue, Sep 9, 2014 at 12:48 PM, davide risso <risso.davide at gmail.com>
wrote:

> Hi Jenny,
>
> you may also want to have a look at our new RUVSeq package. In
> particular, you can use the RUVg function to estimate factors of
> "unwanted variation" (UV) using the CEG genes as "negative controls."
>
> This is not equivalent to estimating the TMM normalization factors on a
> subset of genes (which doesn't work too well in our experience),
> because our UV factors are included in the model with parameters
> (coefficients) that are then re-estimated for all the genes. Have a
> look at the RUVSeq package vignette for details and let me know if
> you have questions.
>
> Best,
> davide
>
>
>
>
> On Tue, Sep 9, 2014 at 7:30 AM, Ni Feng <fengni99 at gmail.com> wrote:
> > Thank you Wolfgang!
> > We are using a fold change >4 and an FDR-corrected P value <0.001 as
> > thresholds for calling differential expression. Do you think this is
> > stringent enough given our skew?
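
A minimal sketch of how these two cut-offs combine, assuming per-gene log2
fold changes and raw p-values are already available from some test. The
function names `bh_adjust` and `call_de` and the example numbers are made up
for illustration; the FDR correction shown is the Benjamini-Hochberg
adjustment, which is an assumption about which correction is meant.

```python
import math


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for k, i in enumerate(order):
        rank = n - k  # rank of pvalues[i] when sorted ascending (1-based)
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def call_de(log2_fc, pvalues, min_fold=4.0, max_fdr=0.001):
    """Flag genes passing both the fold-change and the FDR threshold."""
    cutoff = math.log2(min_fold)  # fold change > 4 means |log2 FC| > 2
    fdr = bh_adjust(pvalues)
    return [abs(lfc) > cutoff and q < max_fdr for lfc, q in zip(log2_fc, fdr)]


calls = call_de([3.0, 1.0, -2.5, 2.2], [1e-6, 1e-6, 2e-5, 0.01])
print(calls)
```

This only shows how the two thresholds are applied together. It says nothing
about whether they compensate for a skew, since both are computed after
normalization.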
> >
> > It was hard for me to gauge just how bad the skew is and that was another
> > thing I wanted to get an opinion on.
> >
> > Yesterday I took out lowly expressed transcripts (<0.1 FPKM in any
> > sample), which gave me a small dispersion value akin to what Trinity
> > uses as default (0.1), but using the normalization factors from this
> > dataset did not improve the skew. Given what Ryan Thompson said earlier,
> > I guess this makes sense.
> >
> > I had only used CEGs to calculate the dispersion, but I will try to get
> > the normalization factors from them as well and see how well that works.
> > Thanks for the suggestion!
> > If this doesn't work, I'll try quantile normalization.
> >
> > Thanks again for your help!
> > Jenny
> >
> > ---------- Forwarded message ----------
> > From: Wolfgang Huber <whuber at embl.de>
> > Date: Tue, Sep 9, 2014 at 3:58 AM
> > Subject: Re: [BioC] Tissue heterogeneity and TMM normalization
> > To: Ni Feng <fengni99 at gmail.com>
> > Cc: bioconductor at r-project.org
> >
> >
> > Hi Ni
> >
> > ‘Most genes are not differentially expressed’ is a sufficient
> > assumption: under some model, one can use it to prove that the estimated
> > normalisation factor is close to the true one. It is not a necessary
> > assumption. TMM or similar normalisations can still be useful beyond it
> > (e.g. if many genes are d.e. but up and down are about balanced).
> >
> > Did you try computing the normalisation parameters from the CEG genes
> > only and then applying them to all the data?
> >
> > An interesting idea was put forward by J. Li, D. M. Witten, I. M.
> > Johnstone and R. Tibshirani: Normalization, testing, and false discovery
> > rate estimation for RNA-sequencing data. Biostatistics, 13:523 (2012),
> > www.biostat.washington.edu/~dwitten/Papers/LiWittenJohnstoneTibs.pdf
> > They determine the normalisation factor so as to minimize the amount of
> > differential expression.
> > (This is one instance of this idea that I am aware of. It has been put
> > forward for microarrays before; apologies to anyone else who proposed
> > it.)
> >
> > Also, if I understood your plots correctly, the biases are relatively
> > small in amplitude. So you could leave them there, but apply a banded
> > hypothesis test (i.e. H0: |beta| < theta rather than H0: beta = 0, where
> > beta is the log fold change and theta a positive number). This is, e.g.,
> > described in the DESeq2 vignette.
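
A minimal sketch of such a banded test, assuming a log2 fold change estimate
`beta` and its standard error `se` are already available from a fitted model,
and that a normal approximation is acceptable. This is one common Wald-type
construction written in plain Python; the function names are made up for
illustration and it is not claimed to be DESeq2's exact implementation.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def banded_wald_pvalue(beta, se, theta):
    """Wald-type p-value for H0: |beta| <= theta against H1: |beta| > theta."""
    if se <= 0 or theta < 0:
        raise ValueError("se must be positive and theta non-negative")
    # Distance of the estimate from the edge of the band, in standard errors.
    z = (abs(beta) - theta) / se
    return min(1.0, 2.0 * normal_upper_tail(z))


# Same estimate and standard error, tested against a widening band.
for theta in (0.0, 0.5, 1.0):
    print(theta, banded_wald_pvalue(beta=1.5, se=0.25, theta=theta))
```

With theta = 0 this reduces to the usual two-sided test of beta = 0. As theta
grows, the same estimate gives a larger p-value, so a small systematic bias
inside the band no longer produces a call on its own.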
> >
> > Best wishes
> >         Wolfgang
> >
> >
> > On 08 Sep 2014, at 18:15, Ni Feng <fengni99 at gmail.com> wrote:
> >
> >> Dear all,
> >> I have a general question about whether TMM normalization is appropriate
> >> for my data. I apologize for this long-winded email. I am not a trained
> >> bioinformatician and have therefore been struggling with some of the data
> >> analysis.
> >>
> >> A colleague and I did an RNA-seq experiment with 6 samples (each had RNA
> >> pooled from 6 individuals) and no biological replicates. The 6 samples
> >> included 2 tissue types collected at 3 different time points. I know
> >> that this is not an ideal experimental set-up; we did this experiment
> >> 3 years ago.
> >>
> >> We used the Trinity package to do most of the transcriptome assembly and
> >> downstream analyses, such as leveraging edgeR for differential
> >> expression. Naively, I went on with all downstream analyses without
> >> verifying whether my data violated the underlying assumptions of TMM
> >> normalization.
> >>
> >> For example, we found ~30% of our transcripts showed differential
> >> expression between any 2 pairwise comparisons. Does this violate the TMM
> >> assumption that most genes are NOT differentially expressed?
> >>
> >> Furthermore, we noticed that there is still a tissue bias after
> >> normalization. Attached is a scatterplot of TMM-normalized values for
> >> each tissue (summed across the 3 sample groups for each tissue). Plotted
> >> in black on top of all transcripts is the expression of CEGs (Core
> >> Eukaryotic Genes), which we believe should be good candidates for
> >> "housekeeping" genes. Both the CEGs and all genes show that at higher
> >> expression levels there is a skew towards one tissue ("VMN"), whereas in
> >> the middle values there is a skew towards the other tissue ("H").
> >>
> >> I have also attached a density plot of the M values and an MA plot to
> >> visualize the skew. These plots were generated from 1 pair of tissue
> >> comparisons ("SMH" vs "SMV").
> >>
> >> These plots reflect the fact that one tissue is more heterogeneous than
> >> the other. Although TMM normalization is designed to deal with this
> >> problem, our data seem to need further normalization. Our within-tissue
> >> comparisons are great and do not show this kind of skew. My questions
> >> are:
> >>
> >> 1) Do our data violate the TMM normalization assumptions?
> >> 2) Do you have another normalization method to suggest for our data?
> >> 3) Should we just forget about tissue comparisons?
> >>
> >> I have also played around with the suggestions in the edgeR user guide
> >> for estimating a dispersion value. I can discuss this further.
> >>
> >> Thank you for your time and patience; any advice is much appreciated.
> >>
> >> --
> >> Ni (Jenny) Ye Feng
> >> Ph.D. Candidate
> >> Bass Laboratory
> >> Cornell University
> >> Dept of Neurobiology and Behavior
> >> Ithaca, NY 14853
> >>
> >> <CEG_FPKM_over_all_090814.png>
> >> <SMV_SMH_density_log2(M).pdf>
> >> <SMH_SMV_MA_plot_0903.png>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> --
> Davide Risso, PhD
> Post Doctoral Scholar
> Department of Statistics
> University of California, Berkeley
> 344 Li Ka Shing Center, #3370
> Berkeley, CA 94720-3370
> E-mail: davide.risso at berkeley.edu
>

-- 
Ni (Jenny) Ye Feng
Ph.D. Candidate
Bass Laboratory
Cornell University
Dept of Neurobiology and Behavior
Ithaca, NY 14853
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CEG_normalized_allseqs_090914.png
Type: image/png
Size: 93911 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140909/94c14246/attachment.png>