[BioC] Tissue heterogeneity and TMM normalization

Ni Feng fengni99 at gmail.com
Tue Sep 9 16:30:36 CEST 2014


Thank you Wolfgang!
We are using a fold change >4 and an FDR-corrected P value <0.001 as
thresholds for calling differential expression. Do you think this is
stringent enough given our skew?
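
For concreteness, here is a minimal sketch of how such a double threshold
can be applied. It assumes the per-transcript log2 fold changes and raw P
values are already available, and it uses a plain Benjamini-Hochberg
adjustment as the FDR correction. The function names are made up for
illustration and are not part of edgeR or Trinity.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def call_de(log2_fc, pvalues, min_fold=4.0, max_fdr=0.001):
    """Flag transcripts passing both the fold-change and the FDR cutoff."""
    fdr = benjamini_hochberg(pvalues)
    min_lfc = math.log2(min_fold)  # fold change > 4 means |log2 FC| > 2
    return [abs(l) > min_lfc and q < max_fdr for l, q in zip(log2_fc, fdr)]
```

Note that the fold-change cutoff here is a filter applied after testing; it
does not by itself make the test account for a systematic shift in the fold
changes.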

It was hard for me to gauge just how bad the skew is and that was another
thing I wanted to get an opinion on.

Yesterday I took out lowly expressed transcripts (<0.1 FPKM in any sample),
which gave me a small dispersion value akin to what Trinity uses as default
(0.1), but using the normalization factors from this dataset did not
improve the skew. Given what Ryan Thompson said earlier I guess this makes
sense.
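
A minimal sketch of the expression filter mentioned above, assuming the
FPKM values sit in a dict keyed by transcript ID (the data layout and the
function name are hypothetical). "In any sample" is read here as: drop a
transcript if it falls below the cutoff in at least one sample.

```python
def filter_low_expression(fpkm, min_fpkm=0.1):
    """Keep transcripts whose FPKM is >= min_fpkm in every sample.

    fpkm: dict mapping transcript ID -> list of FPKM values, one per sample.
    """
    return {tid: vals for tid, vals in fpkm.items() if min(vals) >= min_fpkm}
```

Under the other reading (drop only transcripts that are low in all
samples), min(vals) would become max(vals).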

I had only used CEGs to calculate the dispersion, but will try to get the
normalization factors from them and see how well it works. Thanks for the
suggestion!
If this doesn't work, I'll try quantile normalization.

Thanks again for your help!
Jenny

---------- Forwarded message ----------
From: Wolfgang Huber <whuber at embl.de>
Date: Tue, Sep 9, 2014 at 3:58 AM
Subject: Re: [BioC] Tissue heterogeneity and TMM normalization
To: Ni Feng <fengni99 at gmail.com>
Cc: bioconductor at r-project.org


Hi Ni

The ‘most genes are not differentially expressed’ condition is a sufficient
assumption that one can use to prove that the estimated normalisation
factor is close to the true one, under some model. It is not a necessary
assumption: TMM or similar normalisations can still be useful beyond it
(e.g. if many genes are d.e. but up and down are about balanced).

Did you try computing the normalisation parameters from the CEG genes only
and then applying them to all data?
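
A rough sketch of that idea, under stated assumptions: raw counts are held
in a dict keyed by gene ID, one sample serves as the reference, and the
scaling factor is an unweighted trimmed mean of log ratios over the control
genes only. This is a simplification in the spirit of TMM, not edgeR's
calcNormFactors (no trimming on A values, no precision weights, no
rescaling of the factors). All function names are hypothetical.

```python
import math


def trimmed_mean(values, trim=0.3):
    """Mean after dropping a fraction `trim` of values from each tail."""
    if not values:
        raise ValueError("no usable control genes")
    vals = sorted(values)
    k = int(len(vals) * trim)
    kept = vals[k:len(vals) - k] if len(vals) > 2 * k else vals
    return sum(kept) / len(kept)


def control_norm_factors(counts, control_ids, ref=0, trim=0.3):
    """Estimate per-sample scaling factors from control genes only.

    counts: dict mapping gene ID -> list of raw counts, one per sample.
    Returns (library sizes, factors relative to sample `ref`).
    """
    n_samples = len(next(iter(counts.values())))
    lib = [sum(v[s] for v in counts.values()) for s in range(n_samples)]
    factors = []
    for s in range(n_samples):
        # Log ratio of library-size-scaled counts, control genes only.
        m = [math.log2((counts[g][s] / lib[s]) / (counts[g][ref] / lib[ref]))
             for g in control_ids
             if counts[g][s] > 0 and counts[g][ref] > 0]
        factors.append(2 ** trimmed_mean(m, trim))
    return lib, factors


def normalise_cpm(counts, lib, factors):
    """Apply the factors to all genes via effective library sizes."""
    eff = [n * f for n, f in zip(lib, factors)]
    return {g: [1e6 * c / e for c, e in zip(v, eff)]
            for g, v in counts.items()}
```

The factors are estimated on the control subset but applied to every gene,
so the result is only as good as the assumption that the control genes are
truly unchanged between the tissues.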

An interesting idea was put forward by J. Li, D. M. Witten, I. M. Johnstone
and R. Tibshirani: Normalization, testing, and false discovery rate
estimation for RNA-sequencing data. Biostatistics, 13:523 (2012) —
www.biostat.washington.edu/~dwitten/Papers/LiWittenJohnstoneTibs.pdf
They determine the normalisation factor so as to minimize the amount of
differential expression.
(This is one instance of this idea that I am aware of; it has been put
forward for microarrays before. Apologies to anyone else who proposed it.)

Also, if I understood your plots correctly, the biases are relatively small
in amplitude. So you could leave them there, but apply a banded hypothesis
test (i.e. H0: |beta| < theta) rather than H0: beta=0, where beta is the
fold change and theta a positive number. This is, e.g., described in the
DESeq2 vignette.
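
To make the banded test concrete, here is a minimal Wald-type sketch. It
assumes an estimated log fold change beta with standard error se is already
available for each gene, and it treats the estimate as approximately
normal. It illustrates the idea only and is not DESeq2's code; in DESeq2
itself, a threshold test of this kind is requested through the lfcThreshold
and altHypothesis arguments of results(). The function name below is
hypothetical.

```python
import math


def threshold_wald_pvalue(beta, se, theta):
    """P-value for H0: |beta| <= theta versus H1: |beta| > theta.

    beta: estimated log fold change; se: its standard error (> 0);
    theta: non-negative threshold on the same log scale as beta.
    """
    if se <= 0 or theta < 0:
        raise ValueError("se must be positive and theta non-negative")
    z = (abs(beta) - theta) / se
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z), Z ~ N(0, 1)
    return min(1.0, 2.0 * upper_tail)
```

With theta = 0 this reduces to the usual two-sided Wald test of beta = 0. A
small systematic bias in the fold changes then matters less, provided theta
is chosen larger than the amplitude of that bias.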

Best wishes
        Wolfgang


On 08 Sep 2014, at 18:15, Ni Feng <fengni99 at gmail.com> wrote:

> Dear all,
> I have a general question about whether TMM normalization is appropriate
> for my data. I apologize for this long-winded email. I am not a trained
> bioinformatician and therefore have been struggling with some data
> analysis.
>
> A colleague and I did an RNA seq experiment with 6 samples (each had RNA
> pooled from 6 individuals) and no biological replicates. The 6 samples
> included 2 tissue types collected at 3 different time points. I know that
> this is not an ideal experimental set-up; we did this experiment 3 years
> ago.
>
> We used the Trinity package to do most of the transcriptome assembly and
> downstream analyses, such as leveraging edgeR for differential expression.
> Naively, I went on with all downstream analyses without verifying whether
> my data violated the underlying assumptions of TMM normalization.
>
> For example, we found ~30% of our transcripts showed differential
> expression between any 2 pairwise comparisons. Does this violate the TMM
> assumption that most genes are NOT differentially expressed?
>
> Furthermore, we noticed that there is still a tissue bias after
> normalization. Attached is a scatterplot of TMM normalized values for each
> tissue (summed across 3 sample groups for each tissue). Plotted in black
> on top of all transcripts are the CEGs (Core Eukaryotic Genes), which we
> believe should be good candidates for "housekeeping" genes. Both CEGs and
> all genes show that at higher expression levels, there is a skew towards
> one tissue ("VMN"), whereas in the middle values, there is a skew towards
> the other tissue ("H").
>
> I have also attached a density plot of the M values, and a MA plot to
> visualize the skew. These plots were generated from 1 pair of tissue
> comparisons ("SMH" vs "SMV").
>
> These plots reflect the fact that one tissue is more heterogeneous than the
> other. Although TMM normalization is designed to deal with this problem,
> our data seems to need further normalization. Our within-tissue comparisons
> are great and do not show this kind of skew. My questions are:
>
> 1) does our data violate TMM normalization assumptions?
> 2) do you have another normalization method to suggest for our data?
> 3) should we just forget about tissue comparisons?
>
> I have also played around with the suggestions about estimating a
> dispersion value based on the edgeR user guide. I can discuss this further.
>
> Thank you for your time and patience, and any advice is much appreciated.
>
> --
> Ni (Jenny) Ye Feng
> Ph.D. Candidate
> Bass Laboratory
> Cornell University
> Dept of Neurobiology and Behavior
> Ithaca, NY 14853
>
> [Attachments: CEG_FPKM_over_all_090814.png, SMV_SMH_density_log2(M).pdf,
> SMH_SMV_MA_plot_0903.png]




-- 
Ni (Jenny) Ye Feng
Ph.D. Candidate
Bass Laboratory
Cornell University
Dept of Neurobiology and Behavior
Ithaca, NY 14853
