[BioC] RNA-seq differentially expressed gene finding methods

Fri Sep 5 19:31:25 CEST 2014

Thank you Richard, Devon and Paul for the very insightful answers.
I completely agree that the approach I raised above is inappropriate when
the group size is small (3, 4...).
But when the group size is large enough (> 20 or 30), the sampling
distribution of the mean will be (close to) normally distributed, and that
is why I believe that the t-test is OK.
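
The large-sample argument above can be checked with a small simulation. What
follows is only a rough sketch under stated assumptions, not an analysis of
real data: counts are drawn from a negative binomial distribution (as a
gamma-Poisson mixture) with an arbitrary mean of 10 and an arbitrary
dispersion parameter ("size") of 2, and the function names are made up for
this illustration. It estimates how skewed the distribution of the group mean
is for a given group size. It says nothing about statistical power.

```python
import math
import random
import statistics


def sample_neg_binomial(rng, mean, size):
    """Draw one negative binomial count as a gamma-Poisson mixture."""
    # Gamma-distributed Poisson rate with expectation `mean` and shape `size`.
    rate = rng.gammavariate(size, mean / size)
    # Knuth's Poisson sampler; adequate for the small rates used here.
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def skewness(values):
    """Population skewness: third central moment divided by sd cubed."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)


def skewness_of_group_means(group_size, n_groups=4000, mean=10.0, size=2.0, seed=1):
    """Simulate many groups and return the skewness of their sample means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(
            sample_neg_binomial(rng, mean, size) for _ in range(group_size)
        )
        for _ in range(n_groups)
    ]
    return skewness(means)


if __name__ == "__main__":
    # Compare a typical small design with a larger one.
    for n in (3, 30):
        print(n, round(skewness_of_group_means(n), 3))
```

For independent samples the skewness of the mean shrinks in proportion to
1/sqrt(n), so the value for groups of 30 should be clearly smaller than for
groups of 3, although not exactly zero.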

-Son.

On Fri, Sep 5, 2014 at 10:05 AM, Paul Geeleher <paulgeeleher at gmail.com>
wrote:

> Hi Son,
>
> My understanding is that the approach you describe could be considered
> valid for large enough numbers of samples. However, RNA-seq
> experiments will typically have smaller numbers (<30) of samples per
> condition, meaning that a t-test is not valid (because RNA-seq data
> isn't normally distributed). And while I don't think that a
> t-test is "invalid" given enough samples, it's very difficult to
> justify using such a method when much better powered methods have been
> invented specifically for this type of data.
>
> Paul
>
> On Fri, Sep 5, 2014 at 11:52 AM, Richard Friedman
> <friedman at c2b2.columbia.edu> wrote:
> > Dear Son,
> >
> >         The t-test assumes a normal distribution,
> > which is appropriate for continuous variables. RNA-seq
> > data deals with counts (discrete entities). A negative binomial
> > distribution (edgeR, DESeq) or a mean-dependent variance (voom)
> > is much more appropriate. Also, the 3 methods mentioned
> > above estimate variability better, using information from all genes
> > via empirical Bayesian methods, than does the one-gene-at-a-time
> > frequentist t-test.
> >
> > Best wishes,
> > Rich
> > Richard A. Friedman, PhD
> > Associate Research Scientist,
> > Biomedical Informatics Shared Resource
> > Herbert Irving Comprehensive Cancer Center (HICCC)
> > Lecturer,
> > Department of Biomedical Informatics (DBMI)
> > Educational Coordinator,
> > Center for Computational Biology and Bioinformatics (C2B2)/
> > National Center for Multiscale Analysis of Genomic Networks (MAGNet)/
> > Columbia Department of Systems Biology
> > Room 824
> > Irving Cancer Research Center
> > Columbia University
> > 1130 St. Nicholas Ave
> > New York, NY 10032
> > (212)851-4765 (voice)
> > friedman at c2b2.columbia.edu
> > http://friedman.c2b2.columbia.edu/
> >
> > "There is nothing in my Contemporary Jewish Literature course that is
> > either contemporary, Jewish, or literature".
> >
> > -Rose Friedman, age 17
> >
> >
> > On Sep 5, 2014, at 12:44 PM, Son Pham wrote:
> >
> >> Dear all,
> >> I know that we have very good packages (edgeR, DESeq) that calculate
> >> the list of differentially expressed genes between 2 conditions (with
> >> replicates) from raw counts. But I do not know what is wrong with the
> >> following simple approach (and whether other people have been using it):
> >>
> >> 1. Get the (estimated) tpm/fpkm for each gene in each sample
> >> 2. Do a t-test for two groups on each gene.
> >> 3. Adjust the p value for multiple tests (p-adj)
> >>
> >>
> >> Thanks,
> >>
> >> Son.
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
>
>
>
> --
> Dr. Paul Geeleher, PhD
> Section of Hematology-Oncology
> Department of Medicine
> The University of Chicago
> 900 E. 57th St.,
> KCBD, Room 7144
> Chicago, IL 60637
> --
> www.bioinformaticstutorials.com
>
