[BioC] RNA-seq differentially expressed gene finding methods

Gordon K Smyth smyth at wehi.EDU.AU
Sun Sep 7 03:38:58 CEST 2014


For previous discussion on this list see

  https://stat.ethz.ch/pipermail/bioconductor/2013-May/052802.html

This and the voom paper discuss what one needs to do to make t-tests work 
well in the RNA-seq context.

Gordon
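
As a rough illustration of the variance point made in the quoted reply
below (a sketch only, not taken from the voom paper): if counts are
assumed to follow a negative binomial model with mean mu and variance
mu + phi*mu^2, the delta method gives Var(log count) of roughly
1/mu + phi. The dispersion value phi = 0.04 and the function name are
assumptions chosen for illustration.

```python
def approx_log_variance(mean_count, dispersion):
    """Delta-method approximation to Var(log(count)).

    Assumes a negative binomial model with
    Var(count) = mean + dispersion * mean**2 (an illustrative assumption).
    """
    count_variance = mean_count + dispersion * mean_count ** 2
    return count_variance / mean_count ** 2


DISPERSION = 0.04  # hypothetical value, for illustration only

for mean_count in (5, 50, 500, 5000):
    v = approx_log_variance(mean_count, DISPERSION)
    print(f"mean count {mean_count:>5}: approx. variance of log-count = {v:.4f}")
```

Under these assumptions the log-scale variance is much larger for genes
with small counts and levels off near phi for large counts, so genes at
different expression levels do not share a common variance.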


On Sun, 7 Sep 2014, Gordon K Smyth wrote:

> Dear Son,
>
> The problem has little to do with normality or group size and more to do with 
> the fact that fpkm values can have very different variances depending on the 
> size of the original count.  This creates a problem for the t-test, which 
> assumes equal variances.
>
> See the voom paper for discussion of this:
>
> http://genomebiology.com/2014/15/2/R29
>
> Best wishes
> Gordon
>
>> Date: Fri, 5 Sep 2014 10:31:25 -0700
>> From: Son Pham <spham at salk.edu>
>> To: Paul Geeleher <paulgeeleher at gmail.com>
>> Cc: Bioconductor mailing list <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] RNA-seq differentially expressed gene finding
>> 	methods
>> 
>> Thank you Richard, Devon and Paul for the very insightful answers.
>> I completely agree that the approach I raised above is inappropriate when
>> the group size is small (3, 4...).
>> But when the group size is large enough (> 20 or 30), the sampling
>> distribution of the mean will be (close to) normally distributed, and that
>> is why I believe that the t-test is ok.
>> 
>> 
>> -Son.
>> 
>> 
>> 
>> 
>> On Fri, Sep 5, 2014 at 10:05 AM, Paul Geeleher <paulgeeleher at gmail.com>
>> wrote:
>> 
>>> Hi Son,
>>> 
>>> My understanding is that the approach you describe could be considered
>>> valid for large enough numbers of samples. However, RNA-seq
>>> experiments will typically have smaller numbers (<30) of samples per
>>> condition, meaning that a t-test is not valid (because RNA-seq data
>>> isn't normally distributed). And while I don't think that a
>>> t-test is "invalid" given enough samples, it's very difficult to
>>> justify using such a method when much better powered methods have been
>>> invented specifically for this type of data.
>>> 
>>> Paul
>>> 
>>> On Fri, Sep 5, 2014 at 11:52 AM, Richard Friedman
>>> <friedman at c2b2.columbia.edu> wrote:
>>>> Dear Son,
>>>>
>>>>         The t-test assumes a normal distribution,
>>>> which is appropriate for continuous variables. RNA-seq
>>>> data deals with counts (discrete entities). A negative binomial
>>>> distribution (edgeR, DESeq) or a mean-dependent variance (voom)
>>>> is much more appropriate. Also, the 3 methods mentioned
>>>> above estimate variability better, using information from all genes
>>>> via empirical Bayesian methods, than does the one-gene-at-a-time
>>>> frequentist t-test.
>>>> 
>>>> Best wishes,
>>>> Rich
>>>> Richard A. Friedman, PhD
>>>> Associate Research Scientist,
>>>> Biomedical Informatics Shared Resource
>>>> Herbert Irving Comprehensive Cancer Center (HICCC)
>>>> Lecturer,
>>>> Department of Biomedical Informatics (DBMI)
>>>> Educational Coordinator,
>>>> Center for Computational Biology and Bioinformatics (C2B2)/
>>>> National Center for Multiscale Analysis of Genomic Networks (MAGNet)/
>>>> Columbia Department of Systems Biology
>>>> Room 824
>>>> Irving Cancer Research Center
>>>> Columbia University
>>>> 1130 St. Nicholas Ave
>>>> New York, NY 10032
>>>> (212)851-4765 (voice)
>>>> friedman at c2b2.columbia.edu
>>>> http://friedman.c2b2.columbia.edu/
>>>> 
>>>> "There is nothing in my Contemporary Jewish Literature course that is
>>>> either contemporary, Jewish, or literature".
>>>> 
>>>> -Rose Friedman, age 17
>>>> 
>>>> 
>>>> On Sep 5, 2014, at 12:44 PM, Son Pham wrote:
>>>> 
>>>>> Dear all,
>>>>> I know that we have very good packages (edgeR, DESeq) that calculate
>>>>> the list of differentially expressed genes in 2 conditions (with
>>>>> replicates) from raw counts. But I do not know what is wrong with the
>>>>> following simple approach (and whether other people have been using it):
>>>>> 
>>>>> 1. Get the (estimated) tpm/fpkm for each gene in each sample
>>>>> 2. Do a t-test for two groups on each gene.
>>>>> 3. Adjust the p value for multiple tests (p-adj)
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Son.
>>>>> 
>>> 
>>> 
>>> --
>>> Dr. Paul Geeleher, PhD
>>> Section of Hematology-Oncology
>>> Department of Medicine
>>> The University of Chicago
>>> 900 E. 57th St.,
>>> KCBD, Room 7144
>>> Chicago, IL 60637
>>> --
>>> www.bioinformaticstutorials.com
>
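
For reference, step 3 of the simple approach quoted above (adjusting
p-values for multiple tests) could be sketched as follows. The thread
does not say which adjustment is meant; the Benjamini-Hochberg procedure
is assumed here, and the function name and example p-values are made up
for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values, one per gene.
example_pvalues = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(example_pvalues))
```

This only covers the adjustment step; it does nothing about the unequal
variances discussed earlier in the thread, which affect the per-gene
p-values themselves.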
