[BioC] RNA-seq differentially expressed gene finding methods
Gordon K Smyth
smyth at wehi.EDU.AU
Sun Sep 7 03:38:58 CEST 2014
For previous discussion on this list see
https://stat.ethz.ch/pipermail/bioconductor/2013-May/052802.html
This and the voom paper discuss what one needs to do to make t-tests work
well in the RNA-seq context.
Gordon
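
As a rough illustration of the variance point in the quoted message
below, here is a minimal simulation sketch in Python. It assumes purely
Poisson counts, which is a simplification (real RNA-seq data carry extra
biological variation). The function names, the 0.5 offset and the
expected counts 5, 50 and 500 are arbitrary choices made for this
example. For a fixed gene length and library size, FPKM is the count
times a constant, so the same dependence on count size carries over.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count by counting unit-rate arrivals in [0, lam]."""
    arrivals = 0
    t = rng.expovariate(1.0)
    while t <= lam:
        arrivals += 1
        t += rng.expovariate(1.0)
    return arrivals


def variance_by_count_size(expected_counts, n_reps=1000, seed=1):
    """Map each expected count to (variance of counts, variance of log2(count + 0.5))."""
    rng = random.Random(seed)
    out = {}
    for lam in expected_counts:
        counts = [poisson_draw(rng, lam) for _ in range(n_reps)]
        logs = [math.log2(c + 0.5) for c in counts]
        out[lam] = (statistics.variance(counts), statistics.variance(logs))
    return out


result = variance_by_count_size([5, 50, 500])
for lam, (raw_var, log_var) in result.items():
    print(f"expected count {lam:>3}: raw variance {raw_var:8.2f}, "
          f"log2 variance {log_var:.4f}")
```

Under these assumptions the raw-scale variance grows with the expected
count while the log-scale variance shrinks, so neither scale gives equal
variances across count sizes.
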
On Sun, 7 Sep 2014, Gordon K Smyth wrote:
> Dear Son,
>
> The problem has little to do with normality or group size and more to do with
> the fact that fpkm values can have very different variances depending on the
> size of the original count. This creates a problem for the t-test, which
> assumes equal variances.
>
> See the voom paper for discussion of this:
>
> http://genomebiology.com/2014/15/2/R29
>
> Best wishes
> Gordon
>
>> Date: Fri, 5 Sep 2014 10:31:25 -0700
>> From: Son Pham <spham at salk.edu>
>> To: Paul Geeleher <paulgeeleher at gmail.com>
>> Cc: Bioconductor mailing list <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] RNA-seq differentially expressed gene finding
>> methods
>>
>> Thank you Richard, Devon and Paul for very insightful answers.
>> I completely agree that the approach I raised above is inappropriate when
>> the group size is small (3, 4...).
>> But when the group size is large enough (> 20 or 30), the sampling
>> distribution of the mean will be (close to) normally distributed, and that
>> is why I believe that the t-test is ok.
>>
>>
>> -Son.
>>
>>
>>
>>
>> On Fri, Sep 5, 2014 at 10:05 AM, Paul Geeleher <paulgeeleher at gmail.com>
>> wrote:
>>
>>> Hi Son,
>>>
>>> My understanding is that the approach you describe could be considered
>>> valid for large enough numbers of samples. However, RNA-seq
>>> experiments will typically have smaller numbers (<30) of samples per
>>> condition, meaning that a t-test is not valid (because RNA-seq data
>>> isn't normally distributed). However, while I don't think that a
>>> t-test is "invalid" given enough samples, it's very difficult to
>>> justify using such a method when much better powered methods have been
>>> invented specifically for this type of data.
>>>
>>> Paul
>>>
>>> On Fri, Sep 5, 2014 at 11:52 AM, Richard Friedman
>>> <friedman at c2b2.columbia.edu> wrote:
>>>> Dear Son,
>>>>
>>>> The t-test assumes a normal distribution,
>>>> which is appropriate for continuous variables. RNA-seq
>>>> data deals with counts (discrete entities). A negative binomial
>>>> distribution (edgeR, DESeq) or a mean-dependent variance (voom)
>>>> is much more appropriate. Also, the 3 methods mentioned
>>>> above estimate variability better, with information from all genes
>>>> using empirical Bayesian methods, than does the
>>>> one-gene-at-a-time frequentist t-test.
>>>>
>>>> Best wishes,
>>>> Rich
>>>> Richard A. Friedman, PhD
>>>> Associate Research Scientist,
>>>> Biomedical Informatics Shared Resource
>>>> Herbert Irving Comprehensive Cancer Center (HICCC)
>>>> Lecturer,
>>>> Department of Biomedical Informatics (DBMI)
>>>> Educational Coordinator,
>>>> Center for Computational Biology and Bioinformatics (C2B2)/
>>>> National Center for Multiscale Analysis of Genomic Networks (MAGNet)/
>>>> Columbia Department of Systems Biology
>>>> Room 824
>>>> Irving Cancer Research Center
>>>> Columbia University
>>>> 1130 St. Nicholas Ave
>>>> New York, NY 10032
>>>> (212)851-4765 (voice)
>>>> friedman at c2b2.columbia.edu
>>>> http://friedman.c2b2.columbia.edu/
>>>>
>>>> "There is nothing in my Contemporary Jewish Literature course that is
>>>> either contemporary, Jewish, or literature".
>>>>
>>>> -Rose Friedman, age 17
>>>>
>>>>
>>>> On Sep 5, 2014, at 12:44 PM, Son Pham wrote:
>>>>
>>>>> Dear all,
>>>>> I know that we have some very good packages (edgeR, DESeq) that
>>>>> calculate the list of differentially expressed genes in 2
>>>>> conditions (with
>>>>> replicates) from raw counts. But I do not know what is wrong with the
>>>>> following simple approach (and whether other people have been using it):
>>>>>
>>>>> 1. Get the (estimated) tpm/fpkm for each gene in each sample
>>>>> 2. Do a t-test for two groups on each gene.
>>>>> 3. Adjust the p-values for multiple testing (p-adj)
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Son.
>>>>>
>>>
>>>
>>> --
>>> Dr. Paul Geeleher, PhD
>>> Section of Hematology-Oncology
>>> Department of Medicine
>>> The University of Chicago
>>> 900 E. 57th St.,
>>> KCBD, Room 7144
>>> Chicago, IL 60637
>>> --
>>> www.bioinformaticstutorials.com
>
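
Step 3 of the approach in the original question (adjusting p-values for
multiple testing) can be sketched as follows. The thread does not say
which adjustment is meant; this sketch assumes the Benjamini-Hochberg
false discovery rate procedure, one common choice. The function name
bh_adjust and the example p-values are made up for illustration.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    # with a running minimum of p * m / rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from step 2.
example_pvalues = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(example_pvalues))
```

The adjustment only controls the error rate across genes. It does not
repair p-values that are miscalibrated because of the unequal variances
discussed earlier in the thread.
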