[BioC] RNA-seq differentially expressed gene finding methods
Gordon K Smyth
smyth at wehi.EDU.AU
Sun Sep 7 03:20:08 CEST 2014
Dear Son,
The problem has little to do with normality or group size and more to do
with the fact that fpkm values can have very different variances depending
on the size of the original count. This creates a problem for the t-test
which assumes equal variances.
See the voom paper for discussion of this:
http://genomebiology.com/2014/15/2/R29
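
As a rough illustration of the variance point (a sketch under simplifying assumptions, not code from the voom paper): suppose a gene has the same true FPKM in every sample, but the samples are sequenced to different depths. Assuming purely Poisson counting noise (real data also carry biological variation), the observed FPKM is much noisier in the shallow libraries, where the underlying count is small. The gene length, library sizes, and function names below are hypothetical values chosen only for the demonstration.

```python
import math
import random
import statistics


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method.

    Adequate for the modest means used here (roughly mean < 700, beyond
    which exp(-mean) underflows).
    """
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def fpkm_mean_and_variance(true_fpkm, library_millions,
                           gene_length_kb=2.0, n_samples=1000, seed=0):
    """Simulate observed FPKM values for one gene at a given sequencing depth.

    Assumption: counts are Poisson with expectation
    true_fpkm * gene_length_kb * library_millions.
    Returns the sample mean and sample variance of the observed FPKM.
    """
    rng = random.Random(seed)
    scale = gene_length_kb * library_millions  # converts a count to FPKM
    expected_count = true_fpkm * scale
    observed = [sample_poisson(expected_count, rng) / scale
                for _ in range(n_samples)]
    return statistics.mean(observed), statistics.variance(observed)


if __name__ == "__main__":
    # Same true expression level, three hypothetical library sizes.
    for lib in (2.5, 25.0, 250.0):
        mean, var = fpkm_mean_and_variance(true_fpkm=1.0, library_millions=lib)
        print(f"library={lib:6.1f}M reads  mean FPKM={mean:.3f}  variance={var:.5f}")
```

Under these assumptions the mean FPKM stays near 1 in all three settings while the variance shrinks roughly tenfold with each tenfold increase in depth, so observations of equal FPKM are not equally precise.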
Best wishes
Gordon
> Date: Fri, 5 Sep 2014 10:31:25 -0700
> From: Son Pham <spham at salk.edu>
> To: Paul Geeleher <paulgeeleher at gmail.com>
> Cc: Bioconductor mailing list <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] RNA-seq differentially expressed gene finding
> methods
>
> Thank you Richard, Devon and Paul for the very insightful answers.
> I completely agree that the approach I raised above is inappropriate when
> the group size is small (3, 4...).
> But when the group size is large enough (> 20 or 30), the sampling
> distribution of the mean will be (close to) normally distributed, and that
> is why I believe that the t-test is ok.
>
>
> -Son.
>
>
>
>
> On Fri, Sep 5, 2014 at 10:05 AM, Paul Geeleher <paulgeeleher at gmail.com>
> wrote:
>
>> Hi Son,
>>
>> My understanding is that the approach you describe could be considered
>> valid for large enough numbers of samples, however, RNA-seq
>> experiments will typically have smaller numbers (<30) of samples per
>> condition, meaning that a t-test is not valid (because RNA-seq data
>> isn't normally distributed). However, while I don't think that a
>> t-test is "invalid" given enough samples, it's very difficult to
>> justify using such a method when much better powered methods have been
>> invented specifically for this type of data.
>>
>> Paul
>>
>> On Fri, Sep 5, 2014 at 11:52 AM, Richard Friedman
>> <friedman at c2b2.columbia.edu> wrote:
>>> Dear Son,
>>>
>>> The t-test assumes a normal distribution,
>>> which is appropriate for continuous variables. RNA-seq
>>> data deals with counts (discrete entities). A negative binomial
>>> distribution (edgeR, DESeq) or a mean-dependent variance (voom)
>>> is much more appropriate. Also, the 3 methods mentioned
>>> above estimate variability better, using information from all genes
>>> via empirical Bayes methods, than does the one-gene-at-a-time
>>> frequentist t-test.
>>>
>>> Best wishes,
>>> Rich
>>> Richard A. Friedman, PhD
>>> Associate Research Scientist,
>>> Biomedical Informatics Shared Resource
>>> Herbert Irving Comprehensive Cancer Center (HICCC)
>>> Lecturer,
>>> Department of Biomedical Informatics (DBMI)
>>> Educational Coordinator,
>>> Center for Computational Biology and Bioinformatics (C2B2)/
>>> National Center for Multiscale Analysis of Genomic Networks (MAGNet)/
>>> Columbia Department of Systems Biology
>>> Room 824
>>> Irving Cancer Research Center
>>> Columbia University
>>> 1130 St. Nicholas Ave
>>> New York, NY 10032
>>> (212)851-4765 (voice)
>>> friedman at c2b2.columbia.edu
>>> http://friedman.c2b2.columbia.edu/
>>>
>>> "There is nothing in my Contemporary Jewish Literature course that is
>>> either contemporary, Jewish, or literature".
>>>
>>> -Rose Friedman, age 17
>>>
>>>
>>> On Sep 5, 2014, at 12:44 PM, Son Pham wrote:
>>>
>>>> Dear all,
>>>> I know that we have very good packages (edgeR, DESeq) that calculate
>>>> the list of differentially expressed genes in 2 conditions (with
>>>> replicates) from raw counts. But I do not know what is wrong with the
>>>> following simple approach (and whether other people have been using it):
>>>>
>>>> 1. Get the (estimated) tpm/fpkm for each gene in each sample
>>>> 2. Do a t-test for two groups on each gene.
>>>> 3. Adjust the p value for multiple tests (p-adj)
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Son.
>>>>
>>
>>
>> --
>> Dr. Paul Geeleher, PhD
>> Section of Hematology-Oncology
>> Department of Medicine
>> The University of Chicago
>> 900 E. 57th St.,
>> KCBD, Room 7144
>> Chicago, IL 60637
>> --
>> www.bioinformaticstutorials.com