[BioC] edgeR and FDR

Basically, if a global FDR is used with discrete data, then one 
should filter low-expressing genes pretty stringently.  For example, 
one could compute K, the marginal total for a gene at which the 
smallest possible p-value is .001 (e.g. using the ordinary Fisher's 
exact test as an approximation), and use only features with K or more 
reads in the study.  This improves power for the (much smaller number 
of) remaining features, but obviously you will then need to sort 
manually through the low-expressing genes to determine whether you have 
missed something striking (such as all of the K-1 reads falling in a 
single sample).
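
A minimal sketch of the K computation described above, in base R. It assumes a two-group comparison with the reads pooled within each group; the function names and the library sizes (one million reads per group) are hypothetical, chosen only for illustration. The smallest attainable p-value for a given K is taken to be the Fisher p-value when all K reads fall in the first group, which holds when that group's library is no larger than the other.

```r
# Smallest attainable Fisher p-value for a feature with K reads in total,
# obtained by putting all K reads in group 1.
# lib1 and lib2 are hypothetical library sizes (total reads per group);
# lib1 is assumed to be no larger than lib2.
min_fisher_p <- function(K, lib1 = 1e6, lib2 = 1e6) {
  # rows: group 1, group 2; columns: this feature, all other features
  tab <- matrix(c(K, 0, lib1 - K, lib2), nrow = 2)
  fisher.test(tab)$p.value
}

# Smallest K whose best-case p-value reaches the target (.001 in the text).
find_K <- function(target = 0.001, lib1 = 1e6, lib2 = 1e6, max_K = 100) {
  for (K in seq_len(max_K)) {
    if (min_fisher_p(K, lib1, lib2) <= target) return(K)
  }
  NA_integer_  # no K up to max_K reaches the target
}

K <- find_K()

# Hypothetical filtering step, where `counts` is a features-by-samples matrix:
# keep <- rowSums(counts) >= K
```

Features with fewer than K reads can never reach the target p-value under this approximation, so removing them before the FDR adjustment costs no attainable discoveries at that level.
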


At 10:39 AM 6/26/2010, you wrote:
>Hi Naomi,
>I agree that the discreteness of the counts introduces conservatism, 
>and that there is a power differential between low and high 
>expressed genes. However the expected overall FDR is still 
>controlled at a rate less than or equal to the nominal rate, and 
>that is all we promise.
>To reduce the trend in DE vs expression level, I like to combine FDR 
>with a fold-change cutoff or, perhaps better, use a TREAT like test.
>On Sat, 26 Jun 2010, Naomi Altman wrote:
>>Dear Gordon,
>>Thank you for your very detailed and clear answer to my question 
>>about the dispersion model.
>>Regarding FDR:
>>For discrete-valued test statistics, the distribution of the 
>>p-values under the null hypothesis is a discrete uniform which 
>>depends on the marginal total.  As a result,
>>the distribution of p-values from the null hypotheses is a 
>>mixture of discrete uniforms, which can be marginally very 
>>non-uniform.  Even after filtering out low expressing genes, it is 
>>common to see a peak of p-values near 1.0 due to this effect.  It 
>>is less evident that there are multiple other peaks, one at each of 
>>the discrete values of the p-value for each marginal total.  The 
>>result of this is that FDR computations are far too conservative 
>>for lowly expressed genes and far too liberal for highly 
>>expressed genes, which basically magnifies the power differential 
>>that already exists due to the relationship between the mean and variance.
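
The discreteness described in the quoted message can be illustrated with a small base R sketch. It is a simplification, not the edgeR test: it assumes two equally sized libraries, so that under the null each of a gene's K reads falls in either library with probability 0.5, and it uses a two-sided exact binomial p-value. The function name is hypothetical.

```r
# Attainable two-sided p-values, and their null probabilities, for a gene
# with K reads split between two equally sized libraries.
null_pvalues <- function(K) {
  x <- 0:K
  prob <- dbinom(x, K, 0.5)
  # two-sided p-value: total null probability of outcomes that are
  # no more likely than the observed split
  p <- sapply(prob, function(d) sum(prob[prob <= d * (1 + 1e-7)]))
  data.frame(x = x, null_prob = prob, p_value = p)
}

tab <- null_pvalues(5)

# Null probability of obtaining p <= 0.05 for a gene with 5 reads in total
size_05 <- sum(tab$null_prob[tab$p_value <= 0.05])
```

For a small marginal total such as K = 5, only a handful of p-values are attainable, most of the null mass sits at p = 1, and no outcome reaches p <= 0.05. Pooling genes with different marginal totals mixes such distributions together.
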
>>At 05:01 AM 6/26/2010, Gordon K Smyth wrote:
>>>Dear Zhe,
>>>To get FDR, you must use the topTags() function.  Is your de.com 
>>>object a deDGEList object?  If it is, then
>>>   top <- topTags(de.com, n=Inf)
>>>   write.table(top$table, file="yourfile.txt")
>>>will do what you want.  (I can't tell you what level of FDR to use 
>>>as your cutoff though, that's up to you.)
>>>Naomi, I don't know of any problem with FDR from edgeR.  It should 
>>>work just fine.
>>>Best wishes
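
For readers who want to see what an FDR column amounts to, here is a hedged base R sketch. It applies the Benjamini-Hochberg adjustment with p.adjust() to a made-up vector of p-values; the values, the column names and the output file are hypothetical, and this is an illustration of the adjustment rather than a replacement for topTags().

```r
# Hypothetical p-values standing in for per-tag test results
pvals <- c(0.0001, 0.004, 0.03, 0.2, 0.9)

# Benjamini-Hochberg adjusted p-values, i.e. the FDR at which each
# tag would just be called significant
fdr <- p.adjust(pvals, method = "BH")

res <- data.frame(p.value = pvals, FDR = fdr)

# Write the table, including the FDR column, to a tab-delimited file
out_file <- tempfile(fileext = ".txt")
write.table(res, file = out_file, sep = "\t", quote = FALSE)
```
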
>>>------------ original message ---------------
>>>[BioC] edgeR question
>>>Naomi Altman naomi at stat.psu.edu
>>>Fri Jun 25 22:43:51 CEST 2010
>>>Hi Zhe,
>>>1. First normalize and then do the DE
>>>analysis.  (I found this confusing in the vignette, too.)
>>>2. I do not suggest using FDR at this time.  The
>>>standard FDR computations need to be adjusted for
>>>count data.  I do not think this has been worked out yet.
>>>At 12:21 PM 6/25/2010,  wrote:
>>>>I am learning edgeR and would like to use it
>>>>to analyze my Tag-seq and RNA-seq data. I have several questions:
>>>>1. Does the DE analysis using common
>>>>dispersion or moderated tagwise dispersions use
>>>>the TMM method for normalization?  I am not
>>>>sure of the relationship between Section 6
>>>>(Normalization) and the following sections in
>>>>the user manual. I suppose I should normalize
>>>>the data first, and then perform DE analysis.
>>>>2. Do you suggest using P-value < 0.01? What
>>>>about FDR < 0.05? After saving de.tagwise (>
>>>>write.table(de.com[[1]], file =
>>>>"/Users/Zhe/edgeR/page7", sep = "\t")), I found
>>>>there is no FDR column. How can I
>>>>calculate the FDR for each gene and save it in the output file?
>>>>Thanks a lot.
>>>>Best wishes,
