[BioC] edgeR cpm filtering

Mon Feb 11 23:10:05 CET 2013

Hi John,

Please don't take things off-list. Even if you are not a subscriber (and 
if you are using BioC stuff you should be, and you can always stop 
delivery but remain a subscriber), I believe that replying to an 
existing thread will work.

I don't see any zero counts causing a problem. Using the example for 
cpm() as a starting point, and modifying to have a set with zero counts, 
I get this:

 > y
      [,1] [,2] [,3] [,4]
[1,]    1    2   14   11
[2,]   11   25    1   26
[3,]    1   22    2   19
[4,]    5    6   15    6
[5,]    0    0    1    5
 > d <- DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2)))
 > d <- estimateCommonDisp(d)
 > d <- estimateTagwiseDisp(d)
 > topTags(exactTest(d))
Comparison of groups:  2-1
        logFC   logCPM       PValue          FDR
1  2.9550376 12.76964 6.109348e-05 0.0003054674
5  4.6421574 10.54712 1.283343e-01 0.3208358043
4  0.9149142 12.96222 2.668415e-01 0.4447357815
2 -0.4149407 13.93933 8.539261e-01 0.9783799675
3 -0.1325391 13.42121 9.783800e-01 0.9783799675

So the gene with zero counts in group 1 (row 5) is the second row in the 
topTags() output, and edgeR has no problem computing a logFC value for it.
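
To make the arithmetic concrete, here is a minimal base-R sketch (no 
edgeR needed) using the counts from row 5 of y. It contrasts a naive 
ratio of group means, which divides by zero, with the same ratio after 
a small offset is added to every count. The offset of 0.5 and the way 
the library size is adjusted are assumptions chosen for illustration; 
they are not meant to reproduce what edgeR does internally, so the 
value will not match the topTags() output above.

```r
# Counts for row 5 of y: zero in both group-1 libraries, nonzero in group 2.
counts   <- c(0, 0, 1, 5)
lib.size <- 1001:1004
group    <- factor(c(1, 1, 2, 2))

# Naive log fold change: ratio of group-wise mean CPM with no offset.
cpm.naive  <- counts / lib.size * 1e6
mean.naive <- tapply(cpm.naive, group, mean)
lfc.naive  <- unname(log2(mean.naive["2"] / mean.naive["1"]))  # divides by zero

# Same calculation with a small offset added to every count.
# prior = 0.5 is an arbitrary illustrative value (an assumption),
# not edgeR's internal setting.
prior       <- 0.5
cpm.offset  <- (counts + prior) / (lib.size + 2 * prior) * 1e6
mean.offset <- tapply(cpm.offset, group, mean)
lfc.offset  <- unname(log2(mean.offset["2"] / mean.offset["1"]))

is.finite(lfc.naive)   # the naive ratio is not finite
is.finite(lfc.offset)  # the offset version is finite
```

The point is only that a small offset keeps the log ratio finite when 
one group has all zeros, so there is no need to filter such genes out 
just to avoid a division by zero.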

Best,

Jim

On 2/11/2013 4:30 PM, John Sperry wrote:
> Hi again Jim,
>
> One more thing: in microarray days, people used to add a small value, 
> let's say 1, to the 0 values to avoid nonsense fold changes. That's not 
> the case in NGS any more, right? So it's not possible to do that in 
> edgeR, right? That's why I was thinking about filtering out with cpm.
>
> Thanks,
> John
>
>
>
> ------------------------------------------------------------------------
> *From:* John Sperry <johnsperry51 at yahoo.com>
> *To:* "jmacdon at uw.edu" <jmacdon at uw.edu>
> *Sent:* Monday, February 11, 2013 1:47 PM
> *Subject:* [BioC] edgeR cpm filtering
>
> Hi Jim,
>
> I'm very new to edgeR and BioC. I couldn't reply to your post in BioC, 
> so here is my post in an email :D
>
> I still cannot see why 1M is chosen, but I appreciate your explanations.
>
> About the cpm filtering, the reason that I chose '> 2' for 3 samples 
> with each having 2 replicates was that I thought edgeR must be smart 
> enough to figure out that when I say more than 5 reads per million for 
> more than 2 samples, it means for ALL the replicates of each sample, 
> which apparently is not the case! Thanks for pointing that out!
>
> As for the reason for wanting to get rid of the sample 3 with 2 
> replicates that have 0 reads mapped to them: I don't want them 
> because they cause the logFC to become huge nonsense numbers! I 
> guess dividing by 0 causes the problem! So I thought that, to avoid 
> seeing weird values when the significant genes are selected, it's better 
> to get rid of genes that have 0 reads mapped to any of their groups. Does 
> that make sense?
>
> d_DGEList <- d_DGEList[rowSums(cpm_filtered > 5) > 2, ]
>
> Thanks,
> John
>
>
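
On the filter quoted above: rowSums(cpm > 5) > 2 counts libraries and 
knows nothing about which group each library belongs to. A small base-R 
sketch with a made-up CPM matrix (the matrix, gene names and group 
labels are all hypothetical) shows how that differs from a filter that 
requires every replicate of at least one group to pass. The group-aware 
version is shown only to make the difference concrete; it is not an 
edgeR recommendation.

```r
# Hypothetical CPM matrix: 4 genes x 6 libraries (3 groups, 2 replicates each).
cpm.mat <- matrix(c(
  10, 12,  0,  0,  0,  0,   # above 5 in both replicates of group A only
   6,  0,  7,  0,  8,  0,   # above 5 in three libraries, never in a whole group
   0,  0,  0,  0,  0,  0,   # never expressed
  20, 25, 30, 22, 18, 27    # expressed everywhere
), nrow = 4, byrow = TRUE,
dimnames = list(paste0("gene", 1:4), NULL))
group <- factor(rep(c("A", "B", "C"), each = 2))

# Filter from the quoted message: counts libraries, ignores group membership.
keep.lib <- rowSums(cpm.mat > 5) > 2

# Group-aware variant: both replicates of at least one group above 5 CPM.
above      <- cpm.mat > 5
n.above    <- t(rowsum(t(above) * 1, group))  # genes x groups: replicates passing
keep.group <- rowSums(n.above == 2) >= 1      # 2 = replicates per group here

keep.lib
keep.group
```

Here gene1 is dropped by the library-count filter but kept by the 
group-aware one, and gene2 is the other way round.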

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099