[BioC] edgeR cpm filtering

Mon Feb 11 17:54:54 CET 2013

All,

I am a new edgeR user and am having difficulty understanding the 'cpm' function in the edgeR package. I understand that each count is divided by the total library size and then multiplied by 1,000,000. But why 1M? What is the logic behind using 1M? Is it 1M reads, or bases? Why not 10M, or 1000? Is there a specific reason for using 1M?
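For concreteness, this is the calculation as I understand it, sketched in base R rather than with edgeR itself. The count matrix is invented for illustration, and I am assuming that cpm() with default settings amounts to this division by the column totals:

```r
# Hypothetical counts: 3 genes (rows) x 2 samples (columns); values are invented.
counts <- matrix(c(150,   0,  850,
                   300, 100, 1600),
                 nrow = 3,
                 dimnames = list(c("gene_A", "gene_B", "gene_C"),
                                 c("sample_1", "sample_2")))

# Library size = total count per sample (column sum).
lib_size <- colSums(counts)

# Counts per million: divide each column by its library size, then scale by 1e6.
# t(t(x) / v) divides column j of x by v[j].
cpm_manual <- t(t(counts) / lib_size) * 1e6

print(lib_size)
print(cpm_manual)
```

With this definition, the values in each column of cpm_manual add up to one million.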

Another issue I have is how to filter out genes that have 0 reads in one group of samples but a very large number of reads in the other groups. Here is an example:

Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample 2-replicate 1, Sample 2-replicate 2, Sample 3-replicate 1, Sample 3-replicate 2
Gene_X, 150, 100, 270, 320, 0, 0

I used:

d_DGEList  <- d_DGEList[rowSums(cpm_filtered > 5) > 2,]

But Gene_X is still not filtered out. Many genes with a low number of reads are filtered, but a few like Gene_X remain. I think that having many reads mapped in samples 1 and 2 is what lets it pass the cpm filter. How can I filter genes like this? Is it OK if I manually delete cases like this?
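To show why I think Gene_X passes, here is a small base-R sketch of the same rule applied to that one gene. The library sizes are an assumption (10 million reads per library, chosen only for illustration), so the CPM values are not my real ones:

```r
# Gene_X counts from the example above (two replicates for each of three samples).
gene_x <- c(S1_rep1 = 150, S1_rep2 = 100,
            S2_rep1 = 270, S2_rep2 = 320,
            S3_rep1 = 0,   S3_rep2 = 0)

# ASSUMPTION: every library has 10 million reads; the real sizes are not shown here.
lib_size <- rep(1e7, length(gene_x))

# Counts per million for Gene_X in each library.
gene_x_cpm <- gene_x / lib_size * 1e6

# Same rule as the filter above: keep the gene when more than 2 libraries
# have a CPM greater than 5.
n_above <- sum(gene_x_cpm > 5)
keep <- n_above > 2

print(gene_x_cpm)
print(n_above)
print(keep)
```

Under that assumption, four of the six libraries exceed the threshold, so the rule keeps Gene_X even though both replicates of sample 3 are zero.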

Thank you.
John

 -- output of sessionInfo(): 

> sessionInfo()

R version 2.15.0 (2012-03-30)

Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:

[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:

[1] edgeR_2.6.0  limma_3.12.0
>

--
Sent via the guest posting facility at bioconductor.org.