[BioC] edgeR cpm filtering

James W. MacDonald jmacdon at uw.edu
Mon Feb 11 18:52:13 CET 2013


Hi John,

On 2/11/2013 11:54 AM, John [guest] wrote:
> All,
>
> I am a new edgeR user. I have difficulty understanding the meaning of the ‘cpm’ function of edgeR package.  I mean I understand that each value is divided by the total library value, and then multiplied by 1,000,000. But why 1M? I don’t understand what is the logic behind using 1M? is it 1M reads? Or bases? And why not 10M? or 1000? Any specific reason for using 1M?

It's reads. You are inputting read counts per gene, so by definition 
edgeR knows nothing about bases. As for why 1M, I don't know for sure, 
but would imagine it is because a 'reasonable' sized RNA-Seq experiment 
is based on somewhere around 10M reads or so.

In other words, say you have a sample with 10M reads, and one gene has 
60 reads that align. You would have 6 cpm, which looks pretty low, and 
it is. However you could do statistics on it based on a discrete 
distribution like the negative binomial.

If you used 10M to normalize, the cp10M would be 0.6, so you would have 
to throw that gene out. If you used cpk, it would be 6000 cpk and it 
would look like you had lots of reads when in fact you only had 60.

So cpm looks like a reasonable compromise to me.

>
> Another issues that I have is that how can I enforce filtering the samples that have 0 reads in one group of samples, but very large number of reads in another groups? Here is an example:
>
> Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample 2-replicate 1, Sample 2- replicate 2, Sample 3-replicate 1, Sample 3- replicate 2
> Gene_X, 150,100, 270,320,0,0
>
> I used:
>
> d_DGEList<- d_DGEList[rowSums(cpm_filtered>  5)>  2,]
>
> But still Gene_X is not filtered. Many genes with low number of reads are filtered, but very few like Gene_X are still there. I think that having many reads mapped to samples 1 and 2 qualifies it for passing the cpm filtering. How can I filter genes like this? Is it OK if I manually delete cases like this?

Setting aside the logic of removing these genes (which IMO is missing 
the point of looking for differential expression), note the logic of 
your filter. Let's break it down by section:

rowSums(cpm_filtered > 5)

This gives the count of samples where the cpm value is > 5. In the case 
you mention, this would be four.

rowSums(cpm_filtered > 5) > 2

This results in TRUE if the count of samples for a given gene with a cpm 
value > 5 is greater than two. So you are saying that at least two 
samples have to have a cpm > 5. In the instance you want to filter, the 
count is 4, and 4 > 2, so this passes the filter.

So what you apparently want are rows where the cpm is greater than some 
value in ALL samples, not just two of them, so you would want to change 
the > 2 part to == 6.

Note that this doesn't really make any sense. You are in effect saying 
that you are completely uninterested in any genes that appear not to be 
expressed in one of your samples, but that might be highly expressed in 
one or more of the other samples. That to me is actually really 
interesting, and I am not sure why you would want to filter out any gene 
that fulfills that criterion.

Best,

Jim

>
> Thank you.
> John
>
>
>   -- output of sessionInfo():
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
>
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
>
> locale:
>
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
>
> attached base packages:
>
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
> other attached packages:
>
> [1] edgeR_2.6.0  limma_3.12.0
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099



More information about the Bioconductor mailing list