[BioC] [Bioc] RNAseq less sensitive than microarrays? Is it a statistical issue?

Simon Anders anders at embl.de
Tue May 21 21:01:39 CEST 2013


Hi Thomas

On 21/05/13 17:49, Thomas Girke wrote:
> Because of these complications, I am sometimes wondering whether one
> couldn't use for many RNA-Seq use cases coverage values (e.g. mean
> coverage) as raw expression measure instead of read counts. Has anyone
> systematically tested whether this would be a suitable approach for the
> downstream DEG analysis? Right now everyone believes RNA-Seq analysis
> requires read counting, but honestly I don't see why that is. Perhaps
> the benefits of this are so minor that it is not worth dealing with a
> different type of expression measure.

The "complications" I had in mind apply to mean coverage as well as to 
reads.

Actually, at least in my personal opinion, it is quite clear which rules 
one should stick to when obtaining the count table, and therefore, 
obtaining a correct count table is not at all complicated if one thinks 
carefully enough about it.

The main issue is this: Imagine, a number of reads can be mapped to two 
distinct genes, either because the genes overlap, or, more commonly, 
because the genes share repetitive sequence. If you count your reads for 
both genes then both genes will appear differentially expressed even if 
only one gene actually is. Hence, one must discard reads that  cannot be 
unambiguously assigned to one gene. Of course, genes which lose reads in 
this manner will have to low expression estimates, i.e., you incur a 
negative bias. However, this bias cancels out when comparing the same 
gene across samples, i.e., it is not a reason for concern in 
differential expression testing.

Consequently, a method which aims at getting _unbiased_ point estimates 
of expression strength is typically unsuitable as input for strength for 
differential expression testing. The point that counting for expression 
estimation and for DE testing are different tasks is subtle and often 
overlooked. Mistakes arising from this can cause strange effects.

A particularly common mistake is to map the reads to transcripts, not 
genes, or to count overlaps not with annotated genes but with annotated 
transcripts. Of course, most reads will map to several transcripts 
because most transcript isoforms of a given gene overlap heavily. If one 
counts reads mapping to multiple transcripts for each of these, one gets 
severe artifacts in downstream analysis, and if one discards multiply 
mapping reads one loses most of the genes.

I do not know what the original poster meant when she said that she 
counted for UCSC transcripts (rather than genes) but if she took care to 
avoid the common pitfalls just described, she must have employed a quite 
sophisticated construction. This is why my first question was how she 
obtained the counts.

 > Right now everyone believes RNA-Seq analysis
 > requires read counting, but honestly I don't see why that is.

Just to clarify this: I don't think that this is universally believed. 
It is only that a method is typically designed for a specific type of 
data, and several commonly used Bioconductor packages for RNA-Seq 
analysis, namely edgeR, DESeq, BaySeq, DSS, expect count data, because 
the specific statistical methods used in these packages are build on the 
assumption that they get read counts. Obviously, they will not give 
correct results if they are given any other kind of data. This is stated 
quite clearly in the vignettes, but nevertheless people keep asking 
whether they can supply arbitrary other kind of data such as rounded 
coverage values. And this insistence in using methods in a manner 
clearly violating instructions is, quite frankly, a bit frustrating. So 
when you observed that people like me seem to be quite insistent on the 
importance of using counts this may have been with respect to these very 
common question, which are of course specific to certain methods.

   Simon



More information about the Bioconductor mailing list