[BioC] duplicate reads in mRNA-Seq

Sat Feb 12 18:39:01 CET 2011

Hi Jason

> It seems that the duplicate reads are very common in mRNA-seq data.
> Duplicate reads are those being mapped to exact the same chromosome
> location and on the same strand (maybe from PCR amplification). I
> would like to know what are the general practice to deal with it? I
> suspect some of those may contribute to the large overdispersion in
> the final count data.

I know it is sometimes recommended to remove them, but I'd advise against
this.

One of the advantages of RNA-Seq over expression microarrays is the large
gain in dynamic range. On arrays, lowly expressed genes drown in background
fluorescence and highly expressed genes saturate the hybridisation, giving
you a dynamic range of typically little more than 25 dB (i.e., ratios of at
most about 1:300).

In RNA-Seq, very weak genes give rise to fewer than 10 counts while the
strongest genes may give well above 100,000 counts, i.e., the usable
dynamic range now easily reaches 45 or even 50 dB (1:100,000).
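
To make the decibel figures concrete, here is a small sketch of the
conversion. It assumes the 10*log10 convention, which is the one that
matches the numbers above (1:300 comes out at about 25 dB); the function
name is only illustrative.

```python
import math

def dynamic_range_db(ratio):
    """Convert a max/min ratio to decibels, assuming the 10*log10 convention."""
    return 10 * math.log10(ratio)

# Ratios quoted in the text above.
array_db = dynamic_range_db(300)        # microarray: ratios up to about 1:300
rnaseq_db = dynamic_range_db(100_000)   # RNA-Seq: ratios up to about 1:100,000

print(f"microarray: {array_db:.1f} dB")
print(f"RNA-Seq:    {rnaseq_db:.1f} dB")
```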

Now, imagine you counted multiple reads mapping to the same position only
once. Then, a transcript of, say, 1 kb can accumulate at most 1,000 counts
(one per start position), even if it were one of those strongly expressed
ones with a 5-figure raw count. Hence, you would dramatically squash your
dynamic range and lose all hope of linearity (i.e., you could no longer
expect the count rate to be at least roughly proportional to the
concentration).
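
The saturation effect can be illustrated with a toy simulation. This is
only a sketch under simplifying assumptions that are mine, not taken from
real data: read start positions are drawn uniformly at random along a
1,000-base transcript, only one strand is considered, and a "duplicate" is
any read sharing a start position with another read.

```python
import random

def deduplicated_count(n_reads, transcript_length, seed=0):
    """Number of reads left after keeping one read per start position.

    Assumption: start positions are uniform over the transcript.
    """
    rng = random.Random(seed)
    distinct_starts = {rng.randrange(transcript_length) for _ in range(n_reads)}
    return len(distinct_starts)

LENGTH = 1000  # toy transcript length in bases

# Raw read count -> count after removing duplicates.
results = {n: deduplicated_count(n, LENGTH) for n in (10, 100, 1_000, 10_000, 50_000)}

for raw, dedup in results.items():
    print(f"raw = {raw:>6}   after duplicate removal = {dedup:>5}")
```

For weakly expressed transcripts the two counts stay close, but the
deduplicated count can never exceed the number of positions, so strongly
expressed transcripts all end up near the same ceiling.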

Of course, if there are PCR artifacts, they destroy the linearity as well.
So, if you have an exon to which only very few reads map, except for one
specific position that shows a pile of hundreds of reads, all with
precisely the same coordinates, then there is reason for concern. I have
seen such "towers" only rarely in RNA-Seq. (Actually, I haven't seen them
at all recently, but I think they were a common concern two years ago. I
wonder where they went. Did they maybe improve the PCR steps of the library
preparation protocols?)
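
If you want to look for such towers rather than remove all duplicates, a
rough check could be sketched like this. The function and both thresholds
(a pile of at least 100 reads that holds at least 90% of the exon's reads)
are hypothetical choices for illustration, not an established criterion.

```python
from collections import Counter

def find_towers(read_starts, min_pile=100, min_fraction=0.9):
    """Return start positions that look like PCR 'towers' within one exon.

    Hypothetical heuristic: a position is flagged if it holds at least
    `min_pile` reads and those reads make up at least `min_fraction` of
    all reads mapped to the exon.
    """
    counts = Counter(read_starts)
    total = sum(counts.values())
    return [pos for pos, n in sorted(counts.items())
            if n >= min_pile and n / total >= min_fraction]

# Toy exon with a pile of 300 identical reads and five reads elsewhere.
suspicious_exon = [512] * 300 + [40, 97, 230, 610, 888]

# Toy exon with high but even coverage: many duplicates, no tower.
well_covered_exon = list(range(1000)) * 3

print(find_towers(suspicious_exon))
print(find_towers(well_covered_exon))
```

The second case is the point of the argument above: evenly spread
duplicates on a strongly expressed transcript are expected and are not
flagged.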

  Simon