[BioC] duplicate reads in mRNA-Seq

Jason Lu jasonlu68 at gmail.com
Sun Feb 13 21:34:31 CET 2011


Thanks Simon for the insightful comments.
I think you are right on this. In an empirical comparison I just did
between the RNA-Seq and quantitative-PCR data, the unfiltered RNA-Seq
data seems to give better concordance with the PCR data (based on fold
change).

Thanks again,
Jason


On Sat, Feb 12, 2011 at 12:39 PM, Simon Anders <anders at embl.de> wrote:
> Hi Jason
>
>> It seems that duplicate reads are very common in mRNA-seq data.
>> Duplicate reads are those mapped to exactly the same chromosome
>> location and on the same strand (maybe from PCR amplification). I
>> would like to know what the general practice is for dealing with them.
>> I suspect some of them may contribute to the large overdispersion in
>> the final count data.
>
> I know it is sometimes recommended to remove them but I'd advise against
> this.
>
> One of the advantages of RNA-Seq over expression microarrays is the large
> gain in dynamic range. On arrays, lowly expressed genes drown in background
> fluorescence and highly expressed genes saturate the hybridisation, giving
> you a dynamic range of typically little more than 25 dB (i.e., ratios of up
> to at most 1:300).
>
> In RNA-Seq, very weak genes give rise to fewer than 10 counts while the
> strongest genes may give well above 100,000 counts, i.e., the usable
> dynamic range now easily exceeds 45 dB or 50 dB (1:100,000).
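
A rough sketch of the decibel arithmetic used here, assuming the
10*log10 convention (inferred from 25 dB corresponding to roughly 1:300).
The function names are made up for illustration only:

```python
import math

def ratio_to_db(ratio):
    """Convert a ratio such as 100000 (for 1:100,000) to decibels."""
    return 10 * math.log10(ratio)

def db_to_ratio(db):
    """Convert decibels back to a plain ratio."""
    return 10 ** (db / 10)

# Array example from the post: about 25 dB
print(round(db_to_ratio(25)))         # 316, i.e. roughly 1:300

# RNA-Seq example from the post: 1:100,000
print(round(ratio_to_db(100_000), 6)) # 50.0
```
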
>
> Now, imagine you counted several reads mapping to the same position at
> most once. Then, a transcript of, say, 1 kb can accumulate at most 1,000
> counts, even if it were one of those strongly expressed ones with a 5-figure
> raw count. Hence, you would dramatically squash your dynamic range and lose
> all hope for linearity (i.e., you can no longer expect that the count
> rate is at least roughly proportional to the concentration).
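
The saturation argument can be illustrated with a small toy simulation.
This is only a sketch under simplifying assumptions that are not in the
post: read start positions are drawn uniformly, only one strand is
considered, and every base of the transcript is a valid start position.
The function name collapsed_count is hypothetical:

```python
import random

def collapsed_count(n_reads, transcript_len, rng):
    """Count reads after collapsing those that share a start position."""
    starts = [rng.randrange(transcript_len) for _ in range(n_reads)]
    # Reads with identical start positions are counted only once
    return len(set(starts))

rng = random.Random(0)  # fixed seed so the run is repeatable
for n_reads in (10, 1_000, 50_000):
    # The collapsed count can never exceed the transcript length (1,000)
    print(n_reads, collapsed_count(n_reads, 1_000, rng))
```

For weakly expressed transcripts the collapsed count stays close to the
raw count, but for strongly expressed ones it flattens out at the
transcript length, which is the loss of linearity described above.
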
>
> Of course, if there are PCR artifacts, they destroy the linearity as well.
> So, if you have an exon to which only very few reads map except for one
> specific position that shows a pile of hundreds of reads, all with
> precisely the same coordinates, then there is reason for concern. I have seen
> such "towers" only rarely in RNA-Seq. (Actually, I haven't seen them at all
> recently, but I think they were a common concern two years ago. I wonder
> where they went. Did they maybe improve the PCR steps of the library
> preparation protocols?)
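
One possible way to look for such "towers" in the read start positions
of a single exon is sketched below. This is not an established method;
the function name find_towers and the thresholds min_pile and fold are
arbitrary choices made up for illustration:

```python
from collections import Counter
from statistics import median

def find_towers(starts, min_pile=100, fold=20):
    """Return start positions whose pile dwarfs the rest of the exon.

    min_pile and fold are hypothetical thresholds, not recommended values.
    """
    counts = Counter(starts)
    towers = []
    for pos, n in counts.items():
        # Background: typical pile height at the other covered positions
        others = [c for p, c in counts.items() if p != pos]
        background = median(others) if others else 0
        if n >= min_pile and n >= fold * max(background, 1):
            towers.append(pos)
    return sorted(towers)

# Toy exon: a handful of scattered reads plus 300 reads at position 200
starts = [5, 40, 40, 77, 130] + [200] * 300
print(find_towers(starts))  # [200]
```
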
>
>  Simon
>
>
>
>


