[BioC] running time: countOverlaps & summarizedOverlaps vs. HTSeq

Wei Shi shi at wehi.EDU.AU
Thu Mar 29 00:07:23 CEST 2012


Hi Milica,

You could try the featureCounts function in the Rsubread package, which takes only ~2 minutes to get read counts for mouse genes from a 10-million-read dataset. The function calls a C function to do the read counting, so the entire operation is carried out in C space rather than in R space. It is not only fast but also has a small memory footprint, since it does not need to create large R objects at all.
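
For illustration only, here is a minimal Python sketch of the counting task itself: assigning each read to the gene it overlaps. It is not the featureCounts or HTSeq implementation. The function name count_reads, the "no_feature" and "ambiguous" labels, and all coordinates are made up for the example, and it assumes a single chromosome with genes that are sorted by start and do not overlap each other. The per-read lookup in the loop is the part that runs millions of times, which is why keeping it out of interpreted code matters so much.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def count_reads(genes, reads):
    """Count reads per gene on a single chromosome.

    genes: list of (gene_id, start, end) with closed coordinates, sorted by
           start and assumed not to overlap each other (a simplification).
    reads: iterable of (start, end) closed intervals with start <= end.
    """
    starts = [g[1] for g in genes]
    ends = [g[2] for g in genes]
    counts = Counter()
    for r_start, r_end in reads:
        # Index of the first gene that ends at or after the read start.
        lo = bisect_left(ends, r_start)
        # One past the last gene that starts at or before the read end.
        hi = bisect_right(starts, r_end)
        hits = hi - lo  # number of genes the read overlaps
        if hits == 1:
            counts[genes[lo][0]] += 1
        elif hits == 0:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


# Made-up coordinates, purely for illustration.
genes = [("geneA", 100, 200), ("geneB", 300, 400)]
reads = [(90, 110), (150, 160), (195, 305), (250, 260), (400, 450)]
counts = count_reads(genes, reads)
print(dict(counts))
```

Each read costs two binary searches over the gene list, so the total work grows roughly with (number of reads) x log(number of genes) rather than with reads x genes.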

Cheers,
Wei

On Mar 28, 2012, at 10:04 PM, Milica Krunic wrote:

> Hello!
> 
> I am working with cat RNA-Seq data and, after mapping, I wanted to get the
> count tables. I tried countOverlaps and summarizeOverlaps in R, and HTSeq
> in Python. In R, either method takes ~20 hours per sorted .bam file
> (~20*10^6 reads); in Python, HTSeq takes ~15 minutes. For the R methods I
> used a GRangesList object downloaded directly in R from BioMart; for HTSeq
> I used the GTF file provided on the Ensembl homepage. The average cat gene
> width in the GRangesList is about 44000 bp.
> Does anyone know why getting count tables in R takes so long compared to
> HTSeq?
> 
> 
> Thank you!
> 
> Best,
> Milica
> 

