[BioC] rtracklayer: import.gff seems to be very slow

Michael Dondrup Michael.Dondrup at uni.no
Wed Oct 20 16:04:48 CEST 2010


I just installed R 2.12.0, Bioconductor 2.7, and rtracklayer 1.10, and I can confirm that there is a major improvement 
in the speed of import.gff. 

Thanks a lot for this fix.

On Oct 16, 2010, at 6:39 AM, Michael Lawrence wrote:

> Wow, thanks for a serious test file. There were some bugs and some rather interesting performance issues. 
> For example, I've discovered that gregexpr with fixed=TRUE is quadratic time with respect to string length (gets real bad up in the millions). Haven't been able to figure out why. This makes fixed=FALSE much quicker. Counterintuitive. substring() is also surprisingly slow.
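
For anyone who wants to check the gregexpr observation on their own R build, here is a rough timing sketch. The test string and its length are made up for illustration, and the timings depend on the R version, so the quadratic behaviour described above may not reproduce on other releases.

```r
# Build one long string of GFF-style "key=value;" pairs (synthetic data).
n_pairs <- 50000
x <- paste(rep("ID=gene1;Name=abc;", n_pairs), collapse = "")

# Time a literal search and a regex search for the same separator.
t_fixed <- system.time(m_fixed <- gregexpr(";", x, fixed = TRUE))
t_regex <- system.time(m_regex <- gregexpr(";", x, fixed = FALSE))

# Both calls should report identical match positions.
stopifnot(identical(as.integer(m_fixed[[1]]), as.integer(m_regex[[1]])))

# Elapsed seconds for each variant.
print(c(fixed = t_fixed[["elapsed"]], regex = t_regex[["elapsed"]]))
```

Raising n_pairs step by step and watching how the two elapsed times grow is probably the simplest way to see whether one variant scales worse than the other.
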
> Anyway, try the latest SVN.  Or version 1.9.12.
> It is still much slower than read.delim. It's the attributes in the last column (being translated to columns in R) that are so costly, and your file has them in significant quantity. I guess I could give an option to disable that parsing (or in general select the desired columns, as suggested previously), but it should be much quicker for you now.
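
To illustrate what translating the attributes column into R columns involves, here is a minimal base-R sketch. It is not how rtracklayer does it; parse_gff_attributes is a hypothetical helper, and it assumes GFF3-style "key=value" pairs separated by ";" with no URL-escaped characters in the values.

```r
# Hypothetical helper: expand a GFF3 attributes column into one column per key.
parse_gff_attributes <- function(attr_col) {
  # Split every record into its "key=value" pairs.
  pairs <- strsplit(attr_col, ";", fixed = TRUE)
  row <- rep(seq_along(pairs), lengths(pairs))
  kv <- unlist(pairs)

  # Locate the first "=" in each pair; drop pairs that have none.
  eq <- as.integer(regexpr("=", kv, fixed = TRUE))
  keep <- eq > 0
  kv <- kv[keep]; eq <- eq[keep]; row <- row[keep]

  key <- substr(kv, 1, eq - 1)
  val <- substr(kv, eq + 1, nchar(kv))

  # Fill a character matrix (records x keys); absent keys stay NA.
  keys <- unique(key)
  out <- matrix(NA_character_, nrow = length(attr_col), ncol = length(keys),
                dimnames = list(NULL, keys))
  out[cbind(row, match(key, keys))] <- val
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Small made-up example records.
attrs <- c("ID=gene0;Name=dnaA", "ID=gene1;locus_tag=BSU00020", "")
res <- parse_gff_attributes(attrs)
print(res)
```

The nine tab-separated GFF columns could be read first with read.delim (header = FALSE, comment.char = "#", quote = ""), and a helper like this applied to the last column only when the attributes are actually needed.
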
> Thanks again,
> Michael
> On Fri, Oct 15, 2010 at 2:40 AM, Michael Dondrup <Michael.Dondrup at uni.no> wrote:
> Hi,
> I am trying to read in a genome annotation from a GFF3 file from NCBI [1]
> The file is about 7.5 MB and has ~17000 non-comment lines. While I can read the file
> with read.delim in less than a second, trying
> bsub = import.gff("~/Downloads/bsubtilis.gff")
> is very slow. I would rather use a standardized function from the package
> that understands various formats, but currently I cannot use it for whole-genome
> annotation. Could this be improved, or is the file format incorrect?
> Best
> Michael
> [1]: ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_subtilis/AL009126.gff
> > sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] rtracklayer_1.8.1 RCurl_1.4-2       bitops_1.0-4.1
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0       Biostrings_2.16.0   BSgenome_1.16.1
> [4] GenomicRanges_1.0.9 IRanges_1.6.6       XML_3.1-0
> >
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Michael Dondrup
Post-doctoral researcher
Thormøhlensgate 55, N-5008 Bergen, Norway
Phone: +47 55584157 Fax: +47 55584354
Please note my new phone number
