[BioC] Getting the length of every element from a large CompressedIRangesList is slow

Nicolas Delhomme delhomme at embl.de
Mon Jul 2 19:02:45 CEST 2012


Hej!

I have a rather large CompressedIRangesList:

> print(object.size(aln.ranges), units="Mb")
390.4 Mb

It has 2518 elements. A few of them hold up to 6M ranges, for a total of 51M ranges, but the vast majority are small: the median element length is 2 and the 3rd quartile is 47, while the mean is ~20,000.

Retrieving the length of every element is slow:

> system.time(sizes <- sapply(aln.ranges, length))
   user  system elapsed 
265.777 169.222 443.498 

This is slow compared with the performance of the IRanges package in general, which surprised me. Is there a faster way to get this information than the sapply() call I'm using? Note that the machine I'm using is not a limiting factor in terms of CPU, RAM or load.
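
To make the question concrete, here is a minimal sketch of the kind of vectorized alternative being asked about. It assumes the list-level accessor elementLengths() is available in the installed IRanges (later releases renamed it elementNROWS()), and that such an accessor reads the lengths from the list's partitioning instead of extracting each element. The object "toy" is a hypothetical stand-in for aln.ranges, not the real data, and nothing here has been timed on the real object.

```r
library(IRanges)

# Hypothetical toy stand-in for aln.ranges: three elements of length 2, 0 and 3.
toy <- IRangesList(a = IRanges(start = c(1, 10), width = 5),
                   b = IRanges(),
                   c = IRanges(start = 1:3, width = 2),
                   compress = TRUE)

# Element-by-element loop, as in the post: each element is extracted as its
# own IRanges object before length() is called on it.
sizes.loop <- sapply(toy, length)

# Vectorized accessor: elementLengths() in IRanges of this era,
# elementNROWS() in later releases. Use whichever one is available.
sizes.vec <- if (exists("elementNROWS")) elementNROWS(toy) else elementLengths(toy)

# Both approaches should agree on the toy object.
stopifnot(identical(as.integer(sizes.loop), as.integer(sizes.vec)))
```

On the real object, the equivalent call to time would be system.time(sizes <- elementLengths(aln.ranges)).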

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] IRanges_1.15.15    BiocGenerics_0.3.0

loaded via a namespace (and not attached):
[1] stats4_2.15.1

Nico

P.S. If needed, I can send my aln.ranges object off-list.

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany


