[BioC] Getting the length of every element from a large CompressedIRangesList is slow

Hervé Pagès hpages at fhcrc.org
Mon Jul 2 20:25:35 CEST 2012


Hi Nico,

Even faster:

   > system.time(sizes <- elementLengths(exbytx))
      user  system elapsed
     0.000   0.000   0.001
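
One way to picture why this can be nearly instantaneous: a compressed list keeps all of its elements concatenated in a single flat vector, together with the end position of each element, so the lengths can be derived from those positions without extracting any element. The snippet below is only a toy model in base R, not the IRanges implementation; the names `elts`, `flat`, `ends` and `sizes` are made up for illustration.

```r
# Toy model of a "compressed" list (hypothetical names, base R only):
# one flat vector plus the cumulative end position of each element.
elts <- list(a = 1:4, b = integer(0), c = 5:6)
flat <- unlist(elts, use.names = FALSE)
ends <- unname(cumsum(vapply(elts, length, integer(1))))

# The element lengths fall out of the end positions in one vectorized
# step; no element is ever extracted from 'flat'.
sizes <- diff(c(0L, ends))
names(sizes) <- names(elts)
sizes
```

By contrast, sapply(x, length) has to materialize every element before measuring it, which is presumably where the time goes on a very large object.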

Note that you can use elementLengths on any list-like object
("list-like" = list or List class or subclass):

   > x <- rep(list(a=1:4, b=letters), 500000)
   > length(x)
   [1] 1000000
   > system.time(x_eltlens <- sapply(x, length))
      user  system elapsed
     3.132   0.008   3.142
   > system.time(x_eltlens2 <- elementLengths(x))
      user  system elapsed
     0.024   0.000   0.023
   > identical(x_eltlens, x_eltlens2)
   [1] TRUE
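
For ordinary lists, the same comparison can be sketched without any Bioconductor package. This assumes R >= 3.2.0, where base R provides lengths(); it was not available in the R 2.15.1 session shown in this thread. The object names are illustrative, and the list is kept small so that the example runs instantly.

```r
# Three ways to get element lengths of a plain list (base R only).
x <- rep(list(a = 1:4, b = letters), 5)

len_sapply <- sapply(x, length)              # loops in R, then simplifies
len_vapply <- vapply(x, length, integer(1))  # loops in R, type-checked result
len_base   <- lengths(x)                     # vectorized; needs R >= 3.2.0

# All three should agree, names included.
stopifnot(identical(len_sapply, len_vapply),
          identical(len_sapply, len_base))
```

On a list with a million elements, the vectorized call is the one to expect the large speed-up from, in line with the timings above.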

HTH,

H.

On 07/02/2012 10:18 AM, Nicolas Delhomme wrote:
> Hi,
>
> Just to extend on my previous message:
>
> Doing this instead is fast:
>
>> system.time(sizes <- sapply(width(aln.ranges),length))
>
>    user  system elapsed
>    1.109   0.144   1.254
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On Jul 2, 2012, at 7:02 PM, Nicolas Delhomme wrote:
>
>> Hej!
>>
>> I've a rather large CompressedIRangesList
>>
>>> print(object.size(aln.ranges),unit="Mb")
>> 390.4 Mb
>>
>> that has 2518 elements, some of which have up to 6M ranges, for a total of 51M ranges. The vast majority are small: the median is 2 and the 3rd quartile is 47, while the mean is ~20,000.
>>
>> Retrieving the element length is slow:
>>
>>> system.time(sizes <- sapply(aln.ranges,length))
>>
>>    user  system elapsed
>> 265.777 169.222 443.498
>>
>> compared with the performance of the IRanges package in general, which surprised me. Is there a faster way to get this information than the sapply I'm using? Note that the machine I'm using is not a limiting factor in terms of CPU/RAM/load.
>>
>>> sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] C/UTF-8/C/C/C/C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] IRanges_1.15.15    BiocGenerics_0.3.0
>>
>> loaded via a namespace (and not attached):
>> [1] stats4_2.15.1
>>
>> Nico
>>
>> P.S. If you need, I can send my aln.ranges object off-list.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


