[BioC] Getting the length of every element from a large CompressedIRangesList is slow

Hervé Pagès hpages at fhcrc.org
Tue Jul 10 21:08:52 CEST 2012


Nico,

On 07/03/2012 12:36 AM, Nicolas Delhomme wrote:
> That's great! Thanks Hervé.
>
> I remember seeing that in a thread on the mailing list, but couldn't recall it, and I couldn't find it in the documentation. Could it be made more obvious by adding it to the "see also" section of the IRangesList Rd page, as well as to the IRangesList-utils Rd page? That would be great too :-)

Good point. The elementLengths() generic is documented in the man
page for List because, like [[, elementType(), lapply(), endoapply(),
etc., it is basic functionality of any List object, i.e. of any
object that belongs to a concrete subclass of List. Note that there
are more than 90 List subclasses defined in the IRanges package.
Each subclass of course inherits all the methods defined for its
parent classes and defines its own specific generics/methods.

IRangesList derives from List via RangesList:

   List <-- RangesList <-- IRangesList
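
The same inheritance pattern can be sketched with toy S4 classes. The
Demo* class names and demoLengths() below are hypothetical and are not
the IRanges definitions; the only point is that a method written once
for the top class is found for every subclass:

```r
library(methods)

# Hypothetical toy hierarchy mirroring List <-- RangesList <-- IRangesList.
setClass("DemoList", representation("VIRTUAL", elements = "list"))
setClass("DemoRangesList", contains = "DemoList")
setClass("DemoIRangesList", contains = "DemoRangesList")

# One generic, with a single method defined on the top-level class only.
setGeneric("demoLengths", function(x) standardGeneric("demoLengths"))
setMethod("demoLengths", "DemoList", function(x) {
  vapply(x@elements, length, integer(1))
})

# An instance of the most derived class picks up the inherited method.
x <- new("DemoIRangesList", elements = list(a = 1:4, b = letters))
demo_lens <- demoLengths(x)
```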

What was missing was a "see also" section in the man page for the
RangesList class that points to the man page for List. I just added
it in IRanges 1.15.19. Hopefully that will make it easier for the
user to discover elementLengths() as well as any of the other basic
List functionalities.
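
For what it's worth, here is a rough base-R sketch of why asking a
compressed list for its element lengths can be so much cheaper than
sapply(x, length). It assumes the compressed layout is one flat vector
plus the end position of each element; flat_values and elt_ends are
made-up names for illustration, not the actual IRanges internals:

```r
# Toy stand-in for a compressed list: all values stored in one flat
# vector, plus the end position of each list element (hypothetical names).
flat_values <- c(11L, 12L, 13L, 21L, 31L, 32L)
elt_ends    <- c(3L, 4L, 6L)

# Fast route: the lengths fall out of the end positions directly,
# without touching the values at all.
fast_lengths <- diff(c(0L, elt_ends))

# Slow route: materialise every element, then ask for its length.
elt_starts <- c(1L, head(elt_ends, -1L) + 1L)
slow_lengths <- mapply(
  function(s, e) length(flat_values[seq.int(s, length.out = e - s + 1L)]),
  elt_starts, elt_ends
)

# Both routes agree; only the amount of work differs.
stopifnot(identical(fast_lengths, as.integer(slow_lengths)))
```

The slow route does one extraction per element, which is what a plain
sapply() over the list amounts to under this assumption.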

Cheers,
H.

>
> Cheers,
>
> Nico
>
>
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On Jul 2, 2012, at 8:25 PM, Hervé Pagès wrote:
>
>> Hi Nico,
>>
>> Even faster:
>>
>>   > system.time(sizes <- elementLengths(exbytx))
>>      user  system elapsed
>>     0.000   0.000   0.001
>>
>> Note that you can use elementLengths on any list-like object
>> ("list-like" = list or List class or subclass):
>>
>>   > x <- rep(list(a=1:4, b=letters), 500000)
>>   > length(x)
>>   [1] 1000000
>>   > system.time(x_eltlens <- sapply(x, length))
>>      user  system elapsed
>>     3.132   0.008   3.142
>>   > system.time(x_eltlens2 <- elementLengths(x))
>>      user  system elapsed
>>     0.024   0.000   0.023
>>   > identical(x_eltlens, x_eltlens2)
>>   [1] TRUE
>>
>> HTH,
>>
>> H.
>>
>> On 07/02/2012 10:18 AM, Nicolas Delhomme wrote:
>>> Hi,
>>>
>>> Just to extend on my previous message:
>>>
>>> Doing this instead is fast:
>>>
>>>> system.time(sizes <- sapply(width(aln.ranges),length))
>>>
>>>    user  system elapsed
>>>    1.109   0.144   1.254
>>>
>>> Cheers,
>>>
>>> Nico
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Genome Biology Computational Support
>>>
>>> European Molecular Biology Laboratory
>>>
>>> Tel: +49 6221 387 8310
>>> Email: nicolas.delhomme at embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>> ---------------------------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>> On Jul 2, 2012, at 7:02 PM, Nicolas Delhomme wrote:
>>>
>>>> Hej!
>>>>
>>>> I've a rather large CompressedIRangesList
>>>>
>>>>> print(object.size(aln.ranges),unit="Mb")
>>>> 390.4 Mb
>>>>
>>>> that has 2518 elements, some of which have up to 6M ranges, for a total of 51M. The vast majority are small: the median is 2 and the 3rd quartile is 47, while the mean is ~ 20,000.
>>>>
>>>> Retrieving the element length is slow:
>>>>
>>>>> system.time(sizes <- sapply(aln.ranges,length))
>>>>
>>>> user  system elapsed
>>>> 265.777 169.222 443.498
>>>>
>>>> compared with the performance of the IRanges package in general, which surprised me. Is there a faster way to get this information than the sapply I'm using? Note that the machine I'm using is not a limiting factor in terms of CPU/RAM/load.
>>>>
>>>>> sessionInfo()
>>>> R version 2.15.1 (2012-06-22)
>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>
>>>> locale:
>>>> [1] C/UTF-8/C/C/C/C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] IRanges_1.15.15    BiocGenerics_0.3.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] stats4_2.15.1
>>>>
>>>> Nico
>>>>
>>>> P.S. If you need, I can send my aln.ranges object off-list.
>>>>
>>>> ---------------------------------------------------------------
>>>> Nicolas Delhomme
>>>>
>>>> Genome Biology Computational Support
>>>>
>>>> European Molecular Biology Laboratory
>>>>
>>>> Tel: +49 6221 387 8310
>>>> Email: nicolas.delhomme at embl.de
>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>> 69102 Heidelberg, Germany
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


