[BioC] Why is *ply-ing over a GRangesList much slower than *ply-ing over an IRangesList?

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed Aug 25 16:47:23 CEST 2010


Hi Michael,

On Wed, Aug 25, 2010 at 10:21 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> My guess is that your GRangesList is compressed, whereas the IRangesList is
> uncompressed. Extracting an element from a compressed list will be slower
> due to the compression.

Actually, the IRangesList from the example above is also compressed:

R> is(irl)
[1] "CompressedIRangesList" "IRangesList"           "CompressedList"
[4] "RangesList"            "Sequence"              "Annotated"

So I'm not sure that is what's causing the speed difference, right?

I wrote the portion below before checking whether `irl` was compressed,
but I'm still curious, so I'll keep the question, assuming there is
some significant speed difference between iterating over compressed
and uncompressed lists anyway:

My next question was whether there is any way to have an uncompressed
GRangesList, so I went poking around the IRanges/GenomicRanges code.

It seems the answer to that is no, since GRangesList extends/contains
CompressedList ... right?

Would it be (technically) possible to have something like
CompressedGRangesList and a "normal" GRangesList -- analogous to how
we currently have an IRangesList and CompressedIRangesList ... or is
there some other reason that all GRangesList must be CompressedLists?
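
To make the compressed-vs-uncompressed point concrete, here is a toy
sketch in base R. It is not the IRanges implementation: `flat`, `ends`,
`starts` and `get_element` are made-up names, and the layout (one flat
vector plus end offsets) is only my assumption of what "compressed"
roughly means. It shows why per-element extraction does extra work,
while the lengths can be read straight off the offsets:

```r
# Toy model of a "compressed" list: one flat vector plus end offsets.
# This is NOT the IRanges implementation, only an illustration.
set.seed(1)
lens   <- sample(1:5, 1000, replace = TRUE)  # length of each list element
flat   <- seq_len(sum(lens))                 # all elements concatenated
ends   <- cumsum(lens)                       # end offset of each element
starts <- ends - lens + 1L                   # start offset of each element

# Element-wise access: every iteration subsets the flat vector.
get_element <- function(i) flat[starts[i]:ends[i]]
l1 <- vapply(seq_along(ends),
             function(i) length(get_element(i)),
             integer(1))

# Vectorized alternative: the lengths come directly from the offsets,
# without extracting any element.
l2 <- diff(c(0L, ends))

identical(l1, l2)
```

In IRanges the analogous shortcut should be elementLengths(), assuming
that accessor is available for both list types in these versions.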

Thanks,
-steve


>
> Michael
>
> On Tue, Aug 24, 2010 at 7:31 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>>
>> Hi,
>>
>> Looping with any of the *ply functions (lapply, sapply, seqapply, etc.)
>> seems to be significantly slower when you are iterating over a
>> GRangesList vs. an IRangesList:
>>
>> R> library(GenomicFeatures)
>> R> txdb <- loadFeatures(system.file("extdata",
>>      "UCSC_knownGene_sample.sqlite", package="GenomicFeatures"))
>> R> xcripts <- transcriptsBy(txdb, 'gene')
>> R> system.time(l1 <- sapply(xcripts, length))
>>   user  system elapsed
>>  2.298   0.003   2.302
>>
>> R> irl <- IRangesList(lapply(xcripts, ranges))
>> R> system.time(l2 <- sapply(irl, length))
>>   user  system elapsed
>>  0.047   0.001   0.049
>>
>> R> identical(l1, l2)
>> [1] TRUE
>>
>> I was curious whether this is known/expected behavior and
>> unavoidable, or ...?
>>
>> Thanks,
>> -steve
>>
>> R> sessionInfo()
>> R version 2.12.0 Under development (unstable) (2010-08-21 r52791)
>> Platform: i386-apple-darwin10.4.0/i386 (32-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] org.Hs.eg.db_2.4.1     RSQLite_0.9-2          DBI_0.2-5
>>  AnnotationDbi_1.11.4
>> [5] Biobase_2.9.0          GenomicFeatures_1.1.11 GenomicRanges_1.1.20
>>  IRanges_1.7.21
>>
>> loaded via a namespace (and not attached):
>> [1] BSgenome_1.17.6    Biostrings_2.17.29 RCurl_1.4-3        XML_3.1-1
>>         biomaRt_2.5.1
>> [6] rtracklayer_1.9.7  tools_2.12.0
>>
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


