[BioC] GRanges performance issue

Hervé Pagès hpages at fhcrc.org
Fri Jul 8 10:29:24 CEST 2011


Hi Arne,

On 11-07-07 08:45 AM, Mueller, Arne wrote:
> Hello,
>
> I realized there's a massive performance difference between subsetting GRanges objects by name with "[" and subsetting them with the subset method.
>
> Example:
>
>> length(mm9.tiled)
> [1] 5309835
>> n = names(mm9.tiled)
>> rn = sample(n, 1000)
>> system.time(tmp<- subset(mm9.tiled, names(mm9.tiled) %in% rn))
>     user  system elapsed
>    1.610   0.131   1.741
>> system.time(tmp<- mm9.tiled[rn])
>     user  system elapsed
>   72.793   0.167  72.976

Note that subsetting with

   mm9.tiled[rn]    # A

is not the same as subsetting with

   mm9.tiled[names(mm9.tiled) %in% rn]    # B

because the latter does not reorder the elements.
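
The ordering difference between A and B can be seen on a toy example. This is only a sketch that uses a plain named integer vector as a stand-in for a GRanges (an assumption made so the snippet runs without Bioconductor; the names x, by_name and by_mask are made up for illustration):

```r
# Toy stand-in for a named GRanges: a plain named integer vector.
x  <- c(a = 1L, b = 2L, c = 3L, d = 4L, e = 5L)
rn <- c("d", "b")                 # requested names, deliberately not in x's order

by_name <- x[rn]                  # like A: result follows the order of rn
by_mask <- x[names(x) %in% rn]    # like B: result keeps the order of x

names(by_name)   # "d" "b"
names(by_mask)   # "b" "d"
```

The two forms also differ when rn contains a name more than once: A returns that element once per occurrence, whereas the logical mask in B can select each element at most once.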

An equivalent to A would rather be

   mm9.tiled[match(rn, names(mm9.tiled))]    # C

and yes, C is also much faster than A (50x faster on my machine
for a GRanges with 1 million elements). I agree that this can hardly
be justified: I don't see any reason why A couldn't be made as fast
as C (or almost). I believe the culprit is the call to
IRanges:::.bracket.Index() in the "[" method for "GRanges"
objects. I'll try to come up with a fix.
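
Until then, form C can be wrapped in a small helper. The following is a minimal sketch, not part of IRanges or GenomicRanges: subset_by_name is a hypothetical name, and the check for missing names is an added assumption about what one would want. It is shown on a plain named vector so it runs without Bioconductor, but it only relies on names() and integer "[", so it should apply to a named GRanges in the same way.

```r
# Hypothetical helper (not a Bioconductor function): subset by name via
# match(), i.e. form C, returning elements in the order of 'nms'.
subset_by_name <- function(obj, nms) {
  idx <- match(nms, names(obj))      # position of the first match for each name
  if (any(is.na(idx)))
    stop("some names were not found in 'obj'")
  obj[idx]
}

# Toy data standing in for the tiled genome object.
set.seed(1)
n  <- 1e5
x  <- seq_len(n)
names(x) <- paste0("tile", seq_len(n))
rn <- sample(names(x), 1000)

res <- subset_by_name(x, rn)
identical(res, x[rn])    # same elements, same order as subsetting by name
```

Note that match() returns only the first position for each name, so this assumes the names are unique.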

Thanks for reporting this.
H.


>>
>> sessionInfo()
> R version 2.14.0 Under development (unstable) (2011-06-01 r56028)
> Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
>
> other attached packages:
> [1] GenomicRanges_1.5.12 IRanges_1.11.10
>
> loaded via a namespace (and not attached):
> [1] tools_2.14.0
>
>
> Is this a known (wanted?) behavior?
>
>     Regards,
>
>     Arne
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


