[Rd] subsetting by name is very slow when subscript contains a lot of "invalid" names

Wed May 8 22:53:57 CEST 2013

Hi,

Not sure why, but subsetting by name is *very* slow when the character
vector used as subscript contains a lot of "invalid" names (i.e. names
that are not found in names(x)):

   x <- c(A=10L, B=20L, C=30L)
   subscript <- c(LETTERS[1:3], sprintf("ID%05d", 1:150000))

   > system.time(y1 <- x[subscript])
      user  system elapsed
   111.991   0.000 112.230

Since subsetting by name is basically equivalent to

   i <- match(subscript, names(x))
   x[i]

it's quite surprising that the former is more than 10 thousand times
slower than the latter:

   > system.time({i <- match(subscript, names(x)); y2 <- x[i]})
      user  system elapsed
     0.008   0.000   0.007

   > identical(y2, y1)
   [1] TRUE
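
For anyone who wants to reproduce the comparison without waiting two minutes, here is a minimal sketch of the match()-based workaround wrapped in a function. The name subset_by_name is hypothetical (it is not part of base R), and the subscript is deliberately much shorter than the one above. Note that the equivalence is only approximate: a character subscript of "" or NA never matches any name, whereas match() can match such entries, so this sketch assumes names(x) contains no empty or NA names.

```r
# Hypothetical helper (not part of base R): subset a named vector by name
# via an explicit match() call instead of passing the character vector to `[`.
subset_by_name <- function(x, subscript) {
    # Position of each requested name in names(x); NA when the name is absent.
    i <- match(subscript, names(x))
    # Integer subsetting: NA positions yield NA elements.
    x[i]
}

x <- c(A=10L, B=20L, C=30L)

# Same shape as the subscript in the report, but with far fewer "invalid"
# names so that the direct x[subscript] call stays fast.
subscript <- c(LETTERS[1:3], sprintf("ID%05d", 1:2000))

y1 <- x[subscript]                  # direct subsetting by name
y2 <- subset_by_name(x, subscript)  # match()-based equivalent

print(system.time(x[subscript]))
print(system.time(subset_by_name(x, subscript)))
```

Increasing the 2000 above in steps should show how the gap between the two timings grows with the number of names that are not in names(x).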

Thanks,
H.

PS: This issue was already reported in 2010, along with a proposed fix
by Martin Morgan:
   https://stat.ethz.ch/pipermail/r-devel/2010-July/057945.html

 > sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicRanges_1.13.8 IRanges_1.19.3       BiocGenerics_0.7.2

loaded via a namespace (and not attached):
[1] stats4_3.0.0 tools_3.0.0

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319