[Rd] Very slow subsetting by name

Martin Morgan mtmorgan at fhcrc.org
Thu Jul 15 17:38:53 CEST 2010


On 07/15/2010 01:12 AM, Hervé Pagès wrote:
> Hi,
> 
> I'm subsetting a named vector using character indices.
> My vector of indices (or keys) is 10x longer than the vector
> I'm subsetting. All my keys are distinct and only 10% of them
> are valid (i.e. match a name of the vector being subsetted).
> It is surprisingly slow:
> 
> x1 <- 1:1000
> names(x1) <- paste("a", x1, sep="")
> keys <- sample(c(names(x1), paste("b", 1:9000, sep="")))
>> system.time(y1 <- x1[keys])
>    user  system elapsed
>   0.410   0.000   0.416
> 
> x2 <- 1:2000
> names(x2) <- paste("a", x2, sep="")
> keys <- sample(c(names(x2), paste("b", 1:18000, sep="")))
>> system.time(y2 <- x2[keys])
>    user  system elapsed
>   1.730   0.000   1.736

For what its worth, I think this comes about in the loop starting at
subscript.c:538, which seems to be there to allow [<-,*,character to
extend a vector with new named elements

> x=c(a=1)
> x["b"] = 2
> x
a b
1 2

It seems to be irrelevant (?) for sub-setting per se (though by analogy
one might expect x["c"] to return a length-1 vector NA with name "c",
whereas it returns a vector with names NA).

Seems like the O(n^2) loop through NonNullStringMatch could be replaced
by look-ups into a hash, or an additional argument could be propagated
to stringSubscript to exit early when names aren't required. Or the call
to makeSubscript at subset.c:164 could instead be made to matchE in
unique.c.

Martin

> 
> x3 <- 1:4000
> names(x3) <- paste("a", x3, sep="")
> keys <- sample(c(names(x3), paste("b", 1:36000, sep="")))
>> system.time(y3 <- x3[keys])
>    user  system elapsed
>   8.900   0.010   9.227
> 
> x4 <- 1:8000
> names(x4) <- paste("a", x4, sep="")
> keys <- sample(c(names(x4), paste("b", 1:72000, sep="")))
>> system.time(y4 <- x4[keys])
>    user  system elapsed
> 130.390   0.000 132.316
> 
> And it's apparently worse than quadratic in time!
> 
> I'm wondering why this subsetting by name is so slow since it
> seems it could be implemented with x4[match(keys, names(x4))],
> which is very fast: only 0.012s!
> 
> This is with R-2.11.0 and R-2.12.0.
> 
> Thanks,
> H.
> 
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the R-devel mailing list