[R] Fast multiple match function

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Wed Apr 8 00:41:11 CEST 2015


You might find the data.table package helpful. It uses an index sorted with a radix sort and minimizes moving the data around in memory.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On April 7, 2015 1:50:39 PM PDT, Keshav Dhandhania <kshav.91 at gmail.com> wrote:
>Hi all,
>
>Thanks for the responses.
>Herve's example is a good small size example of what I wanted.
>
>> y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)
>> someCoolFunc(-2, y)
>[1] 3 9
>> someCoolFunc(15, y)
>[1] 4 5 8
>
>The requirement is that I want someCoolFunc() to run in O(number of
>matches) time, instead of O(size of y).
>This is because y is big. And I don't know all the queries I want to
>do up-front. And the results of some queries might change the queries
>I want to do in the future.
>
>@David: I hope the above description is more clear.
>@Enrico, Herve: I want both the functionality provided by one function.
>- On repeated calls, fmatch() does give O(1) performance, but it does
>not give all matches.
>- findMatches() gives all matches, but I need to know the entire
>vector x beforehand. I don't have that luxury.
>
>
>I do have something that works now, using split and fmatch (package
>fastmatch). So just posting that in case anyone in the future has the
>same problem.
>> y.unique <- unique(y)
>>
>> # create a map from the unique elements of y to the locations of all
>occurrences of the element
>> y.map <- split(1:length(y), match(y, y.unique))
>>
>> # write a wrapper function that does a look-up on the unique list.
>and then returns all matches using the map.
>> someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] }
>
>
>
>On Tue, 7 Apr 2015 at 13:21 Hervé Pagès <hpages at fredhutch.org> wrote:
>>
>> Hi Keshav,
>>
>> findMatches() in the S4Vectors/IRanges packages (Bioconductor) I
>think
>> does what you want:
>>
>>    library(IRanges)
>>    y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
>>    x <- c(unique(y), 999L)
>>    hits <- findMatches(x, y)
>>
>> Then:
>>
>>    > hits
>>    Hits object with 9 hits and 0 metadata columns:
>>          queryHits subjectHits
>>          <integer>   <integer>
>>      [1]         1           1
>>      [2]         2           2
>>      [3]         3           3
>>      [4]         3           9
>>      [5]         4           4
>>      [6]         4           5
>>      [7]         4           8
>>      [8]         5           6
>>      [9]         6           7
>>      -------
>>      queryLength: 7
>>      subjectLength: 9
>>
>> The Hits object can be turned into a list with:
>>
>>    > as.list(hits)
>>    [[1]]
>>    [1] 1
>>
>>    [[2]]
>>    [1] 2
>>
>>    [[3]]
>>    [1] 3 9
>>
>>    [[4]]
>>    [1] 4 5 8
>>
>>    [[5]]
>>    [1] 6
>>
>>    [[6]]
>>    [1] 7
>>
>>    [[7]]
>>    integer(0)
>>
>> H.
>>
>>  > sessionInfo()
>> R version 3.2.0 beta (2015-04-05 r68151)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> Running under: Ubuntu 14.04.2 LTS
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats4    stats     graphics  grDevices utils    
>datasets
>> [8] methods   base
>>
>> other attached packages:
>> [1] IRanges_2.1.43       S4Vectors_0.5.22     BiocGenerics_0.13.11
>>
>> loaded via a namespace (and not attached):
>> [1] tools_3.2.0
>>
>> On 04/06/2015 01:56 PM, Keshav Dhandhania wrote:
>> > Hi,
>> >
>> > I know that one can find all occurrences of x in a vector v by
>doing
>> >> which(x == v).
>> >
>> > However, if I need to do this again and again, where v is remaining
>the
>> > same, then this is quite inefficient. In my particular case, I need
>to do
>> > this millions of times, and length(v) = 100 million.
>> >
>> > Does anyone have suggestion on how to go about it?
>> > I know of a package called fmatch that does the above for the match
>> > function. But they don't handle multiple matches.
>> >
>> > Thanks
>> >
>> >       [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list