[R] Intersecting two matrices

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Wed Jul 31 09:57:51 CEST 2013


I would appreciate it if you would follow the Posting Guide and give a reproducible example and post all messages using plain text.

Try

m1 <- matrix(sample(0:999,2*1057837,TRUE),ncol=2)
m2 <- matrix(sample(0:999,2*951980,TRUE),ncol=2)
df1 <- as.data.frame(m1)
df2 <- as.data.frame(m2)
library(sqldf)
system.time(df3 <- sqldf("SELECT DISTINCT df1.V1, df1.V2 FROM df1 INNER JOIN df2 ON df1.V1=df2.V1 AND df1.V2=df2.V2") )

The speed seems heavily dependent on how many rows are duplicated within the input data frames... so if the range of values is small then the query runs slower. Note also that moving the data from R to the database and back takes time... you may be able to import the data directly from your source data to the database and save some time. Read ?sqldf and ?read.csv.sql examples for more info.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

c char <charlie.hsia.us at gmail.com> wrote:
>I am not familiar with R's sort and sql libs. appreciate if you can
>post a
>code snippet when you got time. Thanks a lot!
>
>
>On Tue, Jul 30, 2013 at 10:36 AM, Jeff Newmiller
><jdnewmil at dcn.davis.ca.us>wrote:
>
>> In that case, you should be looking at a relational inner join,
>perhaps
>> with SQLite (see package sqldf).
>>
>---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go
>Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live
>> Go...
>>                                       Live:   OO#.. Dead: OO#.. 
>Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#. 
>rocks...1k
>>
>---------------------------------------------------------------------------
>> Sent from my phone. Please excuse my brevity.
>>
>> c char <charlie.hsia.us at gmail.com> wrote:
>> >Thanks a lot.
>> >Still looking for some super fast and memory efficient solution, as
>the
>> >matrix I have in real world has billions of rows.
>> >
>> >
>> >On Mon, Jul 29, 2013 at 6:24 PM, William Dunlap <wdunlap at tibco.com>
>> >wrote:
>> >
>> >> I haven't looked at the size-time relationship, but im2 (below) is
>> >faster
>> >> than your
>> >> function on at least one example:
>> >>
>> >> intersectMat <- function(mat1, mat2)
>> >> {
>> >>     #mat1 and mat2 are both deduplicated
>> >>     nr1 <- nrow(mat1)
>> >>     nr2 <- nrow(mat2)
>> >>     mat2[duplicated(rbind(mat1, mat2))[(nr1 + 1):(nr1 + nr2)], ,
>> >> drop=FALSE]
>> >> }
>> >>
>> >> im2 <- function(mat1, mat2)
>> >> {
>> >>     stopifnot(ncol(mat1)==2, ncol(mat1)==ncol(mat2))
>> >>     toChar <- function(twoColMat) paste(sep="\1", twoColMat[,1],
>> >> twoColMat[,2])
>> >>     mat1[match(toChar(mat2), toChar(mat1), nomatch=0), ,
>drop=FALSE]
>> >> }
>> >>
>> >> > m1 <- cbind(1:1e7, rep(1:10, len=1e7))
>> >> > m2 <- cbind(1:1e7, rep(1:20, len=1e7))
>> >> > system.time(r1 <- intersectMat(m1,m2))
>> >>    user  system elapsed
>> >>  430.37    1.96  433.98
>> >> > system.time(r2 <- im2(m1,m2))
>> >>    user  system elapsed
>> >>   27.89    0.20   28.13
>> >> > identical(r1, r2)
>> >> [1] TRUE
>> >> > dim(r1)
>> >> [1] 5000000       2
>> >>
>> >> Bill Dunlap
>> >> Spotfire, TIBCO Software
>> >> wdunlap tibco.com
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: r-help-bounces at r-project.org
>> >[mailto:r-help-bounces at r-project.org]
>> >> On Behalf
>> >> > Of c char
>> >> > Sent: Monday, July 29, 2013 4:04 PM
>> >> > To: r-help at r-project.org
>> >> > Subject: [R] Intersecting two matrices
>> >> >
>> >> > Dear all,
>> >> >
>> >> > I am interested to know a faster matrix intersection package for
>R
>> >> handles
>> >> > intersection of two integer matrices with ncol=2. Currently I am
>> >using my
>> >> > homemade code adapted from a previous thread:
>> >> >
>> >> >
>> >> > intersectMat <- function(mat1, mat2){#mat1 and mat2 are both
>> >> > deduplicated  nr1 <- nrow(mat1)  nr2 <- nrow(mat2)
>> >> > mat2[duplicated(rbind(mat1, mat2))[(nr1 + 1):(nr1 + nr2)], ]}
>> >> >
>> >> >
>> >> > which handles:
>> >> > size A= 10578373
>> >> > size B= 9519807
>> >> > expected intersecting time= 251.2272
>> >> > intersecting for corssing MPRs took 409.602 seconds.
>> >> >
>> >> > scale a little bit worse than linearly but atomic operation is
>not
>> >good.
>> >> > Wonder if a super fast C/C++ extension exists for this task.
>Your
>> >ideas
>> >> are
>> >> > appreciated.
>> >> >
>> >> > Thanks!
>> >> >
>> >> >       [[alternative HTML version deleted]]
>> >> >
>> >> > ______________________________________________
>> >> > R-help at r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-help
>> >> > PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> > and provide commented, minimal, self-contained, reproducible
>code.
>> >>
>> >
>> >       [[alternative HTML version deleted]]
>> >
>> >______________________________________________
>> >R-help at r-project.org mailing list
>> >https://stat.ethz.ch/mailman/listinfo/r-help
>> >PLEASE do read the posting guide
>> >http://www.R-project.org/posting-guide.html
>> >and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>	[[alternative HTML version deleted]]
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list