[R] Programming R to avoid loops

Charles C. Berry ccberry at ucsd.edu
Sat Apr 18 19:48:17 CEST 2015


On Sat, 18 Apr 2015, Brant Inman wrote:

> I have two large data frames with the following structure:
>
>> df1
>  id       date test1.result
> 1  a 2009-08-28      1
> 2  a 2009-09-16      1
> 3  b 2008-08-06      0
> 4  c 2012-02-02      1
> 5  c 2010-08-03      1
> 6  c 2012-08-02      0
>
>> df2
>  id       date test2.result
> 1  a 2011-02-03      1
> 2  b 2011-09-27      0
> 3  b 2011-09-01      1
> 4  c 2009-07-16      0
> 5  c 2009-04-15      0
> 6  c 2010-08-10      1
>

> I need to match items in df2 to those in df1 with specific matching 
> criteria. I have written a looped matching algorithm that works, but it 
> is very slow with my large datasets. I am requesting help on making a 
> version of this code that is faster and "vectorized" so to speak.

As I see in your posted code, you match ids exactly, match dates within a 
range, and count the number of positive test results in the second 
data.frame.

For this, the countOverlaps() function of the GenomicRanges package will 
do the trick with suitably defined GRanges objects. Something like:

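library(GenomicRanges)

# Represent dates as integer day counts so they can serve as range coordinates
date1 <- as.integer(as.Date(df1$date, "%Y-%m-%d"))
date2 <- as.integer(as.Date(df2$date, "%Y-%m-%d"))

# Matching window around each df2 date, in days
lagdays <- 30L
predays <- -30L

# One width-1 range per df1 row, with the id as the sequence name
gr1 <- GRanges(seqnames = df1$id, IRanges(start = date1, width = 1), strand = "*")

# One window per df2 row; keep only the positive test2 results
gr2 <- GRanges(seqnames = df2$id,
               IRanges(start = date2 + predays, end = date2 + lagdays),
               strand = "*")[df2$test2.result == 1]

# For each df1 row, count the positive df2 windows with the same id that contain its date
df1$test2.count <- countOverlaps(gr1, gr2)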


For the example data.frames (as rendered by Jim Lemon's code), this yields

> df1
   id       date test1.result test2.count
1  a 2009-08-28            1           0
2  a 2009-09-16            1           0
3  b 2008-08-06            0           0
4  c 2012-02-02            1           0
5  c 2010-08-03            1           1
6  c 2012-08-02            0           0
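
If Bioconductor is not available, the counts above can be cross-checked in base R. What follows is only a minimal sketch: it assumes the same window of 30 days either side used in the code above (the original poster's exact criteria are not shown here) and retypes the example data by hand. The names window, pos2, pairs and hits are illustrative and do not come from the thread.

```r
# Example data as posted in the thread
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = as.Date(c("2009-08-28", "2009-09-16", "2008-08-06",
                   "2012-02-02", "2010-08-03", "2012-08-02")),
  test1.result = c(1, 1, 0, 1, 1, 0)
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = as.Date(c("2011-02-03", "2011-09-27", "2011-09-01",
                   "2009-07-16", "2009-04-15", "2010-08-10")),
  test2.result = c(1, 0, 1, 0, 0, 1)
)

window <- 30  # assumed: days allowed either side of a df2 date

# Keep only the positive test2 results
pos2 <- df2[df2$test2.result == 1, c("id", "date")]
names(pos2)[2] <- "date2"

# Tag each df1 row, then form all same-id pairs with the positive df2 rows
df1$row <- seq_len(nrow(df1))
pairs <- merge(df1[, c("row", "id", "date")], pos2, by = "id")

# A pair is a hit when the two dates are at most `window` days apart
hits <- abs(as.integer(pairs$date - pairs$date2)) <= window

# Count hits per df1 row; rows without any hit get 0
df1$test2.count <- tabulate(pairs$row[hits], nbins = nrow(df1))
df1$row <- NULL
```

Note that merge() builds every same-id pair before filtering, so memory use grows with the product of the per-id row counts; for large data the countOverlaps() approach is likely to scale better.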

The GenomicRanges package is at

http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html

where you will find installation instructions and links to vignettes.

HTH,

Chuck

