[R] convert for loop into apply()

Charles C. Berry cberry at tajo.ucsd.edu
Sun Aug 3 20:23:43 CEST 2008


If length( levels( a1$cat ) ) is not very large, and the 
structure suggested in the toy example holds in the actual case (probes in 
a2 do not 'overlap' and are in order according to a2$st and a2$en, then 
this might do:

> unsplit(lapply(levels(a1$cat), function(x) {
+   tmp1 <- subset( a1, cat==x )
+   tmp2 <- subset( a2, cat==x)
+   findInterval( tmp1$en, tmp2$st )-findInterval( tmp1$st, tmp2$en+1 )
+ }), a1$cat)
[1] 1 1 2 2 0 1
>

HTH,

Chuck


On Sun, 3 Aug 2008, Bill.Venables at csiro.au wrote:

> Here is a way to speed up your toy example:
>
> _________________
> a1 <- data.frame(id = 1:6,
>                 cat = paste('cat', rep(1:3, c(2,3,1))),
>                 st = c(1, 7, 30, 40, 59, 91),
>                 en = c(5, 25, 39, 55, 70, 120))
>
> a2 <- data.frame(id = paste('probe', 1:8),
>                 cat = paste('cat', rep(1:3, c(2,3,3))),
>                 st = c(1, 9, 20, 38, 53, 70, 80, 95),
>                 en = c(6, 15, 36, 43, 58, 75, 85, 98))
>
> c1 <- outer(a2$st , a1$en , "<=")
> c2 <- outer(a2$en , a1$st , ">=")
> c3 <- outer(a2$cat, a1$cat, "==")
>
> a1$coverage <- colSums(c1*c2*c3)
> __________________
>
> This won't work in one step if a1 has 30000 rows and a2 has 200000,
> unless you memory size is approximately infinite, so you will need a
> loop.  Suppose you can handle 1000 probes at a time.  You might be able
> to get away with something like this:
> __________________
>
> chunk <- 1000  ### make as large as possible!
>
> checkCoverage <- function(a1, a2)
> colSums(outer(a2$st , a1$en , "<=") *
>        outer(a2$en , a1$st , ">=") *
> 	   outer(a2$cat, a1$cat, "=="))
>
> coverage <- numeric(N <- nrow(a2))
> m2 <- 0
> while(m2 < N) {
> 	m1 <- m2 + 1
> 	m2 <- min(m2 + chunk, N)
> 	coverage[m1:m2] <- checkCoverage(a1, a2[m1:m2, ])
> }
> a1$coverage <- coverage
> __________________
>
> (Warning: untested code.)
>
> Failing that, go to C code and be done with it.
>
> BTW, why did you think apply() was going to be useful here?
>
>
> Bill Venables
> CSIRO Laboratories
> PO Box 120, Cleveland, 4163
> AUSTRALIA
> Office Phone (email preferred): +61 7 3826 7251
> Fax (if absolutely necessary):  +61 7 3826 7304
> Mobile:                         +61 4 8819 4402
> Home Phone:                     +61 7 3286 7700
> mailto:Bill.Venables at csiro.au
> http://www.cmis.csiro.au/bill.venables/
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Anh Tran
> Sent: Saturday, 2 August 2008 4:04 PM
> To: rlist
> Subject: [R] convert for loop into apply()
>
> Hi all,I know this topic has came up multiple times, but I've never
> fully
> understand the apply() function.
>
> Anyway, I'm here asking for your help again to convert this loop to
> apply().
>
> I have 2 data frames with the following information: a1 is the fragment
> that
> is need to be covered, a2 is the probes that cover the specific
> fragment.
>
> I need to count the number of probes cover every given fragment (they
> need
> to have the same cat ID to be on the same fragment)
>
> a1<-data.frame(id=c(1:6), cat=c('cat 1','cat 1','cat 2','cat 2','cat
> 2','cat
> 3'), st=c(1,7,30,40,59,91), en=c(5,25,39,55,70,120));
> a2<-data.frame(id=paste('probe',c(1:8)), cat=c('cat 1','cat 1','cat
> 2','cat
> 2','cat 2','cat 3','cat 3','cat 3'), st=c(1,9,20,38,53,70,80,95),
> en=c(6,15,36,43,58,75,85,98));
> a1$coverage<-NULL;
>
> I came up with this for loop (basically, if a probe starts before the
> fragment end, and end after a fragment start, it cover that fragment)
>
> for (i in 1:length(a1$id))
> {
> a1$coverage[i]<-length(a2[a2$st<=a1$en[i]&a2$en>=a1$st[i]&a2$cat==a1$cat
> [i],]$id);
> }
>
>> a1$coverage
> [1] 1 1 2 2 0 1
>
>
> This loop runs awefully slow when I have 200,000 probes and 30,000
> fragments. Is there anyway I can speed this up with apply()?
>
> This is the time for my for loop to scan through the first 20 record of
> my
> dataset:
>   user  system elapsed
>  2.264   0.501   2.770
>
> I think there is room for improvement here. Any idea?
>
> Thanks
> -- 
> Regards,
> Anh Tran
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



More information about the R-help mailing list