[R] How should I improve the following R code?

jim holtman jholtman at gmail.com
Tue Jan 8 01:19:46 CET 2008


One thing to do is to run Rprof() on your script so that you can
determine where the time is being spent.  My guess is that most of the
time is in the wtd.quantile function.  If your Counts don't get too
big, another way is to expand each table with rep.int() and use
'quantile' directly:

> Index <- c(0,1,7,30)
> Count <- c(234,120,11,1)
> rep.int(Index, times=Count)
  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [33]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [65]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [97]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[129]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[161]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[193]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[225]  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[257]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[289]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[321]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[353]  1  1  7  7  7  7  7  7  7  7  7  7  7 30
> quantile(rep.int(Index, times=Count), probs=c(0, .2, .5, .8, 1))
  0%  20%  50%  80% 100%
   0    0    0    1   30
>
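
The same idea can be applied to every row of the table at once.  The
following is only a sketch, not tested against the original data: the
inline csv_text stands in for data.csv, the use of mapply() is an
assumption about how one might organize it, and the column names
q0..q10 are taken from the original post.  It builds all five quantile
columns as one matrix and attaches them in a single step, rather than
assigning into the data frame inside a loop.

```r
# Hypothetical two-row table in the poster's CSV layout (time,index,count).
csv_text <- "time,index,count
0,0:1:7:30,234:120:11:1
1,0:2:3:19,199:110:87:9"
stbl <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Split the colon-separated fields into numeric vectors, one per row.
index <- lapply(strsplit(stbl$index, ":", fixed = TRUE), as.numeric)
count <- lapply(strsplit(stbl$count, ":", fixed = TRUE), as.numeric)

probs <- c(0, 0.2, 0.5, 0.8, 1)

# For each row, expand the counts with rep.int() and take plain quantiles.
# mapply() returns a 5 x N matrix here; transpose to get one row per minute.
q <- t(mapply(function(idx, cnt) {
  quantile(rep.int(idx, times = cnt), probs = probs, names = FALSE)
}, index, count))
colnames(q) <- c("q0", "q2", "q5", "q8", "q10")

# Attach all five columns in one step instead of growing stbl in a loop.
stbl <- cbind(stbl, q)
print(stbl)
```

This is only practical while the sum of the counts in a row stays small
enough to hold the expanded vector in memory.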

Try both solutions and see which is faster.
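
A rough way to make that comparison is system.time() for the total and
Rprof() for a breakdown by function.  The sketch below times only the
rep.int()/quantile() approach on made-up data; the workload size, the
helper name expand_quantiles and the random tables are all hypothetical,
and the wtd.quantile loop would have to be timed the same way on the
real data to get a meaningful answer.

```r
# Hypothetical workload: 3000 random tables of four index/count pairs each.
set.seed(1)
n <- 3000
index <- replicate(n, sort(sample(0:50, 4)), simplify = FALSE)
count <- replicate(n, sample(1:300, 4), simplify = FALSE)
probs <- c(0, 0.2, 0.5, 0.8, 1)

# Hypothetical helper: expand each table and take unweighted quantiles.
expand_quantiles <- function(index, count, probs) {
  t(mapply(function(idx, cnt) {
    quantile(rep.int(idx, times = cnt), probs = probs, names = FALSE)
  }, index, count))
}

# Wall-clock timing of one approach; repeat for the other to compare.
elapsed <- system.time(res <- expand_quantiles(index, count, probs))
print(elapsed)

# Sampling profiler: shows which functions the time is spent in.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)
for (k in 1:5) res <- expand_quantiles(index, count, probs)
Rprof(NULL)
print(head(summaryRprof(prof_file)$by.self))
```

The by.self table lists the functions with the most time spent in their
own code, which is usually the quickest pointer to the bottleneck.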

On Jan 7, 2008 6:49 PM, Seung Jun <seungwjun at gmail.com> wrote:
> I'm looking for a way to improve code that's proven to be inefficient.
>
> Suppose that a data source generates the following table every minute:
>
>  Index  Count
>  ------------
>  0      234
>  1      120
>  7      11
>  30     1
>
> I save the tables in the following CSV format:
>
>  time,index,count
>  0,0:1:7:30,234:120:11:1
>  1,0:2:3:19,199:110:87:9
>
> That is, each line represents a table, and I have N lines for N minutes of
> data collection.
>
> Now, I wrote the following code to get quantiles for each time period:
>
>  library(Hmisc)
>  # keep index/count as character so that strsplit() accepts them
>  stbl  <- read.csv("data.csv", stringsAsFactors = FALSE)
>  index <- lapply(strsplit(stbl$index, ":", fixed = TRUE), as.numeric)
>  count <- lapply(strsplit(stbl$count, ":", fixed = TRUE), as.numeric)
>  len   <- length(index)
>  for (i in seq_len(len)) {
>    # weighted quantiles of the index values, using the counts as weights
>    v <- wtd.quantile(index[[i]], count[[i]], c(0, 0.2, 0.5, 0.8, 1))
>    stbl$q0[i] <- v[1]
>    stbl$q2[i] <- v[2]
>    stbl$q5[i] <- v[3]
>    stbl$q8[i] <- v[4]
>    stbl$q10[i] <- v[5]
>  }
>
> It works fine for a small N, but it quickly becomes inefficient as N
> grows.  The for-loop takes too long.  How could I improve the code or
> data representation so that it runs fast?
>
> Thanks,
> Seung



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?



