[R] Efficiency challenge: MANY subsets

Fri Jan 16 22:15:06 CET 2009

Try this one; it processes a list of 7000 sequences in under 2 seconds:

> sequences <- list(
+   c("M","G","L","W","I","S","F","G","T","P","P","S","Y","T","Y","L","L","I",
+     "M","N","H","K","L","L","L","I","N","N","N","N","L","T","E","V","H","T",
+     "Y","F","N","I","N","I","N","I","D","K","M","Y","I","H","*")
+ )
>
>  indexes <- list(
+   list(
+     c(1,22),c(22,46),c(46, 51),c(1,46),c(22,51),c(1,51)
+   )
+  )
>
> indexes <- rep(indexes,10)
> sequences <- rep(sequences,7000)
>
> system.time({
+ fragments <- lapply(indexes, function(.seq){
+     lapply(.seq, function(.range){
+         .range <- seq(.range[1], .range[2])  # save since we use several times
+         lapply(sequences, '[', .range)
+     })
+ })
+ })
   user  system elapsed
   1.24    0.00    1.26
>
>
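
Note that the transcript above applies every set of ranges to every
sequence, whereas the original question pairs sequences[[i]] with
indexes[[i]]. A minimal sketch of that pairwise variant, assuming both
lists have the same length, could use Map(). The toy data below is
hypothetical and only serves to keep the example short:

```r
# Hypothetical toy data: two sequences, each with its own list of ranges.
sequences <- list(
  c("M", "G", "L", "W", "I", "S"),
  c("N", "H", "K", "L")
)
indexes <- list(
  list(c(1, 3), c(2, 6)),
  list(c(1, 4), c(2, 2))
)

# Map() walks both lists in parallel, so s is sequences[[i]] and
# ranges is indexes[[i]]; each range c(from, to) is expanded once
# with seq.int() and used to subset the sequence.
fragments <- Map(
  function(s, ranges) {
    lapply(ranges, function(r) s[seq.int(r[1], r[2])])
  },
  sequences, indexes
)
```

Here fragments[[i]][[j]] holds the j-th fragment of the i-th sequence.
Unlike the sapply() call in the original loop, lapply() always returns a
list, so fragments of equal length are not collapsed into a matrix.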

On Fri, Jan 16, 2009 at 3:16 PM, Johannes Graumann
<johannes_graumann at web.de> wrote:
> Thanks. Very elegant, but doesn't solve the problem of the outer "for" loop,
> since I now would rewrite the code like so:
>
> fragments <- list()
> for(iN in seq(length(sequences))){
>  cat(paste(iN,"\n"))
>  fragments[[iN]] <-
>    lapply(indexes[[iN]], function(g) sequences[[iN]][do.call(seq, as.list(g))])
> }
>
> still very slow for length(sequences) ~ 7000.
>
> Joh
>
> On Friday 16 January 2009 14:23:47 Henrique Dallazuanna wrote:
>> Try this:
>>
>> lapply(indexes[[1]], function(g)sequences[[1]][do.call(seq, as.list(g))])
>>
>> On Fri, Jan 16, 2009 at 11:06 AM, Johannes Graumann
>> <johannes_graumann at web.de> wrote:
>> > Hello,
>> >
>> > I have a list of character vectors like this:
>> >
>> > sequences <- list(
>> >   c("M","G","L","W","I","S","F","G","T","P","P","S","Y","T","Y","L","L","I",
>> >     "M","N","H","K","L","L","L","I","N","N","N","N","L","T","E","V","H","T",
>> >     "Y","F","N","I","N","I","N","I","D","K","M","Y","I","H","*")
>> > )
>> >
>> > and another list of subset ranges like this:
>> >
>> > indexes <- list(
>> >  list(
>> >    c(1,22),c(22,46),c(46, 51),c(1,46),c(22,51),c(1,51)
>> >  )
>> > )
>> >
>> > What I now want to do is to subset each entry in "sequences"
>> > (sequences[[1]]) with all ranges in the corresponding low level list in
>> > "indexes" (indexes[[1]]). Here is what I came up with.
>> >
>> > fragments <- list()
>> > for(iN in seq(length(sequences))){
>> >  cat(paste(iN,"\n"))
>> >  tmpFragments <- sapply(
>> >    indexes[[iN]],
>> >    function(x){
>> >      sequences[[iN]][seq.int(x[1],x[2])]
>> >    }
>> >  )
>> >  fragments[[iN]] <- tmpFragments
>> > }
>> >
>> > This works fine, but "sequences" contains thousands of entries and the
>> > corresponding "indexes" are sometimes hundreds of ranges long, so this
>> > whole process is EXTREMELY inefficient.
>> >
>> > Does somebody out there take the challenge and show me a way to speed
>> > this up?
>> >
>> > Thanks for any hints,
>> >
>> > Joh
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?