[BioC] how to operate on a DNAStringSet object

Sat Mar 23 03:18:39 CET 2013

Hi Chris, Steve,

I tried Steve's 'endoapply(xx, sample)' and was surprised/disappointed
that it's about 10x slower than doing lapply() :-/
The reason for that is that it was actually calling the "endoapply"
method for List objects, which is very inefficient.

I just added an "endoapply" method for XVectorList objects (DNAStringSet
objects are just particular XVectorList objects) that is as fast as
doing lapply():

   > library(Biostrings)
   > library(hgu95av2probe)
   > probes <- DNAStringSet(hgu95av2probe)
   > system.time(probes2 <- endoapply(probes, sample))
      user  system elapsed
   389.928   0.188 390.935

Still not particularly fast but 10x faster than before.

This is in Biostrings 2.27.12 (devel).

A faster (but more complicated) way to go is to use a 3-step approach:
(1) unlist 'probes' (produces a single big DNAString object), (2)
operate on the unlisted object, and (3) relist it:

   (1) Unlist:

         unlisted <- unlist(probes)

   (2) Shuffle the nucleotides in 'unlisted'. One complication is that
       the shuffling must not move nucleotides across the boundaries
       of the probes. This requires a little bit of extra work:

         probes_width <- width(probes)
         shuffling_idx <- IntegerList(lapply(probes_width, sample))
         offsets <- cumsum(c(0L, probes_width[-length(probes_width)]))
         shuffling_idx <- unlist(shuffling_idx + offsets)

         ## Sanity check.
         stopifnot(identical(sort(shuffling_idx), seq_along(unlisted)))

         unlisted2 <- unlisted[shuffling_idx]

   (3) Relist. Unfortunately, there is no "relist' method for DNAString
       objects at the moment (I still need to add one). In the mean
       time we can do:

         probes2b <- as(successiveViews(unlisted2, width(probes)),
                        "DNAStringSet")

With this method, it takes only about 10 sec on my laptop to shuffle
the nucleotides of all the probes in hgu95av2probe. If you are careful
to set the RNG seed to the same value (via set.seed) before you run
endoapply() and the 3-step approach, you'll end up with exactly the same
result i.e. 'identical(as.character(probes2b), as.character(probes2))'
will be TRUE.

Cheers,
H.

On 03/21/2013 02:07 PM, Steve Lianoglou wrote:
> And upon one more second's worth of inspection, the endoapply
> suggestion actually is a for-loop under the covers, so you won't be
> buying time ... I guess the lapply will go faster ...
>
> -steve
>
> On Thu, Mar 21, 2013 at 5:05 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>> Hi,
>>
>> On Thu, Mar 21, 2013 at 4:48 PM, Chris Seidel <seidel at phaget4.org> wrote:
>> [aggressive clipping]
>>
>>> What's odd, is that this actually works:
>>>
>>> DNAStringSet(do.call(c,unlist(myRandomizedseqs)))
>>>
>>> *IF* the sequences are NOT NAMED.
>>
>> This (or similar things) have come up before on the ML, but I don't
>> have time to search for it right now. I posted a suggestion that I use
>> "unname" defensively to sidestep these corner cases. Perhaps that will
>> help you find the thread when searching the archives. In any event,
>> you could do:
>>
>> R> DNAStringSet(do.call(c, unname(unlist(...))))
>>
>> Now that I look at your example, I think the thread I'm talking about
>> might have been slightly different, but I guess this should still work
>> in your case.
>>
>>> How does one operate on the sequences of a DNAStringSet object without
>>> getting a list back, or without a for loop? I'm sure there's some
>>> elegant one-liner that completely escapes me.
>>
>> To randomize the sequences, you could do:
>>
>> R> xx <- DNAStringSet(c("GATACA", "GATCCTAA"))
>> R> endoapply(xx, sample)
>>    A DNAStringSet instance of length 2
>>      width seq
>> [1]     6 ACGATA
>> [2]     8 GTCATAAC
>>
>> Where did that come from, right?
>>
>> Note that a DNAStringSet is an IRanges::Vector, and you'll find lots
>> of things in the IRangesOverview vignette, which at first might seem
>> like to long/detailed to read, but will be worth your time.
>>
>> Not sure how fast this will be on large XStringSet object, though. You
>> may not buy yourself more speed than the for loop, but can't test that
>> right now. Perhaps lapply(DNAstringSet, sample) might be faster, but
>> I'll leave this as an exercise for the reader.
>>
>> HTH,
>> -steve
>>
>> --
>> Steve Lianoglou
>> Defender of The Thesis
>>   | Memorial Sloan-Kettering Cancer Center
>>   | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319