[BioC] Manipulating large DNAStringSet(s)

Thu May 6 02:27:14 CEST 2010

Hi Steve,

Thanks for reporting this! I fixed this problem in IRanges 1.6.1
(release) and 1.7.1 (devel). Both will be available thru biocLite()
in the next 24 hours or so.

library(BSgenome.Hsapiens.UCSC.hg18)
ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100, 
length.out=2141404), width=15)
strands <- sample(c('+', '-'), length(ranges), replace=TRUE)
is.pos <- strands == '+'
v <- Views(Hsapiens$chr1, ranges)

strings <- DNAStringSet(v)

## Takes about 2.5 sec on my laptop:
strings[!is.pos] <- reverseComplement(strings[!is.pos])

Cheers,
H.

Steve Lianoglou wrote:
> Hi all,
> 
> I have a rather large DNAStringSet you could get from, say,
> subselecting many intervals on Hsapiens$chr1 and I want to go through
> the paces of reverseComplement-ing some of the ranges that correspond
> to reads on the anti-sense strand.
> 
> I'm finding that doing this the "naive" way is taking a lot of time
> and memory, but adding some extra code to change my target to a
> character vector (from a DNAStringSet) with some careful subsetting
> and reverseComplement-ing is much faster.
> 
> So I'm wondering if I'm either doing something wrong, or perhaps
> something in DNAStringSet can be optimized to deal with this?
> 
> Like so:
> 
> R> library(BSgenome.Hsapiens.UCSC.hg18)
> R> ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100,
> length.out=2141404), width=15)
> R> strands <- sample(c('+', '-'), length(ranges), replace=TRUE)
> R> is.pos <- strands == '+'
> R> v <- Views(Hsapiens$chr1, ranges) ## Pay no mind to the NNN's in
> the tail here
> 
> I now want to reverseComplement the strings in v where strands == '-'
> 
> R> strings <- DNAStringSet(v)
> R> strings[!is.pos] <- reverseComplement(strings[!is.pos])
> 
> That above command takes a very long time to complete ... it's
> actually been over a few minutes on my machine, so I'm killing it.
> 
> Now let's do it another way, using normal character vectors and stuff
> 
> R> strings <- DNAStringSet(v)
> R> result <- character(length(strings))
> R> result[is.pos] <- as.character(strings[is.pos])
> R> result[!is.pos] <- as.character(reverseComplement(strings[!is.pos]))
> ## and optionally ##
> R> result <- DNAStringSet(result)
> 
> Just curious if there's a better way?
> 
> R> sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-apple-darwin9.8.0
> 
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.16.0
>    Biostrings_2.16.0                  GenomicRanges_1.0.1
> [5] IRanges_1.6.0
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0
> 
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319