[BioC] Manipulating large DNAStringSet(s)

Steve Lianoglou mailinglist.honeypot at gmail.com
Wed May 5 17:16:20 CEST 2010


Hi all,

I have a rather large DNAStringSet you could get from, say,
subselecting many intervals on Hsapiens$chr1 and I want to go through
the paces of reverseComplement-ing some of the ranges that correspond
to reads on the anti-sense strand.

I'm finding that doing this the "naive" way is taking a lot of time
and memory, but adding some extra code to change my target to a
character vector (from a DNAStringSet) with some careful subsetting
and reverseComplement-ing is much faster.

So I'm wondering if I'm either doing something wrong, or perhaps
something in DNAStringSet can be optimized to deal with this?

Like so:

R> library(BSgenome.Hsapiens.UCSC.hg18)
R> ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100,
length.out=2141404), width=15)
R> strands <- sample(c('+', '-'), length(ranges), replace=TRUE)
R> is.pos <- strands == '+'
R> v <- Views(Hsapiens$chr1, ranges) ## Pay no mind to the NNN's in
the tail here

I now want to reverseComplement the strings in v where strands == '-'

R> strings <- DNAStringSet(v)
R> strings[!is.pos] <- reverseComplement(strings[!is.pos])

That above command takes a very long time to complete ... it's
actually been over a few minutes on my machine, so I'm killing it.

Now let's do it another way, using normal character vectors and stuff

R> strings <- DNAStringSet(v)
R> result <- character(length(strings))
R> result[is.pos] <- as.character(strings[is.pos])
R> result[!is.pos] <- as.character(reverseComplement(strings[!is.pos]))
## and optionally ##
R> result <- DNAStringSet(result)

Just curious if there's a better way?

R> sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.16.0
   Biostrings_2.16.0                  GenomicRanges_1.0.1
[5] IRanges_1.6.0

loaded via a namespace (and not attached):
[1] Biobase_2.8.0


-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



More information about the Bioconductor mailing list