[BioC] Biostrings - vcountPattern optimization

Steve Lianoglou mailinglist.honeypot at gmail.com
Thu Jul 22 18:19:21 CEST 2010


Hi,

On Thu, Jul 22, 2010 at 11:54 AM, Erik Wright <eswright at wisc.edu> wrote:
> Hello,
>
> Lately I have been working on counting sequence fragments in larger sets of sequences.  I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases.  Currently I am using the following method to count the number of "hits":

Would using bowtie as an intermediary be an option?

For instance, you could consider:

(i) making a bowtie-index out of your 1200-1600 bp "references"
(ii) aligning your 30-130bp fragments against it and writing the output
in SAM format (give each read a unique id so you can hunt for it later)
(iii) convert SAM -> indexed BAM
(iv) process the BAM file with Rsamtools. If all you want is a count,
you could perhaps simply do a `table()` on the sequence IDs of the
alignments. Of course, now that the sequences are aligned, the data is
in "good shape" for other types of analyses as well (whatever it is
that you're doing).
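
A rough sketch of what step (iv) might look like, under a few assumptions:
the aligner was run so that it reports every valid alignment (not just one
per fragment), each fragment has a unique read id, and the BAM file from
step (iii) is sorted and indexed. The function names `hits_per_fragment`
and `count_bam_hits`, and the `bam_file` argument, are made up for
illustration; `scanBam()` and `ScanBamParam()` are the Rsamtools calls.
Since a fragment can align more than once to the same reference, the
sketch counts distinct (fragment, reference) pairs rather than raw
alignments, which is closer to the `counts > 0` logic in your loop.

```r
# Count, for each fragment, how many distinct reference sequences it hit.
# `qname`: fragment (read) ids, `rname`: reference names, one per alignment.
hits_per_fragment <- function(qname, rname) {
  # Unaligned records carry an NA reference name; drop them.
  keep <- !is.na(qname) & !is.na(rname)
  pairs <- unique(data.frame(qname = as.character(qname[keep]),
                             rname = as.character(rname[keep]),
                             stringsAsFactors = FALSE))
  # One row per (fragment, reference) pair, so table() counts references.
  table(pairs$qname)
}

# Pull only the two needed fields out of the BAM file, then count.
# `bam_file` is a hypothetical path to the BAM produced in step (iii).
count_bam_hits <- function(bam_file) {
  param <- Rsamtools::ScanBamParam(what = c("qname", "rname"))
  aln <- Rsamtools::scanBam(bam_file, param = param)[[1]]
  hits_per_fragment(aln$qname, aln$rname)
}
```

Something like `count_bam_hits("fragments.bam")` (file name hypothetical)
would then return a table indexed by fragment id. Fragments with no
alignment at all do not appear in the table, so they would need to be
filled in as zeros if you want one number per fragment.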

> #### start ####
> library(Biostrings)
> fragments <- DNAStringSet(c("ACTG","AAAA"))
> sequence_set <- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
>
> for (i in 1:length(fragments)) {
>        counts <- vcountPattern(fragments[[i]],
>                sequence_set,
>                max.mismatch=1)
>        hits <- length(which(counts > 0))
>        print(hits)
> }
> #### end ####
>
> This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner?  Does anyone have a suggestion for how I can accomplish the same task more efficiently?

I don't really have any suggestions to make the above R code run
faster ... sorry.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


