[BioC] Biostrings - vcountPattern optimization

Thu Jul 22 22:14:34 CEST 2010

Hi Erik,

On 07/22/2010 10:32 AM, Erik Wright wrote:
> Hi Patrick,
>
> Thanks, this looks promising.  Three possible complications are:
> (1)  The fragments are not all the same width.  Is this possible with Pdict?

Yes, but given requirement (2), you need another solution.

> (2)  I allow a variable number of mismatches based on each individual fragment's width.

So given (1) and (2), you could group your fragments by equal length,
make a PDict object for each group, and use a single number of
mismatches for that group (seems like this number only depends on
the length of the fragment).

> (3)  The fragments sometimes include ambiguity letters (IUPAC extended letters).

Unfortunately ambiguities are supported only in the subject at the
moment. But you could still treat them separately with vcountPattern()
in a loop.

>
> A more accurate example would be:
>
> #### start ####
> fragments<- DNAStringSet(c("ACS","NCCAGAA")) # no indels
> sequence_set<- DNAStringSet(c("ATAGCATACKACCA","GATTACGTACCADADATTACA") # variable widths
> for (i in 1:length(fragments)) {
> 	counts<- vcountPattern(fragments[[i]],
> 		sequence_set,
> 		max.mismatch=floor(length(fragments[[i]])/5)) # variable mis-matches
> 	hits<- length(which(counts>  0))
> 	print(hits)
> }
> #### end ####
>
> Do think it is possible to make this work Pdict for a speed improvement?

With max.mismatch being a fifth of the fragment length that means it
will be between 6 (for 30bp fragments) and 26 (for 130bp fragments).
Unfortunately, that's way too many mismatches PDict()/vcountPDict()
can handle.

Cheers,
H.

>
> Thanks again!,
> Erik
>
>
>
> On Jul 22, 2010, at 12:11 PM, Patrick Aboyoun wrote:
>
>> Erik,
>> Have you tried vcountPDict? It will use an Aho - Corasick matching algorithm (http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm) that is pretty fast, albeit memory intensive.
>>
>> library(Biostrings)
>> fragments<- DNAStringSet(c("ACTG","AAAA"))
>> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
>> pdict<- PDict(fragments)
>> counts<- vcountPDict(pdict, sequence_set)
>>
>>> counts
>>      [,1] [,2]
>> [1,]    0    0
>> [2,]    0    0
>>
>>> sessionInfo()
>> R version 2.12.0 Under development (unstable) (2010-07-18 r52554)
>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] Biostrings_2.17.26 IRanges_1.7.13
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.9.0 tools_2.12.0
>>
>>
>>
>>
>> Patrick
>>
>>
>> On 7/22/10 8:54 AM, Erik Wright wrote:
>>> Hello,
>>>
>>> Lately I have been working on counting sequence fragments in larger sets of sequences.  I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases.  Currently I am using the following method to count the number of "hits":
>>>
>>> #### start ####
>>> library(Biostrings)
>>> fragments<- DNAStringSet(c("ACTG","AAAA"))
>>> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
>>>
>>> for (i in 1:length(fragments)) {
>>> 	counts<- vcountPattern(fragments[[i]],
>>> 		sequence_set,
>>> 		max.mismatch=1)
>>> 	hits<- length(which(counts>   0))
>>> 	print(hits)
>>> }
>>> #### end ####
>>>
>>> This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner?  Does anyone have a suggestion for how I can accomplish the same task more efficiently?
>>>
>>> Thanks!,
>>> Erik
>>>
>>>
>>>
>>>
>>>
>>>> sessionInfo()
>>>>
>>> R version 2.11.0 (2010-04-22)
>>> x86_64-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.16.0 IRanges_1.6.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.8.0
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319