[BioC] BioStrings questions

Fri Oct 17 18:43:22 CEST 2008

Hi Raffaele,

The PSR do not need to be of equal length with PDict/matchPDit.
An important limitation though is that indels are not supported
for now (but will be in the very near future).
Another limitation is that you need to define a Trusted Band on
your PDict object i.e. a vertical band of constant width where
mismatches are not allowed. For example:

   psrPDict <- PDict(PSRDNAStringSet, start=5, end=15)

will set a Trusted Band of width 11 that covers nucleotides
5 to 15 of each PSR.
Then use the 'max.mismatch' arg of matchPDict() to control the max
number of mismatches that you want to allow *outside* the Trusted
Band for each PSR (the nucleotides that are *inside* must match
exactly).
Note that even if you want to perform exact matching, you need
to define a Trusted Band if your PSR are not of equal length.
It's important to try to define the widest possible Trusted Band
since the performance of the underlying algo depends directly
on how wide it is. Typically the performance will be good if the
width is >= 12 and will start to drop significantly for smaller
values.
Of course if you want to use you PDict object for exact matching
only, then you want to set the Trusted Band to the max possible
width i.e. to min(width(PSRDNAStringSet)).

Hope this helps,
H.

Quoting rcaloger <raffaele.calogero at gmail.com>:

> Dear Patrick,
> many thanks for the quick answer. Unfortunately PSR can be located in
> any position of the refseq and the sequence length can be very
> different. Therefore, I cannot apply your suggestion.
> Cheers
> Raffaele
>
> Patrick Aboyoun wrote:
>> Raffaele,
>> The pairwiseAlignment function uses an O(nm) {with n and m being   
>> the length of the two sequences being aligned} dynamic programming   
>> algorithm that is designed to find an optimal alignment and as you   
>> have discovered isn't intended for use with a long reference   
>> sequence. Do your PSR sequences map nearly exactly to a location on  
>>  your reference sequence and are these sequences of equal length?  
>> If  so, see the matchPDict function. It matches a pattern  
>> dictionary  consisting of equal length fragments to a reference  
>> sequence. The  pseudo code looks something like:
>>
>> psrPDict <- PDict(PSRDNAStringSet)
>> matchPDict(psrPDict, refseq)
>>
>> To answer your second question, the append function should get you   
>> what you want:
>>
>>> append(DNAStringSet(c("AAA", "GA")), DNAStringSet(c("ACTG", "TTTACCC")))
>> A DNAStringSet instance of length 4
>>   width seq
>> [1]     3 AAA
>> [2]     2 GA
>> [3]     4 ACTG
>> [4]     7 TTTACCC
>>
>>
>> Patrick
>>
>>
>> rcaloger wrote:
>>> Hi,
>>> In my onechannelGUI package I am developing a section related to   
>>> Affymetrix exon array analysis, creating few functions that allow   
>>> the association of exon-level Probe Selection Region (PSR) to refseq
>>>
>>> 1st question:
>>> I have implemented a function that blast a list of PSR sequences   
>>> over all refseq.
>>> However, I would like to know if there is any way of doing   
>>> something similar using the Biostring package.
>>> I tried the pairwiseAlignment function but it is quite slow   
>>> compared to blast.
>>>
>>> 2nd question:
>>> there is any way of merging two DNAStringSets ?
>>>
>>> Cheers
>>> Raffaele
>>>
>>>
>>>
>>
>
>
> -- 
>
> ----------------------------------------
> Prof. Raffaele A. Calogero
> Bioinformatics and Genomics Unit
> Dipartimento di Scienze Cliniche e Biologiche
> c/o Az. Ospedaliera S. Luigi
> Regione Gonzole 10, Orbassano
> 10043 Torino
> tel.   ++39 0116705417
> Lab.   ++39 0116705408
> Fax    ++39 0119038639
> Mobile ++39 3333827080
> email: raffaele.calogero at unito.it
>       raffaele[dot]calogero[at]gmail[dot]com
> www:   http://www.bioinformatica.unito.it
> Info: http://publicationslist.org/raffaele.calogero
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor