[BioC] BioStrings questions

Sat Oct 18 18:56:20 CEST 2008

Hi Herve,
many thanks for the informations.
After digging into the man pages of the Biostrings package I found the 
countPatter function.
It is nice because it returns me the simple presence of the pattern 
allowing a certain number of mismatches

score.tmp <- countPattern(PSR.seq , DNAString(as.character(exs.seq[j])), 
max.mismatch=3)

I found it very useful if a limited number of sequences are queried, 
however, when I work with the full refseq fasta file blast is much faster.
Cheers
Raffaele

hpages at fhcrc.org wrote:
> Hi Raffaele,
>
> The PSR do not need to be of equal length with PDict/matchPDit.
> An important limitation though is that indels are not supported
> for now (but will be in the very near future).
> Another limitation is that you need to define a Trusted Band on
> your PDict object i.e. a vertical band of constant width where
> mismatches are not allowed. For example:
>
>   psrPDict <- PDict(PSRDNAStringSet, start=5, end=15)
>
> will set a Trusted Band of width 11 that covers nucleotides
> 5 to 15 of each PSR.
> Then use the 'max.mismatch' arg of matchPDict() to control the max
> number of mismatches that you want to allow *outside* the Trusted
> Band for each PSR (the nucleotides that are *inside* must match
> exactly).
> Note that even if you want to perform exact matching, you need
> to define a Trusted Band if your PSR are not of equal length.
> It's important to try to define the widest possible Trusted Band
> since the performance of the underlying algo depends directly
> on how wide it is. Typically the performance will be good if the
> width is >= 12 and will start to drop significantly for smaller
> values.
> Of course if you want to use you PDict object for exact matching
> only, then you want to set the Trusted Band to the max possible
> width i.e. to min(width(PSRDNAStringSet)).
>
> Hope this helps,
> H.
>
>
> Quoting rcaloger <raffaele.calogero at gmail.com>:
>
>> Dear Patrick,
>> many thanks for the quick answer. Unfortunately PSR can be located in
>> any position of the refseq and the sequence length can be very
>> different. Therefore, I cannot apply your suggestion.
>> Cheers
>> Raffaele
>>
>> Patrick Aboyoun wrote:
>>> Raffaele,
>>> The pairwiseAlignment function uses an O(nm) {with n and m being  
>>> the length of the two sequences being aligned} dynamic programming  
>>> algorithm that is designed to find an optimal alignment and as you  
>>> have discovered isn't intended for use with a long reference  
>>> sequence. Do your PSR sequences map nearly exactly to a location on 
>>>  your reference sequence and are these sequences of equal length? 
>>> If  so, see the matchPDict function. It matches a pattern 
>>> dictionary  consisting of equal length fragments to a reference 
>>> sequence. The  pseudo code looks something like:
>>>
>>> psrPDict <- PDict(PSRDNAStringSet)
>>> matchPDict(psrPDict, refseq)
>>>
>>> To answer your second question, the append function should get you  
>>> what you want:
>>>
>>>> append(DNAStringSet(c("AAA", "GA")), DNAStringSet(c("ACTG", 
>>>> "TTTACCC")))
>>> A DNAStringSet instance of length 4
>>>   width seq
>>> [1]     3 AAA
>>> [2]     2 GA
>>> [3]     4 ACTG
>>> [4]     7 TTTACCC
>>>
>>>
>>> Patrick
>>>
>>>
>>> rcaloger wrote:
>>>> Hi,
>>>> In my onechannelGUI package I am developing a section related to  
>>>> Affymetrix exon array analysis, creating few functions that allow  
>>>> the association of exon-level Probe Selection Region (PSR) to refseq
>>>>
>>>> 1st question:
>>>> I have implemented a function that blast a list of PSR sequences  
>>>> over all refseq.
>>>> However, I would like to know if there is any way of doing  
>>>> something similar using the Biostring package.
>>>> I tried the pairwiseAlignment function but it is quite slow  
>>>> compared to blast.
>>>>
>>>> 2nd question:
>>>> there is any way of merging two DNAStringSets ?
>>>>
>>>> Cheers
>>>> Raffaele
>>>>
>>>>
>>>>
>>>
>>
>>
>> -- 
>>
>> ----------------------------------------
>> Prof. Raffaele A. Calogero
>> Bioinformatics and Genomics Unit
>> Dipartimento di Scienze Cliniche e Biologiche
>> c/o Az. Ospedaliera S. Luigi
>> Regione Gonzole 10, Orbassano
>> 10043 Torino
>> tel.   ++39 0116705417
>> Lab.   ++39 0116705408
>> Fax    ++39 0119038639
>> Mobile ++39 3333827080
>> email: raffaele.calogero at unito.it
>>       raffaele[dot]calogero[at]gmail[dot]com
>> www:   http://www.bioinformatica.unito.it
>> Info: http://publicationslist.org/raffaele.calogero
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>

-- 

----------------------------------------
Prof. Raffaele A. Calogero
Bioinformatics and Genomics Unit
Dipartimento di Scienze Cliniche e Biologiche
c/o Az. Ospedaliera S. Luigi
Regione Gonzole 10, Orbassano
10043 Torino
tel.   ++39 0116705417
Lab.   ++39 0116705408
Fax    ++39 0119038639
Mobile ++39 3333827080
email: raffaele.calogero at unito.it
       raffaele[dot]calogero[at]gmail[dot]com
www:   http://www.bioinformatica.unito.it
Info: http://publicationslist.org/raffaele.calogero