[BioC] Biostrings::matchPattern,extract sequences

Zhu, Lihua (Julie) Julie.Zhu at umassmed.edu
Wed Sep 18 01:51:29 CEST 2013


Cool. Thanks, Herve!

Is there a method to extract the mismatch position for all the matches?
Right now, I am using pairwiseAlignment for each matched subsequence.
However, this could become very slow when the number of matched sequences
gets large.
        

Best regards,

Julie


On 9/17/13 6:52 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:

> Hi Julie,
> 
> Sorry for the late answer.
> 
> On 09/11/2013 02:43 PM, Zhu, Lihua (Julie) wrote:
>> Herve,
>> 
>> Is there a more elegant way to get all matched reference sequences
>> besides using subject(matches)[start:end], e.g, subject(matches)[3010894
>> : 3010916] for each matched record? Thanks!
>>    23-letter "DNAString" instance
>> seq: GCGGAGCCTGAGGCAGAAACCTC
>> 
>> matches is the object returned by matchPattern function call.
>> 
>> matches
>>    Views on a 197195432-letter DNAString subject
>> subject:
>> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...CCTATTCT
>> AGTAAAAATTTTATTTCATTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC
>> views:
>>             start       end width
>>     [1]   3010894   3010916    23 [GCGGAGCCTGAGGCAGAAACCTC]
>>     [2]   3299593   3299615    23 [GCTGTGGCTGAGATGAATACTGG]
>>     [3]   3637189   3637211    23 [CCTGCTTCTGCCTCTGCCACCGG]
>>     [4]   3660740   3660762    23 [GCTGTTGCTGCCGCTGTTGGTGG]
>>     [5]   3661169   3661191    23 [GCTGCCCCGGCCGCCGCCGCCCG]
>>     [6]   3661721   3661743    23 [CCCGCGGCTGCAGCACGAGCCGC]
>> ....
> 
> Just turn this into a DNAStringSet object with a coercion:
> 
>    as(matches, "DNAStringSet")
> 
> or by calling the DNAStringSet() constructor on it:
> 
>    DNAStringSet(matches)
> 
> Cheers,
> H.
> 
>> 
>> Best regards,
>> 
>> Julie
>> 
>>   sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>> 
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>> 
>> attached base packages:
>> [1] grid      parallel  stats     graphics  grDevices utils     datasets
>>   methods   base
>> 
>> other attached packages:
>>   [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19  BiocInstaller_1.10.3
>>                 REDseq_1.6.0
>>   [4] ChIPpeakAnno_2.8.0                  GenomicFeatures_1.12.3
>>               limma_3.16.7
>>   [7] org.Hs.eg.db_2.9.0                  GO.db_2.9.0
>>                          RSQLite_0.11.4
>> [10] DBI_0.2-7                           AnnotationDbi_1.22.6
>>                 BSgenome.Ecoli.NCBI.20080805_1.3.17
>> [13] biomaRt_2.16.0                      VennDiagram_1.6.5
>>                    multtest_2.16.0
>> [16] Biobase_2.20.1
>>                       BSgenome.Celegans.UCSC.ce2_1.3.19   BSgenome_1.28.0
>> [19] ShortRead_1.18.0                    latticeExtra_0.6-26
>>                  RColorBrewer_1.0-5
>> [22] Rsamtools_1.12.4                    lattice_0.20-23
>>                      Biostrings_2.28.0
>> [25] GenomicRanges_1.12.5                IRanges_1.18.3
>>                       BiocGenerics_0.6.0
>> 
>> loaded via a namespace (and not attached):
>>   [1] bitops_1.0-6       hwriter_1.3        MASS_7.3-29
>>         RCurl_1.95-4.1     rtracklayer_1.20.4 splines_3.0.1
>>       stats4_3.0.1
>>   [8] survival_2.37-4    tools_3.0.1        XML_3.95-0.2
>>        zlibbioc_1.6.0



More information about the Bioconductor mailing list