[BioC] Non ambiguous DNA runs

Harris A. Jaffee hj at jhu.edu
Sat Apr 14 05:28:07 CEST 2012


This will get you most of the way.

> subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA"
> s = DNAString(subject)
> F = letterFrequencyInSlidingView(s, letters="ACGT", view.width=1)

> Rle(F)
'integer' Rle of length 40 with 6 runs
  Lengths:  2 20  1 13  2  2
  Values :  0  1  0  1  0  1

On Apr 13, 2012, at 7:17 PM, Marcus Davy wrote:

> Hi,
> I am wanting to find the longest run of non ambiguous DNA sequence (A, C,
> G, T only) from an example DNAString, e.g. sequences from sanger reads.
> 
> Is there a simple Biostrings function to do this that I am missing?
> 
> An approach I thought of was to use mask() to identify the ambiguous base
> positions, so they (and their converse) can be visualized as a
> XStringViews object, however in extracting the ambiguous base positions
> consensusMatrix() wouldn't work with the baseOnly option if the input was a
> DNAString.
> 
> 
> ## Ambiguous bases
> 
> names(IUPAC_CODE_MAP)[-(1:4)]
> 
> 
> 
> subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA"
> 
> 
> 
> ## Find ambiguous base positions
> 
> ## pattern matching outside biostrings
> 
> gregexpr("[^A|C|G|T]", subject, perl=TRUE)[[1]]
> 
> 
> 
> ## A DNAString class fails with baseOnly option
> 
> try(which(consensusMatrix(DNAString(subject), baseOnly=TRUE)["other",]==1))
> 
> 
> 
> ## Find ambiguous bases, forced to use DNASetingSet(subject)
> 
> ambGaps <- which(consensusMatrix(DNAStringSet(subject),
> baseOnly=TRUE)["other",]==1)
> 
> 
> 
> masked  <- mask(subject, start=ambGaps, end=ambGaps)
> 
> masked
> 
> 
>  Views on a 40-letter BString subject
> 
> subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA
> 
> views:
> 
>    start end width
> 
> [1]     3  22    20 [CTTGTAAAAACTTACACGAA]
> 
> [2]    24  36    13 [ATCGGCAGAGAAG]
> 
> [3]    39  40     2 [CA]
> 
> 
> ## Return longest masked sequence
> 
> ind     <- which.max(width(masked))
> 
> masked[[ind]]
> 
>  20-letter "BString" instance
> 
> seq: CTTGTAAAAACTTACACGAA
> 
> 
> thanks,
> 
> Marcus
> 
> 
>> sessionInfo()
> 
> R version 2.15.0 (2012-03-30)
> 
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
> 
> 
> locale:
> 
> [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
> 
> 
> attached base packages:
> 
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> 
> other attached packages:
> 
> [1] Biostrings_2.24.1   IRanges_1.14.2      BiocGenerics_0.2.0
> BiocInstaller_1.4.3
> 
> 
> loaded via a namespace (and not attached):
> 
> [1] stats4_2.15.0 tools_2.15.0
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list