[BioC] fast iterator over DNAString's?

Steve Lianoglou mailinglist.honeypot at gmail.com
Thu Mar 11 01:52:27 CET 2010


Hi Paul,

On Wed, Mar 10, 2010 at 7:30 PM, Paul Shannon
<pshannon at systemsbiology.org> wrote:
> I wish to trim a variable length sequence from the end of many thousands of DNAStrings in a DNAStringSet.
>
> The sequence to be trimmed is any recognizable chunk of a solexa short read adapter, which ends up on the end of, for example, 22nt miRNAs.  The adapter chunk might be found in the middle of a 35 base read, or it might be closer to the end.  In every case, I want to delete every base from the start of the adapter chunk to the end of the read.
>
> I imagine there might be a BString operation equivalent to sed.  Sed could be used like this:
>
>  echo 'CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT' | sed 's/TCGTATGCCGTC.*$//'      --> CGAAGCGGGATGATCTATC
>
> (where TCGTATGCCGTC is only part of the 21-base adapter, but is probably a long enough portion to be representative)
>
> Any way to do this with BStrings and friends?

There are a couple of ways you can go about this. Before discovering
the Biostrings::trimLRPatterns function, I rigged together something
to do this like so:

1. Call vmatchPattern(my.adapter.sequence, myDNAStringSet) with a
fast-and-loose value for max.mismatch to find where the adapter might
start in each read of myDNAStringSet.

2. Call the "startIndex" function on the object returned by
vmatchPattern to get the match start positions for each read, and use
those to decide where to cut each element of your DNAStringSet.

I'll leave it as an exercise to the reader to stitch that together,
but as I mentioned before, I think this is what the trimLRPatterns
function is supposed to do, so you might just want to start/play with
that.
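
For what it's worth, here is a minimal sketch of how the two steps
above could be stitched together. The toy reads, the variable names,
the choice of max.mismatch = 0, and the use of subseq() for the actual
cut are my own assumptions for illustration, so treat it as a starting
point rather than a tested pipeline:

```r
library(Biostrings)

# Toy data (hypothetical): the read from the question plus one read
# that does not contain the adapter chunk at all.
reads <- DNAStringSet(c("CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT",
                        "ACGTACGTACGTACGTACGT"))
adapter <- "TCGTATGCCGTC"

# Step 1: look for the adapter chunk in every read. Raise max.mismatch
# if you want to tolerate sequencing errors in the adapter.
hits <- vmatchPattern(adapter, reads, max.mismatch = 0)

# Step 2: startIndex() gives one element per read holding the match
# start positions (NULL when there is no match). Keep the leftmost one.
starts <- startIndex(hits)
first_hit <- vapply(starts,
                    function(s) if (length(s)) min(s) else NA_integer_,
                    integer(1))

# Keep everything before the adapter; reads without a hit stay whole.
keep_to <- ifelse(is.na(first_hit), width(reads), first_hit - 1L)
keep_to <- pmax(keep_to, 0L)  # guard against a negative end position

trimmed <- subseq(reads, start = 1L, end = keep_to)
trimmed
```

Reads with no hit are left untouched here; whether that is what you
want (as opposed to dropping them) depends on your downstream
analysis.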

Also, there was some discussion about doing this in the
bioc-sig-sequencing mailing list. You might want to subscribe to that
and search the archive for some inspiration:

https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Hope that helps,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


