[BioC] assemble DNAString complete coding sequence from exons?

Thu Mar 26 00:32:33 CET 2009

Hi Paul,

You can now use c() on XString objects (and more generally on XRaw objects).
See ?XRaw for some examples. The examples in that man page use XRaw objects,
but they translate directly to XString objects, which are a particular
type of XRaw object.
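
Applied to the original question, a minimal sketch could look like the
following. The toy sequence, the exon coordinates and the names chrom,
exon_start and exon_end are made up for illustration, and the code assumes
a Biostrings version where c() and subseq() work on DNAString objects.

```r
library(Biostrings)

# Toy "chromosome" and hypothetical exon coordinates (1-based, inclusive).
chrom <- DNAString("AAATGCCCTTTGGGTAACC")
exon_start <- c(3, 12)
exon_end   <- c(5, 17)

# Extract each exon as a DNAString, without going through character strings.
exons <- lapply(seq_along(exon_start), function(i)
    subseq(chrom, start=exon_start[i], end=exon_end[i]))

# Concatenate the exons into a single DNAString with c().
cds <- do.call(c, exons)
```

With real data, exon_start and exon_end would come from your annotation,
and chrom would be the full chromosome sequence.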

I've also added the xscat() function, which is the equivalent of
paste(..., sep="") for XString/XStringSet/XStringViews objects.
See ?xscat.
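
As a rough illustration of the paste() analogy, here is a small sketch
with made-up sequences. The variable names are arbitrary, and it assumes
a Biostrings version that provides xscat().

```r
library(Biostrings)

# Two single sequences: the result is one concatenated DNAString.
cds <- xscat(DNAString("ATG"), DNAString("GGGTAA"))

# Two sets of sequences: concatenation is element-wise, like paste().
left   <- DNAStringSet(c("AC", "GT"))
right  <- DNAStringSet(c("T", "A"))
joined <- xscat(left, right)
```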

Regarding your February request for more support for modifying an XString
object (https://stat.ethz.ch/pipermail/bioconductor/2009-February/026209.html),
I've added a subseq() replacement method (subseq<-) for XRaw/XString objects.
Again, some examples with XRaw objects are given in ?XRaw.

Here are a couple of more advanced examples with a chromosome sequence:

   (a) Delete regions specified by their coordinates:

         v <- Views(chrom, start=region_starts, end=region_ends)
         do.call(c, as.list(gaps(v)))

       Note that 'v' could be the result of a call to
       matchPattern(some_pattern, chrom). This provides an easy
       way to delete patterns from a chromosome.

   (b) Modifying a chromosome:

         library(BSgenome.Dmelanogaster.UCSC.dm3)
         chr2L <- unmasked(Dmelanogaster$chr2L)
         # delete the first 1000 bases:
         subseq(chr2L, end=1000) <- NULL
         # insert 5 As right after base 6:
         subseq(chr2L, end=6, width=0) <- DNAString("AAAAA")
         # replace base -10 (base 10 counting from the 3' end) by 2 Gs:
         subseq(chr2L, start=-10, width=1) <- DNAString("GG")
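
To see what the insertion and replacement forms of subseq<- do without
loading a whole genome, here is a sketch on a short made-up sequence
(the name x and its content are purely illustrative). It assumes a
Biostrings version that provides the subseq() replacement method.

```r
library(Biostrings)

x <- DNAString("ACGTACGTACGT")

# Insert 5 As right after base 6. The zero-width range that ends at
# position 6 sits between bases 6 and 7.
subseq(x, end=6, width=0) <- DNAString("AAAAA")
after_insert <- as.character(x)

# Replace the 10th base counting from the 3' end by 2 Gs.
# Negative start values are counted from the end of the sequence.
subseq(x, start=-10, width=1) <- DNAString("GG")
after_replace <- as.character(x)
```

The replacement value does not need to have the same width as the range
it replaces, so the sequence grows from 12 to 17 and then to 18 bases.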

Note that these new features don't work (yet) on MaskedXString
objects. They are available only in BioC devel, starting with
Biostrings 2.11.44 and IRanges 1.1.55. They should propagate to the
public repo within the next 24 hours, but you can get them from svn if you
need them now. Your feedback is welcome.

Cheers,
H.

> sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-11 r47901)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_CA.UTF-8;LC_NUMERIC=C;LC_TIME=en_CA.UTF-8;LC_COLLATE=en_CA.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_CA.UTF-8;LC_PAPER=en_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_CA.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome_1.11.13   Biostrings_2.11.44 IRanges_1.1.55

loaded via a namespace (and not attached):
[1] Biobase_2.3.10

Paul Shannon wrote:
> I am sure there is an elegant way to do this.  Could somebody clue me in?
> 
> I have (in a simple case) two exons for a gene on the + strand, and
> the full DNAString sequence of its chromosome.
> 
> My naive technique for constructing a DNAString of the entire coding 
> sequence is
> 
>   1) paste together toString (subseq (seq.chrom, exon.start, exon.end))
>      for each exon
>   2) construct a DNAString from the resulting chars.
> 
> There must be a better way.  What is it?
> 
> Thanks!
> 
>  - Paul

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319