[BioC] question about how to get ranges in Biostrings

Thu Feb 21 22:34:30 CET 2013

Hi Andress,

[cc'ing the list so others can benefit from or participate to the
  thread]

On 02/21/2013 06:43 AM, Andres Weinfeld Avalos Figueroa wrote:
> Dear Dr Pagés
> Greetings from Guatemala.
> I am somewhat new in using BioC, and an intermediate R user.
>
> Last couple of days I have been working with the matchProbePair function
> on Sorghum genome, testing around 3,500 SSR primer pairs generated in
> 2011 for sugarcane for the ICSB. I am using Sorghum as a framework due
> to its genome stability and ploidy (besides genetic closeness) I have
> succeeded in evaluating the 10 chromosomes, and obtained a interesting
> amount of stringviews.
>
> Now I want to plot those theoretical amplicons along S. bicolor genome
> (chromosome by crhromosome), but I have not found the way to substract
> the ranges from this list. Being "result" the matchprobepair outputI
> know that >range(result[[1]]) does it, but I want to use lapply()
> function...

To extract the ranges (as an IRanges object) from the result (a Views
object), just use ranges():

   > result <- matchProbePair(Fprobe, Rprobe, subject)
   > rg <- ranges(result)
   > rg
   IRanges of length 9
          start      end   width
   [1]  6523541  6524340     800
   [2]  7268925  8284316 1015392
   [3]  9019256  9295206  275951
   [4] 13503602 15492126 1988525
   [5] 17612494 19210625 1598132
   [6] 21840840 21862022   21183
   [7] 22281529 22329032   47504
   [8] 23045531 23096640   51110
   [9] 23759210 24263538  504329

If you want to plot those ranges, for example with the Gviz package:

   library(Gviz)
   plotTracks(list(GenomeAxisTrack(),
                   AnnotationTrack(GRanges("chr3R", rg))))

If you want to use lapply(), then:

   lapply(seq_along(rg),
          function(i) {
              ... do something with start(rg)[i] and end(rg)[i] ...
          })

Note that the plotting code and lapply code above would work directly
on the Views object (result) so there is no need to extract the
ranges in 'rg'.

If you want to extract the sequences of the amplicons, just pass
'result' to the DNAStringSet constructor:

   > amplicons <- DNAStringSet(result)
   > amplicons
     A DNAStringSet instance of length 9
         width seq
   [1]     800 
AGCTCCGAGTTCCTGCAATATTTGAACGAGGAC...CTTCCAAGCCATCCGCATATTTGTGAACAACG
   [2] 1015392 
AGCTCCGAGTTGAGATGAGGCTACCCACTTAAT...TATTTGGCCTCGGTCAAATCGAACTCGGAGCT
   [3]  275951 
AGCTCCGAGTTACGTGTATCGATTACCCCGATT...ACTAGGAAATAAATAAAGTAGAACTCGGAGCT
   [4] 1988525 
CGTTGTTCACATCCCTCTGCTGTTCCCGCTGAC...ACCGATGATAGAAAGAGTGTTAACTCGGAGCT
   [5] 1598132 
CGTTGTTCACAATCGTCAGATCGAAGAAGTGAC...CACTTGGAACTTGACACTTGGAACTCGGAGCT
   [6]   21183 
AGCTCCGAGTTCGTCGAGAATCTGCACGAGAAG...GCCATGTGCATAAGGGCTCGGAACTCGGAGCT
   [7]   47504 
AGCTCCGAGTTGCGCTTCCTGATGGGCACGGAT...CGGGTGTAAAAAGCTGAGAGCAACTCGGAGCT
   [8]   51110 
CGTTGTTCACATTTTGCTCAGAGCATTTGCATT...AAAACTAAACCCGCAGACTGATGTGAACAACG
   [9]  504329 
AGCTCCGAGTTCTAAAAAGAGCAGTTTTATCAG...GCTGCCAATAAAGGAAAAGGAAACTCGGAGCT

If you want to look at them individually, use [[ either on 'result'
or 'amplicons':

   > result[[3]]
     275951-letter "DNAString" instance
   seq: 
AGCTCCGAGTTACGTGTATCGATTACCCCGATTTTA...AAAAACTAGGAAATAAATAAAGTAGAACTCGGAGCT
   > amplicons[[3]]
     275951-letter "DNAString" instance
   seq: 
AGCTCCGAGTTACGTGTATCGATTACCCCGATTTTA...AAAAACTAGGAAATAAATAAAGTAGAACTCGGAGCT

If you want to loop on the sequences of the amplicons, you can use
lapply() directly on 'result' or 'amplicons':

   lapply(amplicons, ...)

But if you have a lot of amplicons (e.g. more than 10-20k), that's
probably going to take a long time. Note that a lot of basic
operations are serialized and fast even on big DNAStringSet objects:

   > alphabetFrequency(amplicons)
              A      C      G      T M R W S Y K V H D B N - +
    [1,]    216    196    218    170 0 0 0 0 0 0 0 0 0 0 0 0 0
    [2,] 287220 220536 221447 286189 0 0 0 0 0 0 0 0 0 0 0 0 0
    [3,]  79566  58099  58205  80081 0 0 0 0 0 0 0 0 0 0 0 0 0
    [4,] 573547 421203 423538 570237 0 0 0 0 0 0 0 0 0 0 0 0 0
    [5,] 453126 345037 344723 455246 0 0 0 0 0 0 0 0 0 0 0 0 0
    [6,]   5902   4635   4783   5863 0 0 0 0 0 0 0 0 0 0 0 0 0
    [7,]  13708  10336  10342  13118 0 0 0 0 0 0 0 0 0 0 0 0 0
    [8,]  12589  12586  12343  13592 0 0 0 0 0 0 0 0 0 0 0 0 0
    [9,] 142837 108612 110988 141892 0 0 0 0 0 0 0 0 0 0 0 0 0

Hope this helps,
H.

>
> This is a really basic question, I hope you can find a second to push me
> throug... I would really appreciate your help. I am not in my computer
> right now... don't have my work handy but in a couple of hours.
>
> Kind regards
> Andres Avalos
> Biotecnologist

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319