[BioC] how to use rtracklayer to retrieve sequence of list of coordinates

Hervé Pagès hpages at fhcrc.org
Wed Mar 3 03:03:21 CET 2010


Hi Sabrina, Michael,

It doesn't work for me either with a RangedData:

   > testTrack <- RangedData(
                  space=c("chr1", "chrY"),
                  ranges=IRanges(start=1:1000, width=50),
                  strand=c("+", "+", "-", "-"))
   > library(BSgenome.Mmusculus.UCSC.mm9)
   > getSeq(Mmusculus, testTrack)
   Error in IRanges:::.normargSEW(start, "start") :
     'start' must be a vector of integers

But you can work around this by using:

   seq <- getSeq(Mmusculus, space(testTrack),
                 start=start(testTrack), end=end(testTrack),
                 strand=strand(testTrack))

See ?getSeq for more information.

BTW it seems that your installation is not up-to-date (for example
you have Biostrings 2.14.5, but the current version is 2.14.12).
I strongly recommend that you update your installation first with:

   > source("http://bioconductor.org/biocLite.R")
   > update.packages(repos=biocinstallRepos(), ask=FALSE)

Cheers,
H.


sabrina s wrote:
> Hi, Michael:
> 
> See bellow:
> 
> 
> Error in IRanges:::.normargSEW(start, "start") :
>   'start' must be a vector of integers
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> i386-pc-mingw32
> 
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.16 BSgenome_1.14.2
> [3] Biostrings_2.14.5                  rtracklayer_1.6.0
> [5] RCurl_1.3-1                        bitops_1.0-4.1
> [7] IRanges_1.4.10
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.0 XML_2.6-0
> 
> 
> Thanks!
> 
> Sabrina
> 
> 
> On Tue, Mar 2, 2010 at 8:49 AM, Michael Lawrence
> <lawrence.michael at gene.com>wrote:
> 
>> Could you paste your sessionInfo() output please?
>>
>>
>> On Mon, Mar 1, 2010 at 7:37 PM, sabrina s <sabrina.shao at gmail.com> wrote:
>>
>>> Hi, Michael:
>>>
>>> I tried as you suggested, but I had the following error:
>>>
>>> Error in IRanges:::.normargSEW(start, "start") :
>>>   'start' must be a vector of integers
>>>
>>> my testTrack looks like this:
>>>
>>> RangedData with 76 rows and...
>>>          space                 ranges |        strand
>>>    <character>              <IRanges> |  <factor>
>>> 1         chr1 [137159091, 137159290] |        -
>>> 2         chr1 [155302742, 155302941] |       -
>>> etc
>>>
>>> Any suggestions? Thanks!
>>>
>>> Sabrina
>>>
>>>
>>> On Mon, Mar 1, 2010 at 6:02 PM, Michael Lawrence <
>>> lawrence.michael at gene.com> wrote:
>>>
>>>>
>>>> On Mon, Mar 1, 2010 at 1:02 PM, sabrina s <sabrina.shao at gmail.com>wrote:
>>>>
>>>>> Hi, Steve:
>>>>> I am trying to get it worked without separating the chromosomes. I used
>>>>> the
>>>>> following code:
>>>>> seqQuery<-data.frame(
>>>>>        chr=currCoorList$chr,
>>>>>    start=currCoorList$starts,
>>>>>    end=currCoorList$ends,
>>>>>    strand=currCoorList$strand)
>>>>>    mySeq<- getSeq(Mmusculus, seqQuery$chr,
>>>>>         start=seqQuery$start, end=seqQuery$end,
>>>>>            strand=seqQuery$strand )
>>>>>
>>>>>
>>>>> But then I got an error:
>>>>> Error in reverseComplement(DNAStringSet(seqs[ii])) :
>>>>>  could not find function "copy"
>>>>>
>>>> Did I miss some depending package?
>>>>>
>>>> Don't know what's up with that, but you don't need to use a data.frame
>>>> there. You already had a RangedData above, just pass that to getSeq, i.e.
>>>>
>>>> getSeq(Mmusculus, testTrack)
>>>>
>>>>  Sabrina
>>>>> On Mon, Mar 1, 2010 at 2:31 PM, Steve Lianoglou <
>>>>> mailinglist.honeypot at gmail.com> wrote:
>>>>>
>>>>>> Hi Sabrina,
>>>>>>
>>>>>> On Mon, Mar 1, 2010 at 2:04 PM, sabrina s <sabrina.shao at gmail.com>
>>>>> wrote:
>>>>>>> Hi,all:
>>>>>>> I have a long list of coordinates, which was created by the
>>>>> following
>>>>>> code:
>>>>>>> testIranges<- IRanges(start=currCoorList $starts,
>>>>>>>        end=currCoorLis$ends)
>>>>>>>
>>>>>>>   testTrack<-GenomicData(testIranges,chrom=currCoorList$chr,
>>>>>>>        strand=currCoorList$strand,genome="mm9")
>>>>>>>
>>>>>>> I could then export the testTrack into .bed file and upload in the
>>>>> UCSC
>>>>>>> genome browser and get the sequences of the listed coordinate
>>>>>>> But I was thinking if there is a function from rtracklayer to
>>>>> retrieve
>>>>>>> sequence directly.
>>>>>> If you're trying to get coordinates from some reference genome, you
>>>>>> can use the BSgenome.*.* packages and skip rtracklayer + internet
>>>>>> queries entirely.
>>>>>>
>>>>>> For instance, say I have an IRanges object `r1` which is a set of
>>>>>> intervals on chromosome 1 from hg18 that I want sequence information
>>>>>> for, I could do it like this:
>>>>>>
>>>>>> R> r1
>>>>>> IRanges of length 589
>>>>>>        start      end width
>>>>>> [1]   18016124 18016155    32
>>>>>> [2]   18020749 18020780    32
>>>>>> [3]   18024599 18024630    32
>>>>>> [4]   18024830 18024861    32
>>>>>> [5]   18051985 18052016    32
>>>>>> [6]   18064760 18064791    32
>>>>>> [7]   18088145 18088176    32
>>>>>> [8]   18088200 18088231    32
>>>>>> [9]   18110085 18110116    32
>>>>>> ...
>>>>>>
>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg18)
>>>>>> R> chr1 <- unmasked(HSapiens$chr1)
>>>>>> R> myseqs <- Views(chr1, start=start(r1), end=end(r1))
>>>>>> R> myseqs
>>>>>>  Views on a 247249719-letter DNAString subject
>>>>>> subject:
>>>>>>
>>>>> TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>>>>>> views:
>>>>>>        start      end width
>>>>>> [1] 18016124 18016155    32 [GTTTTTTGTTTTGTTTTGTTTTTTGTTTTTTA]
>>>>>> [2] 18020749 18020780    32 [CGCCTAGTCTGGAGTGCAGTGGCACAATCTTG]
>>>>>> [3] 18024599 18024630    32 [TCTCACCTCTTCACCGAAGTGGTCTTCTGAAC]
>>>>>> [4] 18024830 18024861    32 [GAATTCTCCCTCGTTGCAGATCCTGTTTGAGT]
>>>>>> [5] 18051985 18052016    32 [TTATTTTTTTTGAGGAAGAGTCTTGCTCTGTC]
>>>>>> [6] 18064760 18064791    32 [GTGGGAGGATCGCTTGAGCCCTGGAGGTTGGG]
>>>>>> [7] 18088145 18088176    32 [ATCATGGTGGGAGATGAAAGGCACTTCTTACA]
>>>>>> [8] 18088200 18088231    32 [GAAGAAGCAAAAGCGGAAACCCCTGCTAAACC]
>>>>>> [9] 18110085 18110116    32 [GGATCATGAGTTCAAGAGTTCGAGACCAGCCT]
>>>>>> [10] 18118454 18118485    32 [TGTTGCCTAGGCTGGTCTCAAACTCCTGCCCT]
>>>>>> ...
>>>>>>
>>>>>> calling "as.character" on the myseqs object will give you the sequence
>>>>>> as a character vector.
>>>>>>
>>>>>> Does that get you what you're after?
>>>>>> -steve
>>>>>>
>>>>>> --
>>>>>> Steve Lianoglou
>>>>>> Graduate Student: Computational Systems Biology
>>>>>>  | Memorial Sloan-Kettering Cancer Center
>>>>>>  | Weill Medical College of Cornell University
>>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact<http://cbio.mskcc.org/%7Elianos/contact>
>>>>> <http://cbio.mskcc.org/%7Elianos/contact>
>>>>>
>>>>>
>>>>> --
>>>>> Sabrina
>>>>>
>>>>>        [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>
>>> --
>>> Sabrina
>>>
>>
> 
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list