[BioC] how to use rtracklayer to retrieve sequence of list of coordinates

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Tue Mar 2 15:27:54 CET 2010


One thing you want to be careful with when you do stuff like this, is
that the annotation matches up _exactly_ with the genome you are
using.  This really bit me in S. cerevisiae.  Essentially even a small
indel will make everything out of phase.  And you really need to watch
out for different versions of the genome.  Sometimes extra stuff gets
resolved at the end of the chromosomes.

In this case it seems you are using UCSC annotation and BSgenome so it
should work.  But I urge you to double check.  This is something that
can be very hard to track down.

Kasper

On Tue, Mar 2, 2010 at 9:21 AM, sabrina s <sabrina.shao at gmail.com> wrote:
> Hi, Michael:
>
> See bellow:
>
>
> Error in IRanges:::.normargSEW(start, "start") :
>  'start' must be a vector of integers
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.16 BSgenome_1.14.2
> [3] Biostrings_2.14.5                  rtracklayer_1.6.0
> [5] RCurl_1.3-1                        bitops_1.0-4.1
> [7] IRanges_1.4.10
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.6.0 XML_2.6-0
>
>
> Thanks!
>
> Sabrina
>
>
> On Tue, Mar 2, 2010 at 8:49 AM, Michael Lawrence
> <lawrence.michael at gene.com>wrote:
>
>> Could you paste your sessionInfo() output please?
>>
>>
>> On Mon, Mar 1, 2010 at 7:37 PM, sabrina s <sabrina.shao at gmail.com> wrote:
>>
>>> Hi, Michael:
>>>
>>> I tried as you suggested, but I had the following error:
>>>
>>> Error in IRanges:::.normargSEW(start, "start") :
>>>   'start' must be a vector of integers
>>>
>>> my testTrack looks like this:
>>>
>>> RangedData with 76 rows and...
>>>          space                 ranges |        strand
>>>    <character>              <IRanges> |  <factor>
>>> 1         chr1 [137159091, 137159290] |        -
>>> 2         chr1 [155302742, 155302941] |       -
>>> etc
>>>
>>> Any suggestions? Thanks!
>>>
>>> Sabrina
>>>
>>>
>>> On Mon, Mar 1, 2010 at 6:02 PM, Michael Lawrence <
>>> lawrence.michael at gene.com> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Mar 1, 2010 at 1:02 PM, sabrina s <sabrina.shao at gmail.com>wrote:
>>>>
>>>>> Hi, Steve:
>>>>> I am trying to get it worked without separating the chromosomes. I used
>>>>> the
>>>>> following code:
>>>>> seqQuery<-data.frame(
>>>>>        chr=currCoorList$chr,
>>>>>    start=currCoorList$starts,
>>>>>    end=currCoorList$ends,
>>>>>    strand=currCoorList$strand)
>>>>>    mySeq<- getSeq(Mmusculus, seqQuery$chr,
>>>>>         start=seqQuery$start, end=seqQuery$end,
>>>>>            strand=seqQuery$strand )
>>>>>
>>>>>
>>>>> But then I got an error:
>>>>> Error in reverseComplement(DNAStringSet(seqs[ii])) :
>>>>>  could not find function "copy"
>>>>>
>>>>
>>>> Did I miss some depending package?
>>>>>
>>>>>
>>>> Don't know what's up with that, but you don't need to use a data.frame
>>>> there. You already had a RangedData above, just pass that to getSeq, i.e.
>>>>
>>>> getSeq(Mmusculus, testTrack)
>>>>
>>>>  Sabrina
>>>>>
>>>>> On Mon, Mar 1, 2010 at 2:31 PM, Steve Lianoglou <
>>>>> mailinglist.honeypot at gmail.com> wrote:
>>>>>
>>>>> > Hi Sabrina,
>>>>> >
>>>>> > On Mon, Mar 1, 2010 at 2:04 PM, sabrina s <sabrina.shao at gmail.com>
>>>>> wrote:
>>>>> > > Hi,all:
>>>>> > > I have a long list of coordinates, which was created by the
>>>>> following
>>>>> > code:
>>>>> > > testIranges<- IRanges(start=currCoorList $starts,
>>>>> > >        end=currCoorLis$ends)
>>>>> > >
>>>>> > >   testTrack<-GenomicData(testIranges,chrom=currCoorList$chr,
>>>>> > >        strand=currCoorList$strand,genome="mm9")
>>>>> > >
>>>>> > > I could then export the testTrack into .bed file and upload in the
>>>>> UCSC
>>>>> > > genome browser and get the sequences of the listed coordinate
>>>>> > > But I was thinking if there is a function from rtracklayer to
>>>>> retrieve
>>>>> > > sequence directly.
>>>>> >
>>>>> > If you're trying to get coordinates from some reference genome, you
>>>>> > can use the BSgenome.*.* packages and skip rtracklayer + internet
>>>>> > queries entirely.
>>>>> >
>>>>> > For instance, say I have an IRanges object `r1` which is a set of
>>>>> > intervals on chromosome 1 from hg18 that I want sequence information
>>>>> > for, I could do it like this:
>>>>> >
>>>>> > R> r1
>>>>> > IRanges of length 589
>>>>> >        start      end width
>>>>> > [1]   18016124 18016155    32
>>>>> > [2]   18020749 18020780    32
>>>>> > [3]   18024599 18024630    32
>>>>> > [4]   18024830 18024861    32
>>>>> > [5]   18051985 18052016    32
>>>>> > [6]   18064760 18064791    32
>>>>> > [7]   18088145 18088176    32
>>>>> > [8]   18088200 18088231    32
>>>>> > [9]   18110085 18110116    32
>>>>> > ...
>>>>> >
>>>>> > R> library(BSgenome.Hsapiens.UCSC.hg18)
>>>>> > R> chr1 <- unmasked(HSapiens$chr1)
>>>>> > R> myseqs <- Views(chr1, start=start(r1), end=end(r1))
>>>>> > R> myseqs
>>>>> >  Views on a 247249719-letter DNAString subject
>>>>> > subject:
>>>>> >
>>>>> TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>>>>> > views:
>>>>> >        start      end width
>>>>> > [1] 18016124 18016155    32 [GTTTTTTGTTTTGTTTTGTTTTTTGTTTTTTA]
>>>>> > [2] 18020749 18020780    32 [CGCCTAGTCTGGAGTGCAGTGGCACAATCTTG]
>>>>> > [3] 18024599 18024630    32 [TCTCACCTCTTCACCGAAGTGGTCTTCTGAAC]
>>>>> > [4] 18024830 18024861    32 [GAATTCTCCCTCGTTGCAGATCCTGTTTGAGT]
>>>>> > [5] 18051985 18052016    32 [TTATTTTTTTTGAGGAAGAGTCTTGCTCTGTC]
>>>>> > [6] 18064760 18064791    32 [GTGGGAGGATCGCTTGAGCCCTGGAGGTTGGG]
>>>>> > [7] 18088145 18088176    32 [ATCATGGTGGGAGATGAAAGGCACTTCTTACA]
>>>>> > [8] 18088200 18088231    32 [GAAGAAGCAAAAGCGGAAACCCCTGCTAAACC]
>>>>> > [9] 18110085 18110116    32 [GGATCATGAGTTCAAGAGTTCGAGACCAGCCT]
>>>>> > [10] 18118454 18118485    32 [TGTTGCCTAGGCTGGTCTCAAACTCCTGCCCT]
>>>>> > ...
>>>>> >
>>>>> > calling "as.character" on the myseqs object will give you the sequence
>>>>> > as a character vector.
>>>>> >
>>>>> > Does that get you what you're after?
>>>>> > -steve
>>>>> >
>>>>> > --
>>>>> > Steve Lianoglou
>>>>> > Graduate Student: Computational Systems Biology
>>>>> >  | Memorial Sloan-Kettering Cancer Center
>>>>> >  | Weill Medical College of Cornell University
>>>>> > Contact Info: http://cbio.mskcc.org/~lianos/contact<http://cbio.mskcc.org/%7Elianos/contact>
>>>>> <http://cbio.mskcc.org/%7Elianos/contact>
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sabrina
>>>>>
>>>>>        [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Sabrina
>>>
>>
>>
>
>
> --
> Sabrina
>
>        [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list