[BioC] extracting sequence from a genome

Iain Gallagher iaingallagher at btopenworld.com
Thu Mar 15 19:16:39 CET 2012



Hi Gentlemen

Thanks for taking the time. Here's where I am just now:

library(BSgenome.Rnorvegicus.UCSC.rn4)# get genome
library(GenomicRanges)

fl <- "ftp://mirbase.org/pub/mirbase/CURRENT/genomes/rno.gff" # get miR coords
gff <- read.table(fl) # as dataframe

names <- gff[,10]
nms <- gsub(";", "", gsub("\"", "", gsub("ID=\"", "", names))) # a set of nested gsub with regex to leave only the text in the double quotes

gr <- GRanges(seqnames = Rle(paste('chr', gff[,1], sep='')), ranges = IRanges(gff[,4], end = gff[,5], names = nms), strand = Rle(gff[,7]))

seqs <- getSeq(Rnorvegicus, flank(gr, 200))
names(seqs) <- nms

It's much lighter on it's feet than a loop and a nice intro to the GenomicRanges package for me.

As a follow-up question I'm going to write out the seqs object as fasta and use it in clover for TFBS analysis. 


I was wondering whether the strand is taken into account when I get the flanking sequence i.e. is the sequence returned from the + or - strand as defined in the GRanges object? 


I only ask this because presumably that will affect my TFBS analysis and if so I might want to reverse / complement all those sequences that are upstream of miRs transcribed from the - strand.


Best

iain
________________________________
From: "Tim Triche, Jr." <tim.triche at gmail.com>
To: Steve Lianoglou <mailinglist.honeypot at gmail.com> 
Cc: Iain Gallagher <iaingallagher at btopenworld.com>; bioconductor <bioconductor at stat.math.ethz.ch> 
Sent: Thursday, 15 March 2012, 17:44
Subject: Re: [BioC] extracting sequence from a genome


Yeah I just realized that when I pasted it in to execute it.  Trying to find where exactly I did this in the past (because it worked back then)




On Thu, Mar 15, 2012 at 10:42 AM, Steve Lianoglou <mailinglist.honeypot at gmail.com> wrote:

Howdy,
>
>
>On Thu, Mar 15, 2012 at 1:10 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>> what you want to do is create a Ranges object and then get a View of the
>> sequences.
>>
>>
>> library(Biostrings)
>> library(GenomicRanges)
>> mirs <- GRanges( paste('chr', mirInfo[i,1], sep=''),
>>                 IRanges(start=mirInfo[i,4], width=1),
>>                 strand=whereverYouKeepTheStrand )
>> upstream <- flank(mirs, 200)
>> upseqs <- Views(Rnorvegicus, mirs)
>> show(upseqs)
>>
>> Untested because I am lazy as hell, but it should work, and be a LOT faster.
>
>Maybe in R 2.16 ...
>
>R> Views(Hsapiens, GRanges(c('chr1', 'chr2'), IRanges(c(6e6, 6e6),
>width=10), '+'))
>Error in function (classes, fdef, mtable)  :
> unable to find an inherited method for function "Views", for
>signature "BSgenome"
>
>;-)
>
>-steve
>
>--
>Steve Lianoglou
>Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
>Contact Info: http://cbio.mskcc.org/~lianos/contact
>


-- 
A model is a lie that helps you see the truth.


Howard Skipper



More information about the Bioconductor mailing list