[BioC] getSequence with biomaRt

Marc Carlson mcarlson at fhcrc.org
Fri Dec 16 00:55:08 CET 2011


Hi Michael,

I am not actually sure how you would best do this with BiomaRt, but here 
is how you might do it using one of the TxDb packages:

gids <- c('8317','1435','1063','4751','3832')

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
utrs <- fiveUTRsByTranscript(txdb)

## This then gets you a GRangesList object for the entrez gene IDs in gids.
## This object has the ranges for the UTRs that match the different 
known transcripts for each gene.
matches <- utrs[names(utrs) %in% gids]
## At this point you may decide you want more or less UTR sequence.
## You can easily adjust your ranges accordingly if you want something else.

## And once you are happy with your ranges, you can retrieve sequences 
with the help of a BSGenome package like this:
library(BSgenome.Hsapiens.UCSC.hg19)
## Then use getSeq   ( help("getSeq,BSgenome-method") )
unlist(matches)
seqs <- getSeq(Hsapiens, names=unlist(matches))
## This gives you a DNAStringSet object
seqs


Hope this helps,



   Marc




On 12/14/2011 02:14 PM, Michael Gormley wrote:
> I am trying to DNA sequence of the upstream regulatory region of a number
> of genes using the biomaRt package.  I start with a list of EntrezGene IDs
> and would like to extract the sequence from 10Kb upstream of the end of the
> 5'UTR to the end of the 5'UTR.  I wrote a neat little package of scripts to
> do this with biomaRt and export the data in .FASTA format.  I have found
> that this works well when I search for one gene at a time.  But when I
> input a list of entrez gene ids to the getSequence function it gives me
> back sequences but the sequences do not always match the answer I get when
> I search for the gene one at a time.  Sending calls to biomaRt one gene at
> a time will clearly be much slower but fast searches do me little good if I
> don't know whether the answer is correct or not.
>
> The codes I have written are pretty straight forward.  The meat of the code
> is the getSequence function which I call as follows:
>
> biomart<- useMart('ensembl')
> martDataset<-useDataset(
> dataset,mart=biomart)
> getSequence(id = '6720', type='entrezgene',seqType =
> 'coding_gene_flank',upstream = 10000, mart = martDataset)
>
> As an example, when I search the 5869 gene by itself, I get the same answer
> provided by the biomart web based tool.  When I search for the 5869 gene in
> a list like the following list, I get a different sequence.
>
> c('8317','1435','1063','4751','3832','3070','5869','675','81624','7249','1186','3801','672','1058','22974','23654','4171','1062','3148','4001','3007','26271','9314')
>
>
> Thanks for any help with this problem.  Let me know if you need more info.
> I am open to other ways of solving this problem without biomaRt
> getSequence.
>
> Thanks,
> Michael Gormley
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list