[BioC] GeneID -> nt sequence

Sean Davis sdavis2 at mail.nih.gov
Mon Jul 25 22:45:47 CEST 2005


On Jul 25, 2005, at 3:46 PM, Paul Grosu wrote:

>
> Hi -
>
> Anyone know how I can get the nucleotide sequence(s) to a GeneID?
>
> Here is an example:
>
> Let's say I have the following gene ID (960)  and below is the link to
> it on NCBI:
>
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi? 
> db=gene&cmd=Retrieve&dopt=Graphics&list_uids=960#ubor85_RefSeq
>
> when I do the following commands:
>
> library(annotate)
> getSEQ("960")

If you look at the help for getSEQ, it takes a GenBank accession  
number, not a GeneID, as an argument (and the two are quite different).


> But these gene has several variants so I wanted to find the different
> sequence variants which I'm having trouble figuring out from the
> AnnBuilder or annotate packages.
>

If you want the sequences for a given GeneID, you are probably talking  
about RefSeq sequences.  There may be an easier way using one of the  
annotation packages, but this is pretty easy, also.  Grab the refseq to  
GeneID mapping file from the web at:

ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz

gunzip the file and then load it into R (which will take a few  
minutes--big file):

gene2refseq <- read.table('gene2refseq',sep="\t",header=FALSE)

The documentation for the columns is included below.  The second column  
of the resulting data frame contains GeneID.  The fifth column contains  
the GI number for the nucleotide sequence.  Therefore, you can do  
something like:

generows <- which(gene2refseq[,2]==960)

sequences <- sapply(gene2refseq[generows,4],getSEQ)

Sequences now has 5 sequences, corresponding to the 5 RefSeq sequences  
associated with your gene.  You hopefully get the idea.  As for your  
gene having multiple isoforms, there are 5 RefSeqs for GeneID 960, so  
you will end up with 5 sequences.  I would also look at UCSC or  
EnsEMBL, both of which can find your sequences based on GeneID via the  
web in batch mode.  I have only shown one way in R to do it.

Hope this helps,
Sean





DOCUMENTATION from NCBI for gene2refseq file
======================================================================== 
===
gene2refseq
------------------------------------------------------------------------ 
---
            This file can be considered as the logical equivalent of

                     ftp://ftp.ncbi.nih.gov/refseq/LocusLink/loc2ref

            tab-delimited
            one line per genomic/RNA/protein set of RefSeqs
------------------------------------------------------------------------ 
---

tax_id:
            the unique identifier provided by NCBI Taxonomy
            for the species or strain/isolate

GeneID:
            the unique identifier for a gene
            --note:  for genomes previously available from LocusLink,
                     the identifiers are equivalent

status:
             status of the RefSeq

RNA nucleotide accession.version
            may be null (-) for some genomes

RNA nucleotide gi
            the gi for an RNA nucleotide accession, '-' if not applicable

protein accession.version
            will be null (-) for RNA-coding genes

protein gi:
            the gi for a protein accession, '-' if not applicable

genomic nucleotide accession.version
            may be null (-) if a RefSeq was provided after
            the genomic accession was submitted

genomic nucleotide gi
            the gi for a genomic nucleotide accession, '-' if not  
applicable

start position on the genomic accession
             position of the gene feature on the genomic accession,
             '-' if not applicable
             position 0-based

end positon on the genomic accession
             position of the gene feature on the genomic accession,
             '-' if not applicable
             position 0-based

orientation
             orientation of the gene feature on the genomic accession,
             '?' if not applicable



More information about the Bioconductor mailing list