[BioC] from RefSeq to GO terms / gene symbol to geneID

Simon Lin simonlin at duke.edu
Sat Jun 30 17:11:15 CEST 2007


If you do not have a large number of sequences, BioMart is a good 
choice. -Simon

----- Original Message ----- 
From: "Lina Hultin-Rosenberg" <lina.hultin-rosenberg at ki.se>
To: "Simon Lin" <simonlin at duke.edu>
Cc: <sdavis2 at mail.nih.gov>; <bioconductor at stat.math.ethz.ch>
Sent: Friday, June 29, 2007 12:54 AM
Subject: Re: [BioC] from RefSeq to GO terms / gene symbol to geneID


> Dear Simon and Sean,
>
> Sorry to get back to this issue so late, but I have tried out various 
> options to try to solve it. I parsed the files you mentioned but did not 
> get many hits, since many of my proteins do not have an Entrez Gene ID for 
> some reason. In my search I also tried some of the Entrez e-utils 
> (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) and 
> could get the accession numbers for my proteins. Can I go from accession 
> number to GO term using biomaRt, for example?
>
> Thanks again!
>
> Best,
> Lina Rosenberg
>
> Simon Lin wrote:
>> In the following two unrelated messages, both Sean and Nianhua suggested 
>> downloading and parsing some data tables from the NCBI. The gene_info and 
>> several other tables seem very useful. If that is the case, why not 
>> pre-load them into SQLite and distribute them as part of the annotation 
>> package for human?
>>
>> Simon
>>
>> =================
>> Date: Tue, 12 Jun 2007 05:59:55 -0400
>> From: Sean Davis <sdavis2 at mail.nih.gov>
>> Subject: Re: [BioC] from RefSeq GI protein identifiers to GO terms
>> To: Lina Hultin-Rosenberg <lina.hultin-rosenberg at ki.se>
>> Cc: bioconductor at stat.math.ethz.ch
>> Message-ID: <466E6E9B.3020609 at mail.nih.gov>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> Lina Hultin-Rosenberg wrote:
>>
>>>> Dear list,
>>>>
>>>> This might be a question that has been discussed previously, but I could
>>>> not find any good solution for it. I have lists of human proteins from
>>>> various proteomics studies that I want to compare with regard to the GO
>>>> terms associated with them. I have the RefSeq GI protein ID for the
>>>> proteins, and my question is how best to map those to other identifiers
>>>> that I can use in subsequent GO analysis.
>>>>
>>>> It might be that this problem is best solved outside R, but maybe someone
>>>> can still give me a hint toward the best solution. For me this is a
>>>> problem that comes up quite often - the need to map between different
>>>> identifiers - and I have not yet found any really good solution to it. If
>>>> I use IPI, for example, I always lose some proteins/genes since the
>>>> coverage is rather bad, but maybe there is no solution that will give a
>>>> perfect mapping?!
>>>
>>
>> The file located here:
>>
>> ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz
>>
>> and described in detail here:
>>
>> ftp://ftp.ncbi.nih.gov/gene/DATA/README
>>
>> maps RefSeq to Entrez Gene ID.  Once you have the Entrez Gene ID, you
>> can use the Bioconductor annotation packages to get GO mappings.  The
>> file above is a tab-delimited text file, so you should be able to read
>> it into R and do the matching by GI number rather easily.
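
As a rough illustration of the GI matching described here, the following is a minimal Python sketch; it is not from the thread, and Sean's suggestion was to do the same thing in R. It assumes that gene2refseq is tab-delimited with tax_id in column 1, GeneID in column 2 and the protein GI in column 7. Those column positions, the function names and the local file path are assumptions, so check them against the README before relying on them.

```python
import csv
import gzip


def map_gi_to_gene_id(lines, protein_gis, tax_id="9606",
                      tax_col=0, gene_col=1, gi_col=6):
    """Return {protein GI: set of Entrez Gene IDs} for the requested GIs.

    `lines` is any iterable of tab-delimited text lines.
    The default column indexes (0-based) are assumptions about the
    gene2refseq layout; 9606 is the NCBI taxonomy ID for human.
    """
    wanted = {str(gi) for gi in protein_gis}
    found = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and the '#' header/format line.
        if not row or row[0].startswith("#"):
            continue
        # Skip rows that are too short to hold the columns we need.
        if len(row) <= max(tax_col, gene_col, gi_col):
            continue
        if row[tax_col] != tax_id:
            continue
        gi = row[gi_col]
        if gi in wanted:
            # One GI can in principle appear on several rows.
            found.setdefault(gi, set()).add(row[gene_col])
    return found


def map_gi_file(path, protein_gis, **kwargs):
    """Same as map_gi_to_gene_id, reading a local gzip-compressed copy."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return map_gi_to_gene_id(handle, protein_gis, **kwargs)
```

With a downloaded copy of the file, `map_gi_file("gene2refseq.gz", my_gis)` (where `my_gis` is a hypothetical list of GI numbers) would return a dictionary keyed by GI. Any GI missing from the result has no row in the file for the chosen tax_id, so it would need another route to an identifier.
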
>>
>> Hope that helps.
>>
>> Sean
>>
>> ========================
>> Message: 4
>> Date: Mon, 11 Jun 2007 12:36:31 +0000 (UTC)
>> From: Nianhua Li <nialicn at yahoo.com>
>> Subject: Re: [BioC] getting Locus Link ids from gene symbol
>> To: bioconductor at stat.math.ethz.ch
>> Message-ID: <loom.20070611T142932-100 at post.gmane.org>
>> Content-Type: text/plain; charset=us-ascii
>>
>> Hi, Alex,
>>
>> You can parse ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
>> There are 4 useful columns: tax_id (column 1), GeneID (column 2), Symbol 
>> (column 3), and Synonyms (column 5). You can:
>>
>> 1 Read in the file
>> 2 Filter it based on tax_id
>> 3 Match your gene symbols to the "Symbol" column and find their Gene IDs
>> 4 Remove the matched gene symbols from your list
>> 5 Match the remaining gene symbols to the "Synonyms" column and find 
>> their Gene IDs
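
The five steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not Nianhua's code: it uses the column positions given in the message (tax_id 1, GeneID 2, Symbol 3, Synonyms 5), and it additionally assumes that the Synonyms field is a "|"-separated list and that matching is case-sensitive. The function name is hypothetical.

```python
import csv


def symbols_to_gene_ids(lines, symbols, tax_id="9606"):
    """Map gene symbols to Gene IDs, preferring Symbol over Synonyms.

    `lines` is any iterable of tab-delimited gene_info lines.
    Returns {symbol: set of Gene IDs}; unmatched symbols are left out.
    """
    wanted = set(symbols)
    by_symbol = {}
    by_synonym = {}
    # Step 1: read the file row by row.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip the '#' header line and rows lacking a Synonyms column.
        if len(row) < 5 or row[0].startswith("#"):
            continue
        # Step 2: keep only the organism of interest.
        if row[0] != tax_id:
            continue
        gene_id, symbol, synonyms = row[1], row[2], row[4]
        # Step 3: match against the official Symbol column.
        if symbol in wanted:
            by_symbol.setdefault(symbol, set()).add(gene_id)
        # Collect synonym hits too (assumed '|'-separated).
        for synonym in synonyms.split("|"):
            if synonym in wanted:
                by_synonym.setdefault(synonym, set()).add(gene_id)
    # Steps 4-5: use synonym hits only for symbols with no Symbol match.
    result = dict(by_symbol)
    for name, gene_ids in by_synonym.items():
        if name not in result:
            result[name] = gene_ids
    return result
```

Collecting both kinds of hit in a single pass and resolving the priority afterwards gives the same result as removing matched symbols and re-scanning, while reading the file only once. A symbol that maps to more than one Gene ID through synonyms is returned with all candidates, so ambiguous cases stay visible.
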
>>
>> hope this helps
>>
>> nianhua
>>
>> Nianhua Li
>> Software Developer
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
>


