[BioC] How to get NCBI's gene annotation?

Marc Carlson mcarlson at fhcrc.org
Wed Mar 18 17:50:49 CET 2009


Hi Wei,

The org packages are not meant to provide annotations from a single
source.  They are meant to provide annotations for a single organism,
and we gather and consult many different sources when we build them.
The manual pages have always documented where all the data comes from,
and metadata about the origins of the different tables is also
available in the databases contained within each package.  If we
supplied data from only a single source per annotation package, users
would lose a lot of convenience and each package would hold far less
data.  It would also mean that some simple tasks would require several
packages instead of just one.
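
As a rough illustration of that convenience, the sketch below compares
the two mappings for the genes discussed in this thread.  The helper
extra_chrloc_chromosomes() is hypothetical, not part of org.Mm.eg.db,
and the lists are typed in by hand rather than fetched from the
package.  The CHR value "17" for 21784 is an assumption based on Wei's
description of the NCBI record; the other values are copied from the
output quoted further down.

```r
# Hypothetical helper (not part of org.Mm.eg.db): for each Entrez Gene ID,
# list the chromosomes that appear in the CHRLOC mapping but not in the
# CHR mapping for the same ID.
extra_chrloc_chromosomes <- function(chr, chrloc) {
  ids <- intersect(names(chr), names(chrloc))
  out <- lapply(ids, function(id) {
    loc <- chrloc[[id]]
    # Unmapped genes come back as NA; report nothing for them.
    if (length(loc) == 0 || all(is.na(loc))) return(character(0))
    # CHRLOC values are named by chromosome, so compare the names.
    setdiff(unique(names(loc)), chr[[id]])
  })
  names(out) <- ids
  out
}

# Hand-typed stand-ins for the mget() results shown in this thread.
# Assumption: CHR for 21784 is "17", as Wei reports for the NCBI record.
chr <- list("21784" = "17", "74558" = "7")
chrloc <- list(
  "21784" = c("17" = -31298340, "5" = -143285576),
  "74558" = c("7" = -113043632, "7" = -113300049)
)

extra <- extra_chrloc_chromosomes(chr, chrloc)
print(extra)
```

With org.Mm.eg.db loaded, the same two lists should be obtainable with
mget(ids, org.Mm.egCHR, ifnotfound = NA) and
mget(ids, org.Mm.egCHRLOC, ifnotfound = NA).  For the values above, the
helper flags chromosome 5 for 21784 and nothing for 74558.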


  Marc

 



Wei Shi wrote:
> Hi Marc:
>
>     May I ask why the CHR mapping and the CHRLOC mapping use
> different annotation sources? In my opinion it would be better to use
> a single source. If multiple sources are to be provided, they could
> go into separate packages, or the package could offer an option for
> choosing the source.
>
> Thanks,
> Wei
>
> Marc Carlson wrote:
>> Hi Wei,
>>
>> If you read the manual pages that I mentioned in my reply, you will see
>> that the CHR mapping is always an NCBI annotation and the CHRLOC mapping
>> is always a UCSC annotation.  So it should always be possible to tell
>> what the chromosome assignments are from both sources (and whether or
>> not they agree). 
>>
>> Hope this clarifies things,
>>
>>
>>   Marc
>>
>>
>>
>>
>> Wei Shi wrote:
>>   
>>> Hi Marc:
>>>
>>>     In many cases, the extra annotation provided by UCSC is on the
>>> same chromosome as the NCBI annotation. In these cases, org.Mm.egCHR
>>> cannot tell you whether a given location is from UCSC or from NCBI.
>>> Below is an example:
>>>
>>> > mget("Gvin1", org.Mm.egSYMBOL2EG)
>>> $Gvin1
>>> [1] "74558"
>>>
>>> > mget("74558", org.Mm.egCHR)
>>> $`74558`
>>> [1] "7"
>>>
>>> > mget("74558", org.Mm.egCHRLOC)
>>> $`74558`
>>>          7          7
>>> -113043632 -113300049
>>>
>>>     According to the NCBI Entrez Gene database, Gvin1 is located at
>>> -113300049 on chromosome 7.
>>>
>>> Thanks,
>>> Wei
>>>
>>> Marc Carlson wrote:
>>>     
>>>> Hi Wei,
>>>>
>>>> The exact same package also provides the NCBI chromosome assignments.
>>>> If you use the CHR mapping like this, you will get only NCBI's
>>>> annotation, and you can see how it differs from the one provided by
>>>> UCSC:
>>>>
>>>> mget("21784", org.Mm.egCHR)
>>>>
>>>>
>>>> You can see where the mapping information for each mapping is coming
>>>> from by looking at the man pages:
>>>> ?org.Mm.egCHR
>>>> ?org.Mm.egCHRLOC
>>>>
>>>>
>>>>   Marc
>>>>
>>>>
>>>>
>>>>
>>>> James F. Reid wrote:
>>>>   
>>>>       
>>>>> Hi,
>>>>>
>>>>> the pointer should be for Mouse:
>>>>> ftp://ftp.ncbi.nih.gov/genomes/M_musculus/mapview/seq_gene.md.gz
>>>>> or here I believe
>>>>> ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Mus_musculus.gene_info.gz
>>>>>
>>>>>
>>>>> The org.Mm.eg.db package gives you two locations because it uses
>>>>> UCSC's alignment of the RefSeq(s) of your gene.
>>>>> In this particular case NM_009362 aligns with 100% identity to both
>>>>> chr5:143285577-143289234 and chr17:31298341-31301998.
>>>>> Aligning this sequence by hand with BLAT shows that the chr5 hit
>>>>> appears as of the July 2007 assembly.
>>>>> This kind of information may be worth keeping in mind.
>>>>>
>>>>> Best,
>>>>> J.
>>>>>
>>>>>
>>>>> Sean Davis wrote:
>>>>>     
>>>>>         
>>>>>> On Tue, Mar 17, 2009 at 1:58 AM, Wei Shi <shi at wehi.edu.au> wrote:
>>>>>>
>>>>>>       
>>>>>>           
>>>>>>> Dear list,
>>>>>>>
>>>>>>>  The annotation package "org.Mm.eg.db" provides UCSC's annotation
>>>>>>> for mouse genes. However, this annotation can sometimes differ from
>>>>>>> NCBI's annotation. Below is an example:
>>>>>>>
>>>>>>> library(org.Mm.eg.db)
>>>>>>> mget("Tff1", org.Mm.egSYMBOL2EG)
>>>>>>> $Tff1
>>>>>>> [1] "21784"
>>>>>>> mget("21784", org.Mm.egCHRLOC)
>>>>>>> $`21784`
>>>>>>>       17          5
>>>>>>> -31298340 -143285576
>>>>>>>
>>>>>>>   Two chromosomal locations were found for "Tff1", on chromosome 17
>>>>>>> and chromosome 5 respectively. However, this gene is located only on
>>>>>>> chromosome 17 according to the NCBI Entrez Gene database. Does anybody
>>>>>>> know of any packages or other sources that provide NCBI gene
>>>>>>> annotation? I am working on a large set of genes, and NCBI does not
>>>>>>> seem to provide downloadable files that contain gene information such
>>>>>>> as chromosomal locations.
>>>>>>>
>>>>>>>         
>>>>>>>             
>>>>>> Try here:
>>>>>>
>>>>>>
>>>>>> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/mapview/
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>
>>


