[BioC] Bug in makeOrgPackageFromNCBI from AnnotationForge?

Thu Aug 29 17:58:31 CEST 2013

Hi Marc,

Darn, you're rightŠ there is only 39 entries for S. pombe in the gene_info
fileŠ

$ wget -qO- ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz |gunzip -c |
awk '$1~/^4896$/{print $0}' |wc -l
39

And this is despite the fact that there is 13,211 gene entries associated
with this organism (taken from the taxonomy browser web page)!! Is that a
bug that should be reported to NCBI? The fact that their publicly
available data files don't reflect their internal representations is
disconcerting. I feel there something that should be done to rectify the
problem.

On the other hands, thanks a lot for exploring alternative strategies to
populate org database, let me know if you need me to beta test it before
the next release.

Marco

--  Marco Blanchette, Ph.D.
Stowers Institute for Medical Research
1000 East 50th Street
Kansas City MO 64110
www.stowers.org

On 8/26/13 7:25 PM, "Marc Carlson" <mcarlson at fhcrc.org> wrote:

>Hi Marco,
>
>So the function you are using is downloading this file from NCBI:
>
>ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
>
>And if you grab that file and grep for the lines that start with your
>tax ID you will find about 38 lines in it to indicate unique entrez gene
>IDs.  That means that we currently only have 38 gene IDs to parse from
>NCBI (at least from the files that they are giving us).  Its frustrating
>for me too that fission yeast is not better, but this is what we have.  :(
>
>I am already planning something more general so that when this happens
>you will not be stuck using just one annotation resource for org
>packages, but unfortunately it is not going to be finished tomorrow.
>But hopefully I will have it in time to be in the next release.
>
>
>   Marc
>
>
>On 08/23/2013 07:24 PM, Blanchette, Marco wrote:
>> I am working on a project involving Schizosaccharomyces pombe as a
>>source for genomic analysis and love to use ReportingTools html
>>producing wrappers. However, I am struggling as there is no
>>AnnotationDbi package available for this organism. I decided to finally
>>take the plunge and try to see if I could be one myself using
>>AnnotationForge and was quite exciting to find that I could perhaps melt
>>one simply by using the makeOrgPackageFromNCBI(). Most likely, something
>>went wrong and I suspect a bug somewhere in the pipeline. I have not dug
>>deeper then trying to build the package and use it hoping that someone
>>closer to the code could shed some lights. Here the steps I took:'
>>
>>> library(AnnotationForge)
>>> makeOrgPackageFromNCBI(version = "0.1",
>>                         author = "Marco Blanchette <mab at stowers.org>",
>>                         maintainer = "Marco Blanchette
>><mab at stowers.org>",
>>                         outputDir = ".",
>>                         tax_id = "4896",
>>                         genus = "Schizosaccharomyces",
>>                         species = "pombe")
>>
>> This step succeeded with only a warning:
>>
>> Warning message:
>> In .makeSimpleTable(ug, table = "unigene", con) :
>>    no values found for table unigene in this data chunk.
>>
>> I didn't think this was critical enough to raise any red flag, so I
>>then proceeded with the installation that went smoothly
>>
>>> library(devtools)
>>> install('org.Spombe.eg.db')
>>> library('org.Spombe.eg.db')
>> Then I try to use it with ReportingTools publish() but fail as it
>>returns an error related to Entrez ID which I had a conversion table
>>from biomaRt. I dug a bit deeper and found that none of the genes I was
>>querying were in the database to finally realize that there was only 38
>>entries int the org.Spombe.eg.db database I had just created and
>>installed... Check this out:
>>
>>> keytypes(org.Spombe.eg.db)
>>   [1] "ENTREZID" "ACCNUM"   "ALIAS"    "CHR"      "PMID"     "REFSEQ"
>>   [7] "SYMBOL"   "UNIGENE"  "GENENAME" "GO"       "EVIDENCE" "ONTOLOGY"
>>
>> Looking good! However:
>>
>>> length(keys(org.Spombe.eg.db,'ENTREZID'))
>> [1] 38
>>
>> Can someone close enough to the code shed some lights has to whether
>>there is a bug in AnnotationForge or whether it is the NCBI database
>>that is not conforming to what is expected? For instance, biomart has
>>5117 entrez ID
>>
>>> library(biomaRt)
>>> mart <- useMart("fungi_mart_18","spombe_eg_gene")
>>> ensembl2entrez <- getBM(c('ensembl_gene_id','entrezgene'),mart=mart)
>>> sum(!is.na(ensembl2entrez$entrezgene))
>> [1] 5117
>>
>> The ids I tested on the NCBI website return the correct genes. However,
>>only 10 of the AnnotationForge EntrezID (out of a skirmish 38 ids) are
>>found in biomaRt
>>
>>> sum(keys(org.Spombe.eg.db,'ENTREZID') %in% ensembl2entrez$entrezgene)
>> [1] 10
>>
>> Again, I would appreciate any comments or suggestions as to whether
>>this is a bug or something I did wrong or a miss alignment between the
>>NCBI S. pombe annotation and what is expected by AnnotationForge.
>>
>> Thanks
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor