[BioC] Bug in makeOrgPackageFromNCBI from AnnotationForge?

Tue Aug 27 02:25:18 CEST 2013

Hi Marco,

So the function you are using is downloading this file from NCBI:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

And if you grab that file and grep for the lines that start with your 
tax ID, you will find only about 38 of them, each indicating a unique 
Entrez Gene ID.  That means that we currently only have 38 gene IDs to 
parse from NCBI (at least from the files that they are giving us).  It's 
frustrating for me too that fission yeast is not better covered, but 
this is what we have.  :(
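
For anyone who wants to repeat that check without grep, here is a rough 
sketch in Python.  It assumes gene_info.gz is tab-delimited with the tax ID 
in the first column and the GeneID in the second, and that header lines 
start with "#".  The function names are made up for illustration and are 
not part of AnnotationForge.

```python
import gzip


def count_gene_ids(lines, tax_id):
    """Count unique GeneIDs (column 2) on rows whose tax ID (column 1) matches.

    Assumes tab-delimited rows and '#'-prefixed header/comment lines.
    """
    gene_ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip the header line
        fields = line.rstrip("\n").split("\t")
        # Exact match on the first column, so tax ID 4896 does not match 48960.
        if len(fields) >= 2 and fields[0] == tax_id:
            gene_ids.add(fields[1])
    return len(gene_ids)


def count_gene_ids_in_file(path, tax_id):
    """Run the same count over a local, gzip-compressed copy of gene_info."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_gene_ids(handle, tax_id)
```

With a downloaded copy of the file, count_gene_ids_in_file("gene_info.gz", "4896") 
should give the same number of Entrez Gene IDs that end up in the org package.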

I am already planning something more general so that when this happens 
you will not be stuck using just one annotation resource for org 
packages, but unfortunately it is not going to be finished tomorrow.  
But hopefully I will have it in time to be in the next release.

   Marc

On 08/23/2013 07:24 PM, Blanchette, Marco wrote:
> I am working on a project involving Schizosaccharomyces pombe as a source for genomic analysis and would love to use the ReportingTools HTML-producing wrappers. However, I am struggling, as there is no AnnotationDbi package available for this organism. I decided to finally take the plunge and see if I could build one myself using AnnotationForge, and was quite excited to find that I could perhaps make one simply by using makeOrgPackageFromNCBI(). Something went wrong, though, and I suspect a bug somewhere in the pipeline. I have not dug deeper than trying to build the package and use it, hoping that someone closer to the code could shed some light. Here are the steps I took:
>
>> library(AnnotationForge)
>> makeOrgPackageFromNCBI(version = "0.1",
>                         author = "Marco Blanchette <mab at stowers.org>",
>                         maintainer = "Marco Blanchette <mab at stowers.org>",
>                         outputDir = ".",
>                         tax_id = "4896",
>                         genus = "Schizosaccharomyces",
>                         species = "pombe")
>
> This step succeeded with only a warning:
>
> Warning message:
> In .makeSimpleTable(ug, table = "unigene", con) :
>    no values found for table unigene in this data chunk.
>
> I didn't think this was critical enough to raise a red flag, so I proceeded with the installation, which went smoothly:
>
>> library(devtools)
>> install('org.Spombe.eg.db')
>> library('org.Spombe.eg.db')
>
> Then I tried to use it with ReportingTools publish(), but it failed with an error related to the Entrez IDs, for which I had a conversion table from biomaRt. I dug a bit deeper and found that none of the genes I was querying were in the database, and finally realized that there were only 38 entries in the org.Spombe.eg.db database I had just created and installed... Check this out:
>
>> keytypes(org.Spombe.eg.db)
>   [1] "ENTREZID" "ACCNUM"   "ALIAS"    "CHR"      "PMID"     "REFSEQ"
>   [7] "SYMBOL"   "UNIGENE"  "GENENAME" "GO"       "EVIDENCE" "ONTOLOGY"
>
> Looking good! However:
>
>> length(keys(org.Spombe.eg.db,'ENTREZID'))
> [1] 38
>
> Can someone close enough to the code shed some light as to whether there is a bug in AnnotationForge or whether it is the NCBI database that is not conforming to what is expected? For instance, biomaRt has 5117 Entrez IDs:
>
>> library(biomaRt)
>> mart <- useMart("fungi_mart_18","spombe_eg_gene")
>> ensembl2entrez <- getBM(c('ensembl_gene_id','entrezgene'),mart=mart)
>> sum(!is.na(ensembl2entrez$entrezgene))
> [1] 5117
>
> The IDs I tested on the NCBI website return the correct genes. However, only 10 of the AnnotationForge Entrez IDs (out of a mere 38) are found in biomaRt:
>
>> sum(keys(org.Spombe.eg.db,'ENTREZID') %in% ensembl2entrez$entrezgene)
> [1] 10
>
> Again, I would appreciate any comments or suggestions as to whether this is a bug, something I did wrong, or a misalignment between the NCBI S. pombe annotation and what is expected by AnnotationForge.
>
> Thanks