[BioC] problem with makeOrgPackageFromNCBI (for Chinese hamster)

Mon Aug 26 21:22:25 CEST 2013

Hi Marc,
Sorry for the delayed reply.
Meanwhile I re-ran makeOrgPackageFromNCBI() on our Linux server (after updating all packages to their latest versions; initially it didn't work either...), and now the generation of org.Cgriseus.eg.db went OK. So no complaints about removing files, etc. This must therefore be something specific to Windows 7 (or my machine), and is less of an issue.
However, the QC function gives the same error, but that is something you apparently already identified in AnnotationDbi. BTW, does this also explain the (IMO) switched numbers?
org.Cgriseus.egGO2EG has 25227 mapped keys (of 12124 keys) --> does this actually mean that 12,124 keys [EGIDs] (of 25,227 in total) have a revmap (to GO)?
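
For illustration, the relationship between the two numbers can be sketched on a toy mapping in base R. This is only a sketch under stated assumptions: the IDs are invented, and plain lists stand in for the Bimap objects (on the real package, AnnotationDbi's count.mappedkeys() and length() should report the corresponding counts). The point is merely that the number of mapped keys cannot exceed the total number of keys, in either direction.

```r
# Toy forward map: Entrez Gene ID -> GO IDs (NA = no GO annotation).
# All IDs are invented for illustration only.
eg2go <- list(
  "1001" = c("GO:0000001", "GO:0000002"),
  "1002" = "GO:0000001",
  "1003" = NA_character_,
  "1004" = NA_character_
)

# Forward direction: total keys vs. keys with at least one GO term.
has_go   <- !vapply(eg2go, function(x) all(is.na(x)), logical(1))
n_keys   <- length(eg2go)
n_mapped <- sum(has_go)

# Reverse direction: GO ID -> Entrez Gene IDs, built from the mapped pairs.
mapped <- eg2go[has_go]
go2eg  <- split(rep(names(mapped), lengths(mapped)),
                unlist(mapped, use.names = FALSE))

cat(sprintf("eg2go has %d mapped keys (of %d keys)\n", n_mapped, n_keys))
# eg2go has 2 mapped keys (of 4 keys)
cat(sprintf("go2eg has %d keys\n", length(go2eg)))
# go2eg has 2 keys
```

A line reporting more mapped keys than keys, as in the QC output, therefore looks like a reporting problem rather than a property of the data.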

I will send you the two organism packages that I created under Windows7 and Linux in a separate mail.

Thanks for all your support! 

Guido

Linux output:
Getting data for gene2pubmed.gz
Loading required package: RCurl
Loading required package: bitops
discarding data from other organisms
Populating gene2pubmed table:
<<snip>>
go_cc_all table filled
dropping table gene2pubmeddropping table gene2accessiondropping table gene2refseqdropping table gene2unigenedropping table gene_infodropping table gene2go
Making GO views 

SELECT count(DISTINCT g.gene_id) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.gene_name NOT NULL
SELECT count(DISTINCT g.gene_id) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.symbol NOT NULL
SELECT count(DISTINCT t.symbol) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.symbol NOT NULL
SELECT count(DISTINCT g.gene_id) FROM chromosomes AS t, genes as g WHERE t._id=g._id AND t.chromosome NOT NULL
SELECT count(DISTINCT g.gene_id) FROM refseq AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
SELECT count(DISTINCT t.accession) FROM refseq AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
SELECT count(DISTINCT g.gene_id) FROM unigene AS t, genes as g WHERE t._id=g._id AND t.unigene_id NOT NULL
SELECT count(DISTINCT t.unigene_id) FROM unigene AS t, genes as g WHERE t._id=g._id AND t.unigene_id NOT NULL
SELECT count(DISTINCT g.gene_id) FROM accessions AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
SELECT count(DISTINCT t.accession) FROM accessions AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
SELECT count(DISTINCT g.gene_id) FROM alias AS t, genes as g WHERE t._id=g._id AND t.alias_symbol NOT NULL
table map_counts filled
Creating package in ./org.Cgriseus.eg.db 
[1] TRUE
Warning message:
In .makeSimpleTable(ug, table = "unigene", con) :
  no values found for table unigene in this data chunk.
> 

# QC-ing:
> org.Cgriseus.eg()
Quality control information for org.Cgriseus.eg:

This package has the following mappings:

org.Cgriseus.egALIAS2EG has 25227 mapped keys (of 25227 keys)
org.Cgriseus.egCHR has 25227 mapped keys (of 25227 keys)
org.Cgriseus.egGENENAME has 25227 mapped keys (of 25227 keys)
org.Cgriseus.egGO has 25227 mapped keys (of 25227 keys)
org.Cgriseus.egGO2ALLEGS has 25227 mapped keys (of 16020 keys)
org.Cgriseus.egGO2EG has 25227 mapped keys (of 12124 keys)
org.Cgriseus.egREFSEQ has 25227 mapped keys (of 25227 keys)
Error in get(mapname) : object 'org.Cgriseus.egREFSEQ2EG' not found
>
> 
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base    

other attached packages:
 [1] RCurl_1.95-4.1        bitops_1.0-6          GO.db_2.9.0          
 [4] AnnotationForge_1.2.2 org.Hs.eg.db_2.9.0    RSQLite_0.11.4       
 [7] DBI_0.2-7             AnnotationDbi_1.22.6  Biobase_2.20.1       
[10] BiocGenerics_0.6.0   

loaded via a namespace (and not attached):
[1] IRanges_1.18.3 stats4_3.0.1   tools_3.0.1   
>
-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Marc Carlson
Sent: Friday, August 23, 2013 02:32
To: bioconductor at r-project.org
Subject: Re: [BioC] problem with makeOrgPackageFromNCBI (for Chinese hamster)

Hi Guido,

I have (so far) been unable to reproduce your initial issue here. I have no issues generating this package with either release or devel.  But even though I can't use your package directly myself, I am almost certain that your package is actually just fine, and that the only reason it says FALSE is the 2nd warning (R returns FALSE when you call file.remove() and it can't actually remove something).  The 1st warning just means that you don't have any unigene data (and that's actually good in this case, since there are no unigenes for this critter).  The 2nd warning has to do with R feeling it is not allowed to remove the generated .sqlite file after copying it into the new package directory.  I don't know why that 2nd warning is happening on Windows and I plan to investigate it, but the crucial thing is that this happens AFTER it has already generated the package.
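
The FALSE is easy to reproduce in isolation. The sketch below uses a path that does not exist (an assumption made purely so the example runs anywhere; on Windows the reported reason was 'Permission denied', not a missing file) to show that file.remove() returns FALSE and raises a warning rather than an error, so whatever was done before the call is unaffected.

```r
# tempfile() only returns a name; no file is created, so removal must fail.
# This is a hypothetical stand-in for the .sqlite file left behind on Windows.
dbfile <- tempfile(fileext = ".sqlite")

# Capture the warning so it can be inspected instead of just printed.
warn_msg <- NULL
removed <- withCallingHandlers(
  file.remove(dbfile),
  warning = function(w) {
    warn_msg <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

print(removed)    # FALSE: the file could not be removed
print(warn_msg)   # the text of the accompanying warning
```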

Looking down a bit farther, you did find a problem with the org.Cgriseus.eg() function.  I think that is a real bug (not a serious one, but one I intend to look into shortly).  Basically your package does not have (and should not have) an org.Cgriseus.egREFSEQ2EG mapping, and yet this silly function is trying to ask about it.  But that is not actually a problem within your package, since the offending code lives in AnnotationDbi.
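
The failure mode itself is simple to picture: get() stops with an error as soon as it is asked for an object that does not exist. Below is a minimal sketch of a more forgiving lookup, using made-up stand-in objects in a scratch environment rather than the real maps. It is not the AnnotationDbi code, only an illustration of the idea.

```r
# Scratch environment holding stand-ins for two map objects.
# The values are placeholders, not real annotation data.
maps <- new.env()
assign("org.Cgriseus.egREFSEQ", c("100682525" = "NM_001243976"), envir = maps)
assign("org.Cgriseus.egCHR",    c("100682525" = "placeholder"),  envir = maps)

# Names a QC routine might ask for; the last one is deliberately absent.
wanted <- c("org.Cgriseus.egCHR", "org.Cgriseus.egREFSEQ",
            "org.Cgriseus.egREFSEQ2EG")

# Look up only the names that actually exist, and report the rest.
present      <- vapply(wanted, exists, logical(1), envir = maps, inherits = FALSE)
found        <- mget(wanted[present], envir = maps)
missing_maps <- wanted[!present]

for (nm in names(found)) cat(nm, "has", length(found[[nm]]), "keys\n")
cat("skipped (not present):", missing_maps, "\n")
```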

Now you're correct that your package does have the data that could be used for the org.Cgriseus.egREFSEQ2EG mapping, and that this data is exposed via the select() method.  It is also available via the org.Cgriseus.egREFSEQ mapping.  But it is still not supposed to have that specific reverse mapping (and it also does not need it since you have a revmap() method).  In fact, none of the old mappings are really needed for anything.  We just generated a few of them for the purposes of maintaining some backwards compatibility.  And to answer your other question, the package is actually "made" by just putting the database into the inst/extdata of a very minimalist package template found in AnnotationForge (you can look at it in inst/AnnDbPkg-templates/ORGANISM.DB/ if you want to see it).  The template is altered slightly based on some inputs that are generated from your initial arguments so that the manual pages etc. are all matched to the source material.  So really, the most complicated thing that happens (after the database is made) is actually just generating all the manual pages.
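
On the revmap() point: a reverse lookup is just the forward table read the other way. The base R sketch below rebuilds a REFSEQ-to-Entrez lookup from a few ENTREZID/REFSEQ pairs copied from the select() output quoted further down in this thread. It is a stand-in for what revmap(org.Cgriseus.egREFSEQ) would provide, not the actual AnnotationDbi implementation.

```r
# A few ENTREZID/REFSEQ pairs taken from the select() output in this thread.
fwd <- data.frame(
  ENTREZID = c("100682525", "100682525", "100682526", "100682526"),
  REFSEQ   = c("NM_001243976", "NP_001230905", "NM_001243977", "NP_001230906"),
  stringsAsFactors = FALSE
)

# Forward map: one Entrez Gene ID -> its RefSeq accessions.
eg2refseq <- split(fwd$REFSEQ, fwd$ENTREZID)

# Reverse map: one RefSeq accession -> its Entrez Gene ID(s).
refseq2eg <- split(fwd$ENTREZID, fwd$REFSEQ)

print(eg2refseq[["100682525"]])     # "NM_001243976" "NP_001230905"
print(refseq2eg[["NP_001230906"]])  # "100682526"
```

Because both directions come from the same pairs, nothing extra needs to be stored in the package for the reverse lookup to work.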

If you could send me a tarball for the package that you generated, I would like to look at it and verify that there are not any peculiarities with it compared to the one that I made here.

   Marc

On 08/22/2013 12:33 PM, Hooiveld, Guido wrote:
> Hi Marc and others,
>
> I am using makeOrgPackageFromNCBI() to create an annotation package for Chinese hamster (Cricetulus griseus), but I am experiencing some problems during this process. Please see the code below for details. It could very well be that I am missing something obvious, so any suggestion about what may cause this would be appreciated!
>
> Thanks,
> Guido
>
>
> 1) I am using R on Win7, have admin rights, and also start R through 'Run as administrator'. Why can the file 'org.Cgriseus.eg.sqlite' then not be removed? (Reason 'Permission denied'). Note: I understand this is just a warning but it may be relevant.
>
> 2a) Although no *.db package was produced, I still tried to install the database from the directory in which the files were generated (i.e. D:\\org.Cgriseus.eg.db). This *seemed* to go OK, but when I checked the number of mapped egids it failed at the org.Cgriseus.egREFSEQ mapping...
> 2b) Interestingly, when I manually load the sqlite database (that could not be removed) these org.Cgriseus.egREFSEQ mappings are present! See code at bottom.
> 2c) --> How to make a *.db from an *.sqlite?
>
>
> # Create db0 for Chinese hamster using makeOrgPackageFromNCBI()
>> library(AnnotationForge)
>> makeOrgPackageFromNCBI(
> +           version="0.1",
> +           maintainer="Guido Hooiveld <guido.hooiveld at wur.nl>",
> +           author="Guido Hooiveld <guido.hooiveld at wur.nl>",
> +           outputDir=".",
> +           tax_id=10029,
> +           genus="Cricetulus",
> +           species="griseus")
> Loading required package: GO.db
>
> Getting data for gene2pubmed.gz
> Loading required package: RCurl
> Loading required package: bitops
> discarding data from other organisms
> Populating gene2pubmed table:
> table gene2pubmed filled
> Getting data for gene2accession.gz
> discarding data from other organisms
> Populating gene2accession table:
> table gene2accession filled
> Getting data for gene2refseq.gz
> discarding data from other organisms
> Populating gene2refseq table:
> table gene2refseq filled
> Getting data for gene2unigene
> discarding data from other organisms
> Populating gene2unigene table:
> table gene2unigene filled
> Getting data for gene_info.gz
> discarding data from other organisms
> Populating gene_info table:
> table gene_info filled
> Getting data for gene2go.gz
> discarding data from other organisms
> Populating gene2go table:
> Getting blast2GO data as a substitute for gene2go table metadata 
> filled table map_metadata filled table gene2go filled table metadata 
> filled table map_metadata filled Populating genes table:
> genes table filled
> Populating gene_info_temp table:
> gene_info_temp table filled
> Populating alias table:
> alias table filled
> Populating chromosomes table:
> chromosomes table filled
> Populating pubmed table:
> pubmed table filled
> Populating refseq table:
> refseq table filled
> Populating accessions table:
> accessions table filled
> Populating unigene table:
> Dropping GO IDs that are too new for the current GO.db Dropping GO IDs 
> that are too new for the current GO.db Dropping GO IDs that are too 
> new for the current GO.db Populating go_bp table:
> go_bp table filled
> Populating go_mf table:
> go_mf table filled
> Populating go_cc table:
> go_cc table filled
> Populating go_bp_all table:
> go_bp_all table filled
> Populating go_mf_all table:
> go_mf_all table filled
> Populating go_cc_all table:
> go_cc_all table filled
> dropping table gene2pubmeddropping table gene2accessiondropping table 
> gene2refseqdropping table gene2unigenedropping table gene_infodropping 
> table gene2go Making GO views
>
>
> SELECT count(DISTINCT g.gene_id) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.gene_name NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.symbol NOT NULL
> SELECT count(DISTINCT t.symbol) FROM gene_info AS t, genes as g WHERE t._id=g._id AND t.symbol NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM chromosomes AS t, genes as g WHERE t._id=g._id AND t.chromosome NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM refseq AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
> SELECT count(DISTINCT t.accession) FROM refseq AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM unigene AS t, genes as g WHERE t._id=g._id AND t.unigene_id NOT NULL
> SELECT count(DISTINCT t.unigene_id) FROM unigene AS t, genes as g WHERE t._id=g._id AND t.unigene_id NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM accessions AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
> SELECT count(DISTINCT t.accession) FROM accessions AS t, genes as g WHERE t._id=g._id AND t.accession NOT NULL
> SELECT count(DISTINCT g.gene_id) FROM alias AS t, genes as g WHERE t._id=g._id AND t.alias_symbol NOT NULL
> table map_counts filled
> Creating package in ./org.Cgriseus.eg.db
> [1] FALSE
> Warning messages:
> 1: In .makeSimpleTable(ug, table = "unigene", con) :
>    no values found for table unigene in this data chunk.
> 2: In file.remove(dbfile) :
>    cannot remove file 'org.Cgriseus.eg.sqlite', reason 'Permission denied'
>> # Now manually install files from DIR that has been generated.
>>
>> install.packages(repos=NULL, pkgs="D:\\org.Cgriseus.eg.db", type="source")
> * installing *source* package 'org.Cgriseus.eg.db' ...
> ** R
> ** inst
> ** preparing package for lazy loading
> ** help
> *** installing help indices
> ** building package indices
> ** testing if installed package can be loaded
> *** arch - i386
> *** arch - x64
> * DONE (org.Cgriseus.eg.db)
>> library(org.Cgriseus.eg.db)
>> org.Cgriseus.eg()
> Quality control information for org.Cgriseus.eg:
>
>
> This package has the following mappings:
>
> org.Cgriseus.egALIAS2EG has 25227 mapped keys (of 25227 keys)
> org.Cgriseus.egCHR has 25227 mapped keys (of 25227 keys)
> org.Cgriseus.egGENENAME has 25227 mapped keys (of 25227 keys)
> org.Cgriseus.egGO has 25227 mapped keys (of 25227 keys)
> org.Cgriseus.egGO2ALLEGS has 25227 mapped keys (of 16020 keys)
> org.Cgriseus.egGO2EG has 25227 mapped keys (of 12124 keys)
> org.Cgriseus.egREFSEQ has 25227 mapped keys (of 25227 keys)
> Error in get(mapname) : object 'org.Cgriseus.egREFSEQ2EG' not found
>>
>
>
>> #load sqlite to check that REFSEQ mappings are included
>> CHO.db <- loadDb("org.Cgriseus.eg.sqlite")
>> CHO.db
> OrgDb object:
> | BL2GOSOURCEDATE: Thu Aug 22 18:47:20 2013
> | BL2GOSOURCENAME: blast2GO
> | BL2GOSOURCEURL: http://www.blast2go.de/
> | DBSCHEMAVERSION: 2.1
> | DBSCHEMA: ORGANISM_DB
> | ORGANISM: Cricetulus griseus
> | SPECIES: Cricetulus griseus
> | CENTRALID: EG
> | TAXID: 10029
> | EGSOURCEDATE: Thu Aug 22 18:47:24 2013
> | EGSOURCENAME: Entrez Gene
> | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
> | GOSOURCEDATE: 20130302
> | GOSOURCENAME: Gene Ontology
> | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godata
> | GOEGSOURCEDATE: Thu Aug 22 18:47:24 2013
> | GOEGSOURCENAME: Entrez Gene
> | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
> | Db type: OrgDb
> | Supporting package: AnnotationDbi
>
>> cols(CHO.db)
> [1] "ENTREZID" "ACCNUM"   "ALIAS"    "CHR"      "PMID"     "REFSEQ"
>   [7] "SYMBOL"   "UNIGENE"  "GENENAME" "GO"       "EVIDENCE" "ONTOLOGY"
>> keys <- head( keys(CHO.db))
>> keys
> [1] "100682525" "100682526" "100682527" "100682528" "100682529" "100682530"
>> select(CHO.db, keys=keys, cols = c("SYMBOL","REFSEQ","UNIGENE"))
>      ENTREZID SYMBOL       REFSEQ UNIGENE
> 1  100682525    P53 NM_001243976    <NA>
> 2  100682525    P53 NP_001230905    <NA>
> 3  100682526 Tuba1c NM_001243977    <NA>
> 4  100682526 Tuba1c NP_001230906    <NA>
> 5  100682527 Tuba1a NM_001243978    <NA>
> 6  100682527 Tuba1a NP_001230907    <NA>
> 7  100682528 Tuba1b NM_001243979    <NA>
> 8  100682528 Tuba1b NP_001230908    <NA>
> 9  100682529  Mgat1 NM_001243980    <NA>
> 10 100682529  Mgat1 NP_001230909    <NA>
> 11 100682530   Plec XM_003507629    <NA>
> 12 100682530   Plec XP_003507677    <NA>
> Warning message:
> In .generateExtraRows(tab, keys, jointype) :
>    'select' resulted in 1:many mapping between keys and return rows
>> sessionInfo()
> R version 3.0.1 Patched (2013-06-05 r62877)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] org.Cgriseus.eg.db_0.1 RCurl_1.95-4.1         bitops_1.0-6           GO.db_2.9.0
>   [5] AnnotationForge_1.2.2  org.Hs.eg.db_2.9.0     RSQLite_0.11.4         DBI_0.2-7
>   [9] AnnotationDbi_1.22.6   Biobase_2.20.1         BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] IRanges_1.18.3 stats4_3.0.1   tools_3.0.1
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
