[BioC] Problems in AnnBuilder package function ABPkgBuilder

Asta Laiho astlai at utu.fi
Thu Apr 5 10:40:29 CEST 2007


I have been using the AnnBuilder function ABPkgBuilder to create an annotation package for the Affymetrix rat2302 array.

I compared the package that I created (built on 23 March) to the rat2302 annotation package released on March 15th (Bioc 2.0) and found some differences between them. I am wondering what could cause these differences.

Here is the package information for the rat2302_1.15.13 package and for my own package.

BIOC:

Quality control information for  rat2302
Date built: Created: Thu Mar 15 18:25:07 2007

Number of probes: 31099
Probe number missmatch: None
Probe missmatch: None
Mappings found for probe based rda files:
         rat2302ACCNUM found 31099 of 31099
         rat2302CHRLOC found 12177 of 31099
         rat2302CHR found 23212 of 31099
         rat2302ENTREZID found 23250 of 31099
         rat2302ENZYME found 1916 of 31099
         rat2302GENENAME found 23229 of 31099
         rat2302GO found 14228 of 31099
         rat2302MAP found 22526 of 31099
         rat2302PATH found 4535 of 31099
         rat2302PMID found 14511 of 31099
         rat2302REFSEQ found 23157 of 31099
         rat2302SUMFUNC found 0 of 31099
         rat2302SYMBOL found 23249 of 31099
         rat2302UNIGENE found 22825 of 31099
Mappings found for non-probe based rda files:
         rat2302CHRLENGTHS found 21
         rat2302ENZYME2PROBE found 586
         rat2302GO2ALLPROBES found 7649
         rat2302GO2PROBE found 5695
         rat2302ORGANISM found 1
         rat2302PATH2PROBE found 177
         rat2302PFAM found 18634
         rat2302PMID2PROBE found 24911
         rat2302PROSITE found 13246

My own package:

AnnBuilder_1.13.21
Affy: rat230_2.na22.annot.csv.zip (3/9/07)
GO: Built: 08-Feb-2007

Quality control information for  rat2302Geno
Date built: Created: Fri Mar 23 14:18:24 2007

Number of probes: 31099
Probe number missmatch: None
Probe missmatch: None
Mappings found for probe based rda files:
         rat2302GenoACCNUM found 31099 of 31099
         rat2302GenoCHR found 23101 of 31099
         rat2302GenoENTREZID found 23140 of 31099
         rat2302GenoENZYME found 1913 of 31099
         rat2302GenoGENENAME found 23119 of 31099
         rat2302GenoGO found 18000 of 31099
         rat2302GenoMAP found 22424 of 31099
         rat2302GenoPATH found 4539 of 31099
         rat2302GenoPMID found 14539 of 31099
         rat2302GenoREFSEQ found 23045 of 31099
         rat2302GenoSUMFUNC found 0 of 31099
         rat2302GenoSYMBOL found 23140 of 31099
         rat2302GenoUNIGENE found 22755 of 31099
Mappings found for non-probe based rda files:
         rat2302GenoCHRLENGTHS found 21
         rat2302GenoENZYME2PROBE found 590
         rat2302GenoGO2ALLPROBES found 8313
         rat2302GenoGO2PROBE found 6348
         rat2302GenoORGANISM found 1
         rat2302GenoPATH2PROBE found 177
         rat2302GenoPFAM found 18575
         rat2302GenoPMID2PROBE found 25257
         rat2302GenoPROSITE found 13194

So my package was built about a week after the official package, yet there are some noticeable differences. The Bioc package has 110 more Entrez IDs than my package (23250 vs. 23140 probes mapped). This is surprising since, in my experience, the number of Entrez IDs found should increase over time, not decrease. The Bioc package also contains 55 more unique Entrez IDs than my package.
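As a minimal sketch of how the Entrez IDs present in one package but not the other could be listed, assuming both packages store their probe-to-ID mappings as R environments (as AnnBuilder-built data packages do). The helper name unique_ids and the toy environments are made up for illustration and are not part of AnnBuilder.

```r
# Collect the unique, non-missing IDs stored in a probe -> ID environment.
# unique_ids() is a hypothetical helper, not an AnnBuilder function.
unique_ids <- function(env) {
  ids <- unlist(mget(ls(env), envir = env), use.names = FALSE)
  unique(ids[!is.na(ids)])
}

# Toy stand-ins for the two ENTREZID environments (made-up probes and IDs).
bioc_env <- new.env()
assign("probe_a", 1001L, envir = bioc_env)
assign("probe_b", 1002L, envir = bioc_env)
assign("probe_c", NA_integer_, envir = bioc_env)
assign("probe_d", 1002L, envir = bioc_env)

mine_env <- new.env()
assign("probe_a", 1001L, envir = mine_env)
assign("probe_b", NA_integer_, envir = mine_env)
assign("probe_c", NA_integer_, envir = mine_env)
assign("probe_d", NA_integer_, envir = mine_env)

# IDs that occur in the first environment but nowhere in the second.
only_in_bioc <- setdiff(unique_ids(bioc_env), unique_ids(mine_env))
print(only_in_bioc)
```

With both real packages loaded, the same helper could presumably be applied to rat2302ENTREZID and rat2302GenoENTREZID to see which IDs account for the difference.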

I use the public representative ID from the rat2302 Affymetrix annotation file as the primary source for the mappings, and the Unigene and Entrez ID columns as secondary sources, which I have been told is also how the Bioc annotation package is built.

The most striking difference is in the GO information. Even though the Bioc package HTML info page states that the same release of the GO data (08-Feb-2007) was used to build it, it seems that an older version was actually used. What else could explain why my package contains GO information for almost 4000 more probesets (18000 vs. 14228)? When I last updated my own package in December, it too had GO information for about 4000 fewer probesets.
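For reference, the per-mapping differences can be tabulated directly from the two QC listings above. This is only a sketch in base R with the probe-based counts copied by hand. CHRLOC is left out because my package has no such mapping.

```r
# Probe-based mapping counts copied from the two QC listings above.
bioc <- c(ACCNUM = 31099, CHR = 23212, ENTREZID = 23250, ENZYME = 1916,
          GENENAME = 23229, GO = 14228, MAP = 22526, PATH = 4535,
          PMID = 14511, REFSEQ = 23157, SUMFUNC = 0, SYMBOL = 23249,
          UNIGENE = 22825)
mine <- c(ACCNUM = 31099, CHR = 23101, ENTREZID = 23140, ENZYME = 1913,
          GENENAME = 23119, GO = 18000, MAP = 22424, PATH = 4539,
          PMID = 14539, REFSEQ = 23045, SUMFUNC = 0, SYMBOL = 23140,
          UNIGENE = 22755)

# Positive values: my package maps more probes.
# Negative values: the Bioc package maps more probes.
diffs <- mine[names(bioc)] - bioc
print(diffs[diffs != 0])
```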

My package is also totally missing the CHRLOC information. This, I assume, is because I get the following error message when building the annotation package: 

"Error in loadFromUrl(srcUrl) : URL ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomesrefLink.txt.gz is incorrect or the target site is not responding!"

This file no longer exists on the server (it was already removed last year), and I hope that AnnBuilder can soon be updated accordingly.

Another problem I ran into was in the Unigene data file format. I had to remove the "//" on the last line of the file before AnnBuilder was able to process it.
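The workaround can be sketched in base R roughly as follows. The function name strip_trailing_separator and the file paths are hypothetical. The sketch drops only a final "//" line (and any blank lines after it) and reads the whole file into memory, which may be slow for a full Unigene data file.

```r
# Remove a trailing "//" line from a Unigene-style data file.
# Hypothetical helper, not part of AnnBuilder.
strip_trailing_separator <- function(infile, outfile) {
  lines <- readLines(infile)
  # Position of the last line that is not empty or whitespace-only.
  nonblank <- which(nzchar(trimws(lines)))
  if (length(nonblank) > 0) {
    last <- nonblank[length(nonblank)]
    # Drop the final "//" together with any blank lines that follow it.
    if (trimws(lines[last]) == "//") {
      lines <- lines[seq_len(last - 1)]
    }
  }
  writeLines(lines, outfile)
  invisible(outfile)
}

# Small demonstration on a temporary file with made-up records.
demo_in <- tempfile(fileext = ".data")
demo_out <- tempfile(fileext = ".data")
writeLines(c("ID          Rn.1", "//", "ID          Rn.2", "//"), demo_in)
strip_trailing_separator(demo_in, demo_out)
cleaned <- readLines(demo_out)
print(cleaned)
```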

Two weeks ago (when I built the package) the URL for the KEGG data was still working fine, but this week I noticed that the URL "ftp://ftp.genome.ad.jp/pub/kegg/pathways" had changed to "ftp://ftp.genome.ad.jp/pub/kegg/pathway". Something else in the file structure has changed as well, since fixing just the URL did not help. I hope that this can also be updated in the package soon.

I have also attached the sessionInfo() output below. I also tested creating the same annotation package with R 2.4.0 and the previous version of AnnBuilder, but the resulting package was identical to the one I managed to create now.

Regards,
Asta Laiho

#---------------------------------------------------------------------------------------------
> sessionInfo()
R version 2.5.0 Under development (unstable) (2007-02-11 r40701)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=
en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=
en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=
en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] "tools"     "stats"     "graphics"  "grDevices" "utils"     "datasets"
[7] "methods"   "base"

other attached packages:
 AnnBuilder    annotate         XML     Biobase     rat2302 rat2302Geno
  "1.13.21"    "1.13.6"     "1.6-0"   "1.13.38"   "1.15.13"     "1.0.0"


