[BioC] Problems in AnnBuilder package function ABPkgBuilder

Nianhua Li nli at fhcrc.org
Fri Apr 6 22:58:13 CEST 2007


Hi, Asta,

> So my package is built 7 days after the official packet.

Even though the package was "published" on March 15, the source data
was downloaded from public data resources on Feb 28. So the source
data behind the two packages differ by about 3 weeks.

> I use the public representative id from rat2302 Affymetrix annotation
> file as a primary source for the mappings and Unigene and Entrez id
> columns as secondary sources for the mappings, like I have been told is
> also done when creating the Bioc annotation package.

We used those files in the same way. The annotation file for Rat230_2 on
the Affymetrix site is dated 3/9/07, which was not available when we
created rat2302_1.15.13; ours was dated 11/15/06. I guess this causes
most of the differences in the probeset to Entrez mapping. The other
possible contributor is UniGene. I guess you didn't give the
"organism" argument correctly.

> The most striking difference is in the GO information. Even that it is
> decleared in the Bioc package html info page that the same release of
> the GO information has been used in building it (08-Feb-2007) it seems
> that older version has actually been used.

The gene to GO mapping information is obtained from the Entrez Gene
website. The GO package is only used to get the ontology category (BP,
MF, CC). I just re-created rat2302 with my local mirror of the source
data that I downloaded on Feb 28 and got the same QC result. This time
I am sure my GO version is 1.15.13. I also re-created rat2302 with the
latest source data and GO 1.15.13 and found GO mappings for 18085
probeset IDs. So, the difference is not caused by the GO package.

> What else could be the explanation for that my package contains
> GO information for almost 4000 probesets more?

You normally want to look at the source data first. I downloaded
gene2go.gz from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA and compared it
with the one I downloaded on Feb 28. The old gene2go.gz provides GO
mappings for 11354 rat Entrez Gene IDs; the new file covers 16759 rat
Entrez Gene IDs. 5411 rat Entrez Gene IDs have GO mappings only in the
new file, and 6 have GO mappings only in the old file.

I also checked human Entrez Gene IDs: 17324 IDs are in the new file,
132 of them unique to the new file; 17281 IDs are in the old file, 89
of them unique to the old file.

For mouse Entrez Gene IDs: 19424 IDs are in the new file, 156 of them
unique to the new file; 19354 IDs are in the old file, 89 of them
unique to the old file.
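
A minimal sketch of this kind of snapshot comparison, in Python. It is
not the code used to get the numbers above. It assumes the usual
tab-separated gene2go layout (tax_id, GeneID, GO_ID, ...) and the NCBI
taxonomy ID 10116 for rat; the function names are made up for the
example.

```python
import gzip
from typing import Dict, Iterable, Iterator, Set

RAT_TAX_ID = "10116"  # NCBI taxonomy ID for Rattus norvegicus


def read_gz(path: str) -> Iterator[str]:
    """Yield the text lines of a gzip-compressed file."""
    with gzip.open(path, "rt") as handle:
        yield from handle


def gene_ids_with_go(lines: Iterable[str], tax_id: str) -> Set[str]:
    """Collect the Entrez Gene IDs of one organism that have a GO row."""
    ids: Set[str] = set()
    for line in lines:
        if line.startswith("#"):  # skip the header line
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumed column order: tax_id, GeneID, GO_ID, ...
        if len(fields) >= 3 and fields[0] == tax_id:
            ids.add(fields[1])
    return ids


def compare_snapshots(old_lines: Iterable[str],
                      new_lines: Iterable[str],
                      tax_id: str) -> Dict[str, int]:
    """Count gene IDs per snapshot and those unique to each snapshot."""
    old_ids = gene_ids_with_go(old_lines, tax_id)
    new_ids = gene_ids_with_go(new_lines, tax_id)
    return {
        "old_total": len(old_ids),
        "new_total": len(new_ids),
        "old_only": len(old_ids - new_ids),
        "new_only": len(new_ids - old_ids),
    }
```

With real files, a call such as
compare_snapshots(read_gz("gene2go_old.gz"), read_gz("gene2go_new.gz"),
RAT_TAX_ID) would return the four counts; the file names here are
placeholders. Human and mouse use taxonomy IDs 9606 and 10090.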

> My package is also totally missing the CHRLOC information. This, I
> assume, is because I get the following error message when building the
> annotation package:
>
> "Error in loadFromUrl(srcUrl) : URL
> ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomesrefLink.txt.gz
> is incorrect or the target site is not responding!"

Did you spell "Rattus norvegicus" correctly? The URL should be
something like
ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/<organism_spec_dir>/database/refLink.txt.gz
The missing <organism_spec_dir> part in your error message
suggests that the organism value was wrong.
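
To illustrate why a wrong organism value leads to a broken URL, here is
a small sketch in Python. It is not how AnnBuilder builds the URL
internally; the function name, the lookup set, and the underscore rule
are assumptions for the example. It only shows the idea of checking
the organism name before the organism-specific directory is put into
the path.

```python
BASE_URL = "ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes"

# Hypothetical list of accepted organism names; extend as needed.
KNOWN_ORGANISMS = {"Rattus norvegicus"}


def reflink_url(organism: str) -> str:
    """Build the refLink.txt.gz URL for a known organism name.

    Raises ValueError for an unknown or misspelled name instead of
    silently producing a URL without the organism directory.
    """
    name = organism.strip()
    if name not in KNOWN_ORGANISMS:
        raise ValueError(f"unknown organism: {organism!r}")
    # Assumed rule: the directory is the name with '_' instead of ' '.
    organism_dir = name.replace(" ", "_")
    return f"{BASE_URL}/{organism_dir}/database/refLink.txt.gz"
```

A misspelled name such as "Rattus norvegicos" fails immediately with a
clear message, which is easier to diagnose than a download error on a
malformed URL.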

> This file does not exist on the server anymore (it was removed already
> last year)

The file is in
ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Rattus_norvegicus/database/refLink.txt.gz

It has never been moved in the past two years.

> Another thing that I faced was a problem in the Unigene data file
> format. I had to remove the "//" on the last row of the file before
> AnnBuilder was able to process the file.

Again, this is because the "organism" value was wrong. By the way, this
also affects the probeset to Entrez Gene mapping process.

> Two weeks ago (when I built the package) the url for the KEGG data was
> still working fine, but this week I noticed the url:
> "ftp://ftp.genome.ad.jp/pub/kegg/pathways" had changed to
> "ftp://ftp.genome.ad.jp/pub/kegg/pathway". Something else in the file
> structure has changed as well, since fixing just the url did not help.
> I hope that also this can soon be updated for the package.

Thanks for reporting. It was fixed yesterday. Try AnnBuilder 1.13.23.

I re-created rat2302 using the latest data (but still the old
annotation file from Affymetrix and GO 1.15.13) and compared it with
rat2302 1.15.13. Here are the differences:

    ===  rat2302PATH2PROBE
        The two  rat2302PATH2PROBE  have  176  variables in common.
         1  objects in older version only:  00904 .
         3  objects in newer version only:  00730 04012 05213 .
        Among a random sample of  60 variables that are in common, older version has  0  NAs, newer version has  0  NAs;  40  have identical values in the two packages,
        and  20  have different values:  00602 05210 04520 04540 00740 ... .

    ===  rat2302PMID
        Among a random sample of  60 variables that are in common, older version has  36  NAs, newer version has  35  NAs;  55  have identical values in the two packages,
        and  5  have different values:  1380046_at 1395212_at 1374353_x_at 1387054_at 1369818_at .

    ===  rat2302PMID2PROBE
        The two  rat2302PMID2PROBE  have  24909  variables in common.
         2  objects in older version only:  15448694 9462742 .
         617  objects in newer version only:  10051504 10070498 10188224 10210204 10210776 ... .

    ===  rat2302GO
        Among a random sample of  60 variables that are in common, older version has  32  NAs, newer version has  24  NAs;  29  have identical values in the two packages,
        and  31  have different values:  1398279_at 1372378_at 1387178_a_at 1387032_at 1385140_at ... .

    ===  rat2302PROSITE
        Among a random sample of  60 variables that are in common, older version has  24  NAs, newer version has  23  NAs;  59  have identical values in the two packages,
        and  1  have different values:  1388028_at .

    ===  rat2302GO2PROBE
        The two  rat2302GO2PROBE  have  5686  variables in common.
         9  objects in older version only:  GO:0000126 GO:0001747 GO:0004840 GO:0006384 GO:0007222 ... .
         686  objects in newer version only:  GO:0000003 GO:0000015 GO:0000028 GO:0000156 GO:0000224 ... .
        Among a random sample of  60 variables that are in common, older version has  0  NAs, newer version has  0  NAs;  19  have identical values in the two packages,
        and  41  have different values:  GO:0051056 GO:0043406 GO:0016051 GO:0042554 GO:0007274 ... .

    ===  rat2302ENZYME2PROBE
        The two  rat2302ENZYME2PROBE  have  586  variables in common.
         0  objects in older version only:   .
         7  objects in newer version only:  2.4.1.133 2.4.1.152 2.4.99.7 2.8.1.7 2.8.2.11 ... .
        Among a random sample of  60 variables that are in common, older version has  0  NAs, newer version has  0  NAs;  59  have identical values in the two packages,
        and  1  have different values:  3.1.3.2 .

    ===  rat2302GO2ALLPROBES
        The two  rat2302GO2ALLPROBES  have  7642  variables in common.
         7  objects in older version only:  GO:0000126 GO:0004840 GO:0006384 GO:0007222 GO:0030236 ... .
         687  objects in newer version only:  GO:0000015 GO:0000028 GO:0000156 GO:0000217 GO:0000224 ... .
        Among a random sample of  60 variables that are in common, older version has  0  NAs, newer version has  0  NAs;  17  have identical values in the two packages,
        and  43  have different values:  GO:0005234 GO:0042531 GO:0009262 GO:0008705 GO:0007130 ... .

    ===  rat2302PFAM
        Among a random sample of  60 variables that are in common, older version has  21  NAs, newer version has  21  NAs;  59  have identical values in the two packages,
        and  1  have different values:  1394880_at .

So, there can be a lot of differences in the data even just after 1
month.
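
For readers who want to run a similar check on their own builds, the
report above comes down to set operations on the keys plus a spot check
on a random sample of the shared keys. The following Python sketch
shows the idea; it is not the QC code that produced the output above.
It assumes each environment is available as a dict, that NA is
represented by None, and that two NA values count as identical. The
function name and the result keys are made up for the example.

```python
import random
from typing import Any, Dict


def compare_mappings(old: Dict[str, Any],
                     new: Dict[str, Any],
                     sample_size: int = 60,
                     seed: int = 0) -> Dict[str, Any]:
    """Summarize key overlap and spot-check values on shared keys."""
    old_keys, new_keys = set(old), set(new)
    common = sorted(old_keys & new_keys)

    # Draw a reproducible random sample of the keys in common.
    rng = random.Random(seed)
    sample = rng.sample(common, min(sample_size, len(common)))

    identical = [key for key in sample if old[key] == new[key]]
    return {
        "common": len(common),
        "old_only": sorted(old_keys - new_keys),
        "new_only": sorted(new_keys - old_keys),
        "sampled": len(sample),
        "old_na": sum(old[key] is None for key in sample),
        "new_na": sum(new[key] is None for key in sample),
        "identical": len(identical),
        "different": sorted(set(sample) - set(identical)),
    }
```

Because only a sample of the shared keys is compared, the counts of
identical and different values vary from run to run unless the seed is
fixed.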

Hope this is helpful.

nianhua


